ROOT CAUSE ANALYSIS REPORT
Informatica Customers on APP2 Only
ICS, ICRT, DQ RADAR, B2B
INCIDENT START/END: June 20, 2017 22:35 PDT – June 21, 2017 15:15 PDT
RCA COMPLETED: 6/22/2017
The ICS product hosted on “APP2” became unable to handle customer traffic from secure agents because of a “channel issue due to insufficient memory”. Service was intermittent on Tuesday (June 20th) and escalated into a full outage on Wednesday (June 21st). As a result, services that depend on ICS, including ICRT, DQ Radar and API Gateway, were also unavailable. Customers were unable to process any jobs through Informatica Cloud Services. The root cause of this incident was server memory and swap space exhaustion on the channel server nodes, which disrupted agent-to-host communication and made the service intermittent or unavailable.
ROOT CAUSE ANALYSIS
The memory footprint on the channel server nodes has increased after the recent Informatica Cloud upgrades as part of new functionality, but it remained within the limits of tolerance. An errant backup process then started consuming 50% of the available memory, 90% of CPU time, and swap space, which caused the machine to hang and become unavailable. This resulted in a communication failure between the agents and the host and caused ICS tasks to hang or fail. Real-time messages were also not processed during this disruption.
The initial symptom was observed on Tuesday night (June 20th) at 22:35 PDT, when task and process failures were detected for a few Orgs. Remediation initially addressed the problem, but it reappeared on Wednesday morning (June 21st) at approximately 07:30 PDT and brought down the channel server nodes, making ICS, ICRT and related services unavailable for all customers on APP2.
Note: A Disaster Recovery (DR) process was started the morning of June 21st, but was later stopped once we addressed the issue on the APP2 host.
Once the problem was addressed on the channel server nodes and the nodes were restarted, they operated normally. ICS tasks were manually restarted, or automatically executed, based on schedules.
Applicable to ICRT customers: After an agent restart, ICRT processes operated normally.
Applicable to ICRT outbound messaging customers: The ICRT Salesforce outbound listener was resumed so that it could process incoming messages from Salesforce.
RISK-REDUCTION REMEDIATION ACTIONS TAKEN
Actions that have already been taken to reduce the risk of a future occurrence.
ACTIONS TAKEN AND PLANNED FOR THIS INCIDENT
1. Audit system parameter monitoring and alerts. Swap space monitoring was missing for this channel server. We have now added swap space monitoring to the monitoring/alerting system; it has been tested and is working.
2. Introduce an additional channel server to distribute the load. Adding this server will not impact customers.
3. Re-evaluate the emergency notification and escalation process, including more frequent simulations.
4. An RCA with the vendor is in progress to determine the impact of the backup process, which triggered a spike in swap space utilization. The backup service is optional and has been turned off on the channel server nodes.
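As an illustration of the swap space check described in item 1, the following is a minimal sketch only. It assumes the channel server nodes run Linux and expose `/proc/meminfo`, which this report does not state. The function names and the 80% default threshold are hypothetical and do not reflect the actual monitoring/alerting system.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_percent(fields: Dict[str, int]) -> float:
    """Return swap utilization as a percentage (0.0 if no swap is configured)."""
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


def swap_alert(fields: Dict[str, int], threshold_percent: float = 80.0) -> bool:
    """True when swap utilization reaches the (hypothetical) alert threshold."""
    return swap_used_percent(fields) >= threshold_percent


def read_swap_status(path: str = "/proc/meminfo") -> float:
    """Read the given meminfo file and return current swap utilization."""
    with open(path) as handle:
        return swap_used_percent(parse_meminfo(handle.read()))
```

On a Linux host, `read_swap_status()` returns the current utilization, and a scheduler could call `swap_alert` at a fixed interval to feed the alerting system.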
Long Term Remediation Plan
Roadmap to quicker recovery, to reduce the duration of downtime.
Review the current DR process and adjust criteria for when to start the DR process.
Work with hosting provider and gain full visibility of all processes running on the nodes.
Add monitoring for relevant system parameters and define thresholds that trigger alerts.
Additional emergency communication & notification plans (other than trust site updates).
Enhance the Trust Site to allow subscribing to notifications.
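The threshold definitions mentioned in the plan above could be expressed along these lines. This is a sketch under stated assumptions: the metric name, the warning and critical levels, and the rule of alerting only after several consecutive breaching samples are all hypothetical and are not taken from the actual monitoring configuration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Threshold:
    metric: str                 # hypothetical metric name, e.g. "swap_used_percent"
    warning: float              # level at which a warning is raised
    critical: float             # level at which a critical alert is raised
    sustained_samples: int = 3  # consecutive samples required before alerting


def evaluate(threshold: Threshold, samples: Sequence[float]) -> str:
    """Return 'critical', 'warning', or 'ok' based on the most recent samples.

    An alert level is reported only when every one of the last
    `sustained_samples` readings is at or above that level, so a single
    short spike does not trigger an alert.
    """
    window = max(1, threshold.sustained_samples)
    recent = list(samples[-window:])
    if len(recent) < window:
        return "ok"  # not enough data yet to decide
    if all(value >= threshold.critical for value in recent):
        return "critical"
    if all(value >= threshold.warning for value in recent):
        return "warning"
    return "ok"
```

For example, with `Threshold("swap_used_percent", 60.0, 85.0)`, three consecutive readings of 90, 95 and 99 evaluate to `critical`, while a single spike between normal readings evaluates to `ok`.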