Q1, what happens when a workflow is running while a node is failing:
If the Integration Service is set to run on a particular node, it will try to restart on a backup node (if any are defined for that Integration Service) and continue working there. If no backup node is defined, the workflow fails and the Integration Service removes it from any regular schedule.
If the Integration Service is set to run on a grid, then any process running on a failing node will, as far as technically feasible, be "moved" to another node in the same grid; sessions will continue to run on that other node, and the workflow will continue there as well.
For Q2 the answers are basically the same: if a backup node is defined for an Integration Service running on a node, the workflow will be moved to the backup node(s); if no backup node is defined, the workflow will not execute.
If the Integration Service is defined to run on a grid, it doesn't matter which nodes are available; as long as at least one node is available, the workflow will execute on one of the available nodes.
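The failover rules above can be summarized in a small sketch. This is purely illustrative pseudologic (all names are hypothetical, not Informatica APIs): an Integration Service pinned to a node falls back to its ordered backup nodes, while a grid-based service can use any available grid node.

```python
def pick_execution_node(mode, assigned, backups, grid_nodes, available):
    """Return the node a workflow would run on, or None if it cannot run.

    mode       -- "node" (IS pinned to one node) or "grid"
    assigned   -- primary node, used in "node" mode
    backups    -- ordered backup nodes, used in "node" mode
    grid_nodes -- nodes belonging to the grid, used in "grid" mode
    available  -- set of nodes currently up
    """
    if mode == "node":
        # Primary first, then each backup in order; if none is up,
        # the workflow fails (and drops off its schedule).
        for node in [assigned] + backups:
            if node in available:
                return node
        return None
    # On a grid, any available grid node will do.
    for node in grid_nodes:
        if node in available:
            return node
    return None
```

For example, with primary `n1` down and backup `n2` up, a node-assigned service runs on `n2`; with no backups defined it simply does not run.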
May I ask why you don't want to share infrastructure? High Availability is highly dependent on shared infrastructure for many processes to run smoothly (or to continue as smoothly as possible in case of node failures).
Thanks for your answer.
A few months ago the platform was a little unstable, failing 2 - 3 times per week. Because of this, one of the two teams wants separate infrastructure to avoid failures provoked by processes or anything else related to the other team.
But in case of failure, they are willing to collaborate with the infrastructure assigned to each team.
So, we are considering this architecture:
We have to be sure that, in case one of the nodes fails, the other one will take control of all workflows (running, suspended, or waiting to run). Our principal concern is about workflows that are waiting to run and are assigned to a node that has failed.
Not a good approach in my opinion. Here's how I see the situation.
First regarding the repository service(s):
You cannot have a PowerCenter Repository Service run in HA the same way as an Integration Service. You can define a primary node for the Repository Service and additional backup nodes; as soon as the primary node goes down, the next backup node takes over the Repository Service to keep processing as smooth as possible. So just set up a backup node for the existing Repository Service; that's the best you can do.
Next regarding the integration service(s):
In my experience it makes the most sense to define grid-wide Integration Services only. This way it simply doesn't matter which nodes are available; the domain will take care of restarting the Integration Service(s) on its own, completely transparently to you.
Now comes a very big BUT:
As soon as the current gateway node fails and another gateway node takes over, all client applications need to be reconfigured. Why? Because for each domain you define the gateway node's host name and address in the local domains.infa file as well as in invocations of pmcmd and infacmd.sh / infacmd.bat. There's no way around this.
The only thing you can do (and I honestly cannot tell whether this covers all possible trouble) is to define a virtual host name with its own IP address and have the DNS server resolve this virtual name to the current gateway node.
Moreover, whenever the current gateway node fails, this virtual IP address must be switched over to the next master gateway node (in your case, the other node).
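To make the virtual-name idea concrete, here is a minimal sketch (hypothetical host names, not real Informatica configuration) of why the indirection helps: clients are configured with the stable alias only, so a gateway fail-over changes a single DNS mapping and no client configuration at all.

```python
# Virtual name -> current master gateway node. In reality this mapping
# lives in your DNS server, not in application code.
dns = {"infa-gw.example.com": "node1.example.com"}

def client_target(alias, dns_table):
    """What a client configured with the virtual alias actually reaches."""
    return dns_table[alias]

def fail_over(alias, new_gateway, dns_table):
    """Operations step on gateway failure: repoint the alias.
    No domains.infa file or pmcmd/infacmd invocation needs editing."""
    dns_table[alias] = new_gateway
```

The operational burden then shifts to switching the DNS entry (or virtual IP) promptly on failure, which is exactly the step described above.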
One more thing to keep in mind: you cannot guarantee that any kind of recovery (workflow recovery, session recovery, or otherwise) will always cope with a node failure. There are circumstances where even the fairly stable HA fail-over of the Informatica platform and PowerCenter cannot save all the records that were supposed to be processed: for example, if the caching hard disk fails (e.g. due to a controller breakdown), then you're in trouble. It's as simple as that.
And don't ever assume that won't happen to you because of your SAN setup: in 2006 I witnessed a case where one of the hard disk controllers of a central SAN storage device broke down at the very same moment the local hard disk controller of the AIX machine broke down, meaning we did lose some data. Fortunately the PowerCenter infrastructure was set up so cleverly that only one thing had to be done to remedy the situation: 13 SAP R/3 IDocs had to be re-sent, that was all. Everything else started right away once the AIX machine and the SAN device were online again. But again, that was thanks to a very intelligent setup of the PowerCenter processes, not to technical measures.
In short: you should never rely on technical measures alone to solve stability problems. You have to address them from a process-oriented point of view.