I don't know enough about PWX CDC, so you better move this discussion thread to the PowerExchange forum (not PowerExchange Adapters!), there you will find the CDC experts.
Everything related to building mappings etc. can be discussed in that forum as well, no need to keep two separate copies of this thread in two forums.
Thank you very much, I have moved it to the forum you suggested.
Once-and-only-once delivery of change data to the target is the default assumption of PWX CDC.
However, once you start processing it via PowerCenter, you have to do some things to make PC adhere to that.
(Basically, PWX CDC has zero control over the data once it is pushed into the PC session memory buffers.)
Things to do at the PC design level:
Don't use any active transformations, as those can change the order of, or drop, or insert rows.
If you need child/parent integrity inside a UOW, have all those pipelines in the same mapping.
Also, set the PC custom property FullCBLOSupport, as that tells PC to preserve the order of the rows as received from PWX CDC.
At the PC session properties level:
(Set the custom property mentioned above)
Set "Source Based Commit"
Uncheck "Commit on EOF"
Set "Recovery Strategy"=="Resume from last checkpoint".
Set "Treat Source Rows As"=="Data Driven" (this tells PC to do whatever DTL__CAPXACTION says was done on the source (I=Insert, D=Delete, U=Update)
Thank you, and just a couple of caveats --
1 - For the active transformations - let's not consider that now, as if a row within a transaction gets dropped, we will have to deal with that separately, or have the entire UOW kicked out (preferably).
2 - For child/parent RI, we do not have to deal with that as no FK/PK RI is enforced at target.
3 - We are currently using Source Based Commit (have to with CDC express for Oracle, with application connections for source), we have Unchecked Commit on EOF, have Resume from Last Checkpoint as Recovery Strategy.
4 - As an 'oddity' - our source is unsupported by Informatica, so we are populating Oracle tables from that source by using INSERT only, with a 'Record Status' of I, D, or U. That 'action' inside the INSERT record dictates the Update Strategy. So the DTL_CAPXACTION is always INSERT from source, even though the destination/target action may be an update or insert. So not sure 'Treat Source Rows As' / Data Driven would apply here ??
So in short, we have no concern about the row order from source but we care greatly about how the commits to target are done - we want to commit only complete UOW's, with a rollback if any row/record within that UOW fails. When I look at application connection settings, I see that the Maximum Rows for Commit can cause commits in between UOW boundaries, so we leave that at 0 and want to use only Flush Latency and UOW count, but I just don't understand if this means commit/rollback that UOW at target.
For the sake of maintainability, may I suggest that you change your approach somewhat drastically.
It would be far safer (as of my experience) to simply stage the data from PWX CDC in some DB table.
Then have another mapping source this table and process the data according to your needs.
This enables you to use any PowerCenter mapping feature you may need, including transaction control, filtering, re-ordering of data records, and whatever.
Basically that means: separate the PWX CDC reading (which has its own necessities, as you know better than I do) "physically" from the processing of those records.
The PWX CDC part is a real-time part, while the session to read and process those data can be seen as a "normal" batch-mode session.
Having written that, I now see one problem with this approach, namely the fact that one mapping constantly writes into the staging table while the other session would need to constantly read from it (which simply doesn't work for several reasons).
It depends on your needs regarding "real-time processing" how this problem might be addressed. If you could give us a bit more information about these needs, we may be able to come up with some viable solutions. I just hope that those "needs" are not too harsh, otherwise my idea probably can't be used.
Thanks for your reply, and no problem, glad to provide any information I can, as I'm assuming I am missing a basic understanding of how Powercenter commits to target.
So we have 'staging' tables - and we have a system that feeds (transactionally) those tables. So our system might perform a transaction that updates multiple rows in multiple source (staging) tables wrapped in a transaction, which I'm expecting PWX to encapsulate in a UOW.
Now - those source tables are registered, our PWC connections are CDC application connections, we have our PWX (CDC Express for Oracle) log reader, logger, and our condense/logger files on the PWX node.
Our PWXPC and listener provides change data to our CDC sessions, where we basically just map 1 source row in 1 source table, to 1 target row in 1 target table (we reorder some columns, change some data formats, things like that), then commit to our target.
What we 'want' to happen (if it is not already happening) is for that UOW coming from source to flow through our source qualifier, source, transformation, update strategy, then get committed to target in a single UOW the same as source. We'd like to see a UOW in source of say 10 tables, 100 rows flowing through and our commit to target being that same UOW - 10 tables, 100 rows with a true commit to target - commit on success, rollback on fail.
I doubt I'm expressing in the best and complete way what our configuration is and exactly what we are wanting, but if target receives a partial transaction, even if the remainder of the transaction (UOW) comes in seconds later, a target database discrepancy. We know setting Maximum Rows to Commit to a certain value could cause a commit outside of UOW boundaries, but I'm still unclear on if setting that to 0, and say setting UOW count to 10, or 1, or 100, whatever, and using a certain Flush Latency (on our application connection settings) would force commits on (a) UOW boundaries, and (b) complete UOW's - all or none.
From what I understand about transactional transformations, it 'creates' UOW boundaries based on criteria, and not based on what the source UOW is, unless we can use some of the DTL_CAPX columns in the transactional transformation.
Thanks so much, I hope I've described a little better what my concern is and if it is being handled, or what we can change to make it be handled that way.
I suppose the key question would be: Is your target one which supports commit and rollback?
For example: RDBMS targets do. Flat Files (and similar things like HDFS) do not. Odd things like Netezza, Teradata, SybaseIQ are normally actually micro-batch loads, so they don't really support rollback.
When PowerCenter RT is in use (which is required for PWX CDC), PC sends rollbacks to the target when a session fails, and commits when it processes commits in the change stream.
Whether or not the target fully supports those is outside PC's control.
Please note that if you have multiple tables affected in the same UOW, then all those tables need pipelines in the same mapping. Commits and Rollbacks are at the session level. If you have each table in a separate mapping/session, then you're going to see different behaviour than if they are all in the same mapping/session.
Yes, if you are inserting into Oracle, then all rows are going to come through with DTL__CAPXACTION set to "I", so you can't use DataDriven.
You would need to have mapping logic to interrogate your I/U/D column, using an Update Strategy transformation to set the appropriate action.
I really don't like using UOW Count to control UOW size. I would recommend leaving it unset (0), so that the commits flow through as they were present in the source.
Again, extremely helpful, and so almost there ...
We stage (source) to Oracle, and our target is also Oracle, so no problem with support of commit/rollback.
We have (for logistical reasons which we can change) split our mappings up. So the likelihood of a transaction spanning sessions is high, so I take it if we move all our source to target back to a single mapping / single session this would correct that situation?
Our update strategy does indeed handle a 'Record Status' column we have in source as we always INSERT to source and our Record Status column indicates the true action to take on target.
So just to be clear, our commits to target will occur in complete UOW's, and will happen all/none, commit or rollback. That is Informatica logic, and we don't have to do any overly complicated mapping logic to make that happen.
And (I'm getting hammered on this) by using INSERT at source, yet using Update Strategy to look at that row to see if a record_status column indicates a delete, insert, or update, that it will not affect that row's position in the 'transaction' or UOW - by using that Update Strategy, we have not nullified the UOW processing logic of commits to target. If an INSERT into source is part of a UOW, and the commit to target is actually an UPDATE (based on update strategy) that single row will remain part of that UOW and be committed with all other rows in that same UOW.
This is extremely helpful and I appreciate your patience as I try to absorb this.
As long as all sources are in the same mapping, then the basic answer is "yes".
That said, PC does not guarantee to preserve the order of incoming rows by default.
You can tell PC to do that, by setting the custom property FullCBLOSupport at the session level.
(Full Constraint Based Load Ordering Support is what that stands for.)
Not knowing exactly what you are doing in your mapping, I would recommend that you set this.
KB that talks about FullCBLOSupport:
Note: you can set FullCBLOSupport at the Integration Service level, but I wouldn't normally recommend that.
Now I have a seemingly stupid question which turns out to be important for one potential "safety strategy".
How do you identify from your source records that one UOW is over?
But before going into this, let me explain why I consider the whole approach dangerous.
There can be cases where one record in a UOW is saved to a target table, but before the remaining records can be written (and committed) to the target table, e.g. the hard disk controller crashes; in such a case it cannot be determined safely which records are already stored in the DB and which are not.
Meaning your whole UOW logic within the CDC mapping may fail for reasons which are beyond your control.
Granted, this happens rarely. The last time I've seen this in real life was in 2007, but I know that sometimes very strange things do happen in real life; a former customer of mine had such a strange situation in 2016 where they lost parts of a transaction due to some hardware failure.
What do I want to say here?
You can have EITHER a really safe approach where transactions will never be lost (but that comes with a price), OR you can have all in one mapping with the (not very high) danger that some day something goes really wrong and you have some incomplete transactions to your target.
If you can afford to implement the "really safe" approach, then please first note the price.
The price to pay here is that you will always have a time gap between PWX CDC reading those records and those records being saved to the target tables in one transaction. This gap may - in extreme cases - span one minute.
On the other hand this approach will never lose any parts of any transactions. It may happen that - in case of e.g. such a hardware disaster - one transaction will "lie around", waiting to be processed, and you should adapt my idea such that you can "manually" correct and process those records to create a transaction in your target tables. But you can never lose any partial transactions this way.
If this sounds manageable to you, then here you are:
Separate the CDC logic and the "staging" logic into two separate mappings.
The first mapping is the PWX CDC mapping which simply takes the input records, determines which records belong to one "source transaction" (using some EXP logic and the like), and assigns a unique number per "source transaction" to all those data it gets from CDC (this "transaction number" is increased by 1 for each new "source transaction"); then these records are simply written to some staging table (set up between the PWX CDC source and your current "staging" tables), saving each record to its staging table with a flag of TC_COMMIT_TRANSACTION (this is the safest way to ensure that each record is written to its target DB).
Next you need a mapping which reads the oldest "source transaction" (or a couple of them to speed up things) from those new staging tables, checks whether each "source transaction" is complete (all other ones are either filtered out or written to some "error target" which requires immediate attention), and processes them into the target tables. As long as this mapping has exactly one target load path (please see Designer guide for more details if needed) and all targets receive their "commit flag" (TC_CONTINUE_TRANSACTION or TC_ROLLBACK_BEFORE/-AFTER or TC_COMMIT_BEFORE/-AFTER) from the same "active source", this is 100% safe to use.
(As safe as can be, taking into account that hardware failures may always occur).
Of course this second mapping often reads only part of the new "source transactions".
This means that this mapping needs to be re-run after each successful execution as soon as possible.
In order to achieve this, I would use a batch file or a shell script which invokes this second mapping and - after its successful execution - restarts immediately.
Granted, this is simplified, and I am sure there are some gotchas which I have not taken care of yet.
But I hope the general approach is clear.
And yes, of course even this approach is not 100% foolproof against hardware failures. But it's far less likely to suffer from a hardware failure than what you're currently trying to implement, at least as of my experience.