
    Joiner Transformation Guidelines

    Nicholas Yuricic New Member

      Hi Team,

       

      I'm putting together a series of logical data objects that will require multiple joins; each object will need anywhere between 15 and 30 joins. I'm relatively new to IDQ and recall from my training that it is not advisable to use too many joins in a single object. I'm having trouble remembering the exact guidelines on limiting the use of the Joiner transformation. Generally speaking, how many joins in a logical data object is too many?

       

      I'm aware that Informatica's recommendation is to deal with this in the database, but that option is not open to me right now. I appreciate any advice you guys can give me.

       

      Thank you,

        • 1. Re: Joiner Transformation Guidelines
          Nico Heinze Guru

          Don't use Joiner transformations in the LDO mappings.

          Drag the LDOs into your mapping, then use Joiners inside the mapping to join those sources.

           Most importantly, if at all possible, make sure that the input data are sorted by the join keys; that will help in terms of both performance and stability.

           

           Beyond these general points I can't answer your question; I'm a PowerCenter guy, not an IDQ guy (and the two products work quite differently).

           

          Regards,

          Nico

          • 2. Re: Joiner Transformation Guidelines
            Robert Whelan Guru

             Anything that applies to the Joiner in PowerCenter should be applicable in IDQ. The Joiner Tx in IDQ is 'borrowed' from PowerCenter.

             The UI differs, but I believe the underlying C/C++ code is similar.

            • 3. Re: Joiner Transformation Guidelines
              user126898 Guru

               Let's look at this through a lens.

               

               Why are you using LDOs in the first place?  Are you trying to profile multiple data objects together from different data sources, or are you trying to profile multiple objects from the same data source that have not yet been joined together in the lake/ODS/warehouse?  If yes, then use the Joiner inside the LDO mapping as the requirement dictates.  Now, can you use 30 Joiners?  Sure, if the underlying Data Integration Service has the memory/disk/CPU to handle it.

               

               To Nico's point: if the LDOs are only being used for core data integration, i.e. loading data from A to B, then you should just use mappings, as LDOs are not needed.

               

              Thanks,

              Scott

              • 4. Re: Joiner Transformation Guidelines
                Nicholas Yuricic New Member

                Hi Scott,

                 

                 Yes, all of the tables I need to join reside in the same Oracle database. The fields I need are spread out across different tables of various sizes. A lot of these tables have fewer than 10 columns and only a few thousand rows, so I don't think we will have to worry about performance issues at this point.

                 

                 I know it's generally better to use mappings, but due to the way our servers are set up I am not able to use mappings to write new objects. I am in the process of resolving that.

                 

                 I am going to leave this open for another day in case someone has a differing opinion, but it sounds like you've answered my question.

                • 5. Re: Joiner Transformation Guidelines
                  Nico Heinze Guru

                  Unfortunately I have to warn you (once more).

                   Due to the way the Joiner is built, even a "few thousand rows" can wreak havoc.

                  Just imagine that you have to join 4 tables of 1000 rows each.

                  The very first Joiner only has to cache 1000 * 1000 = 1 million rows. Let's suppose that out of these 1 million rows, there are 3000 matching rows.

                  The second Joiner will have to cache 1000 * 3000 rows. Let's suppose that out of these 3 million rows, there are 9000 matching rows.

                  The third Joiner will have to cache 1000 * 9000 rows.

                   

                  And now please estimate the cache file sizes for all these caches. Even if each record only needs 100 bytes to store it (and please keep in mind that String ports are - with one exception - always stored with their declared size, not their actual size), these caches will need roughly 100 MB, 300 MB, and 900 MB.

                   

                  And now continue this for six more caches.

                   

                  Meaning that the caches will easily eat up your temporary disk space.
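                   To make those numbers concrete, here is a rough Python sketch of the same back-of-envelope arithmetic. The 100-byte record width and the 3000 / 9000 match counts are just the assumed figures from above, not measured Joiner cache sizes:

                   # Rough sketch of the cache arithmetic above (illustration only).
                   ROW_BYTES = 100          # assumed size of one cached record
                   SOURCE_ROWS = 1000       # rows per source table
                   matches = 1000           # rows feeding into the first Joiner

                   for joiner in range(1, 4):
                       cached_rows = SOURCE_ROWS * matches
                       cache_mb = cached_rows * ROW_BYTES / 1_000_000
                       print(f"Joiner {joiner}: {cached_rows:,} cached rows ~ {cache_mb:.0f} MB")
                       matches *= 3         # assume each join step triples the matching rows

                   # Joiner 1: 1,000,000 cached rows ~ 100 MB
                   # Joiner 2: 3,000,000 cached rows ~ 300 MB
                   # Joiner 3: 9,000,000 cached rows ~ 900 MB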

                   

                   There's one way around this problem: if all Joiners are set to Sorted Input and you provide records sorted by the join keys to each Joiner, then the caches will contain only as many records as are needed for one single match. Meaning that you will probably need only a few KB, maybe a few MB, for all those caches together, not more.
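                   A rough way to see why: with both inputs sorted by the join key, the join can run as a streaming merge that only ever holds the rows for the current key value in memory. The Python sketch below is just an illustration of that principle, not Informatica internals:

                   from itertools import groupby
                   from operator import itemgetter

                   def sorted_merge_join(master, detail, key=itemgetter(0)):
                       """Join two inputs that are already sorted by `key`.

                       Only the rows sharing the current key value are held in
                       memory, which is why sorted input keeps the cache tiny.
                       """
                       master_groups = groupby(master, key)
                       detail_groups = groupby(detail, key)
                       m_key, m_rows = next(master_groups, (None, None))
                       d_key, d_rows = next(detail_groups, (None, None))
                       while m_key is not None and d_key is not None:
                           if m_key < d_key:
                               m_key, m_rows = next(master_groups, (None, None))
                           elif m_key > d_key:
                               d_key, d_rows = next(detail_groups, (None, None))
                           else:
                               m_list = list(m_rows)   # one key group, not the whole table
                               for d in d_rows:
                                   for m in m_list:
                                       yield m + d[1:]
                               m_key, m_rows = next(master_groups, (None, None))
                               d_key, d_rows = next(detail_groups, (None, None))

                   # Both inputs sorted by the join key in the first column:
                   customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
                   orders    = [(1, "A"), (2, "B"), (2, "C"), (4, "D")]
                   print(list(sorted_merge_join(customers, orders)))
                   # [(1, 'Alice', 'A'), (2, 'Bob', 'B'), (2, 'Bob', 'C')]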

                   

                  Now the question is how to sort the data. If you have to sort by VARCHAR2 attributes and the Oracle source database sorts according to some localised sorting scheme (a so-called "lexicographic" sort), then you're in trouble; PowerCenter and the Informatica platform (i.e. IDQ and relatives) sort (in most cases) by character code, not by lexicographical sort order.

                   The difference can be explained with one small example from the German language with its "famous" diacritics Ä, Ö, Ü, and ß (as in München, the German name for the city of Munich). If you sort the cities Münster, München, and Munster (yes, there's a Munster without the diacritic) by the standard German sort order, they come out like this:

                  München

                  Munster

                  Münster

                   

                   If, however, you sort the way PowerCenter and IDQ expect, the city names come out like this:

                  Munster

                  München

                  Münster

                   

                   And if you provide the data in the German sort order to a Joiner set to Sorted Input, then the session will fail (because the data are not sorted as the Joiner expects).
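                   You can reproduce the difference in plain Python: sorted() with no key compares Unicode code points (the "binary" order the Joiner expects), while locale.strxfrm applies a linguistic collation. The de_DE.UTF-8 locale name is an assumption and has to be installed on the machine running the snippet:

                   import locale

                   cities = ["Münster", "München", "Munster"]

                   # "Binary" order: plain code point comparison (u < ü, because U+0075 < U+00FC).
                   print(sorted(cities))
                   # ['Munster', 'München', 'Münster']

                   # Linguistic order: German collation treats ü like u, so München sorts first.
                   locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")   # assumes this locale is installed
                   print(sorted(cities, key=locale.strxfrm))
                   # ['München', 'Munster', 'Münster']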

                   

                  PowerCenter offers a few different sort orders, namely Spanish (in two variants), Finnish, Swedish, and Danish. Nothing else. So you simply cannot request PowerCenter to sort data by the standard German order.

                   I dare say that IDQ offers a few sort orders as well, but probably (I don't know IDQ well enough) not the one you need.

                   

                   So in the end you have to provide the data to the Joiners in "binary" sort order (meaning ordered by their Unicode code values, or ISO 8859-1 values, or whatever code page your Integration Service runs in).

                  The safest way to achieve this is using the Sorter transformation, meaning one Sorter per relational database source. The Sorter has the nice feature of storing all data with their actual size, not their declared size. So in most cases the Sorter transformation will need way less cache memory than a Joiner would need.

                   This means you will probably want to provide Sorted Input to the Joiner transformations via one Sorter transformation per source data stream. The Sorter is fast, reliable, and pretty memory-efficient.
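                   As a plain-Python analogy of that setup (one "Sorter" per source feeding a Sorted Input "Joiner"), reusing the sorted_merge_join sketch from above; the UTF-8 encoding here merely stands in for whatever code page your Integration Service runs in:

                   customers = [("München", 1), ("Munster", 2), ("Münster", 3)]
                   orders    = [("Münster", "o-17"), ("Munster", "o-9")]

                   # One "Sorter" per source: binary order = compare the encoded bytes.
                   binary_key = lambda row: row[0].encode("utf-8")
                   customers_sorted = sorted(customers, key=binary_key)
                   orders_sorted    = sorted(orders, key=binary_key)

                   # Sorted Input "Joiner": the streaming merge join sketched earlier.
                   print(list(sorted_merge_join(customers_sorted, orders_sorted, key=binary_key)))
                   # [('Munster', 2, 'o-9'), ('Münster', 3, 'o-17')]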

                   

                  Regards,

                  Nico