11 Replies Latest reply on Jun 20, 2016 4:42 AM by user160783
      • 2. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
        New Member

        Hello Tushar,

         

        I have tried to put together a step-by-step guide for this. Please go through it, build a POC, and let me know if it suits your needs. I have not tested this approach for SCD Type 2, but I have designed and tested a similar approach for SCD Type 1 and it worked. Also check the YouTube link below from Informatica on CDC -

         

        Informatica Big Data Edition: Change Data Capture - YouTube

         

        Prerequisites

        The Master / Hive / Target / Base table should have the following audit columns (a minimal Hive DDL sketch follows this list) –

        • Start Time - Records when this version of the record became effective
        • End Time - Records when this version of the record was closed out
        • Key MD5 [Optional] - Compute this if you want to make the code reusable; the key can later be used for joining the records.
        • Data MD5 - I advise creating this column rather than comparing the data attribute by attribute for change detection.
        • Latest Record Marker / Current Flag - As history is maintained in the same table, it is better to keep an attribute that tells us which record is the current one, i.e. the last record that went through the update -
        e.g. Consider the data in your table as

        Emp_SK   Emp_ID   First Name   Last Name   Start Time   End Time     Last Record Marker
        10001    1001     J            Snith       2006-10-07   2010-10-07
        10002    1001     John         Snith       2010-10-07   2012-10-07
        10003    1001     John         Smith       2012-10-07                Y

         

        • Optionally you can also have Created Time / Last Modified Time, Created By and Last Modified By columns
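
        For illustration, a minimal Hive DDL for such a target table might look like the sketch below. The table and column names are assumptions made up for this example, not a prescribed layout; adjust the types and storage format to your model.

        CREATE TABLE employee_master (
            emp_sk          BIGINT,      -- surrogate key
            emp_id          INT,         -- natural / business key
            first_name      STRING,
            last_name       STRING,
            key_md5         STRING,      -- optional MD5 over the business key
            data_md5        STRING,      -- MD5 over the non-key attributes
            start_time      TIMESTAMP,
            end_time        TIMESTAMP,
            latest_rec_flg  STRING,      -- 'Y' for the current version of the record
            create_time     TIMESTAMP,
            created_by      STRING,
            last_mod_time   TIMESTAMP,
            last_mod_by     STRING
        )
        STORED AS ORC;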

         

        Also note that the approach discussed here is for a full-refresh load type, i.e. the incoming file / detail data is a full refresh of the master.

         

        Step - 1 Import the Source File (Detail) and the Base / Target / Hive Table (Master) into your mapping

        In this step we refer to the imported file as the Source / Detail and to the Hive table as the Target / Master in the mapping. Please make sure the file does not need any dedupe operation; if it does, perform the dedupe before proceeding.

         

        Step - 2 Compute the Key_MD5 and Data_MD5

        Compute the MD5 values for the keys and for the data columns coming from the file. Use an Expression transformation for this.
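
        In HiveQL terms, the two hashes amount to something like the following. This is only to illustrate what the Expression transformation computes; the staging table and column names are assumed from the example above, and it relies on the md5() UDF that recent Hive versions ship. Keep the column order and delimiter consistent so the hashes stay comparable across runs.

        SELECT
            md5(CAST(emp_id AS STRING))                 AS key_md5,   -- hash of the business key
            md5(CONCAT_WS('|', first_name, last_name))  AS data_md5   -- hash of the non-key attributes
        FROM employee_stage;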

         

        Step - 3 Join Master and Detail based on the Primary Keys

        If you computed the Key MD5 in the prior step, use key_md5 to join with the data coming from your master / Hive table; otherwise join on the table's primary key or composite key. Perform a full outer join here. In the Joiner transformation, connect the attributes coming from the file to the detail ports and the attributes coming from the Hive table to the master ports, and rename the ports by prefixing MSTR_ to the master (target Hive table) ports and DTL_ to the detail (file) ports.
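
        Logically, this join is equivalent to a HiveQL full outer join along the lines of the sketch below. It is only an illustration of what the Joiner does (in the mapping this happens inside the Joiner transformation, not as a query), and the table names are assumptions carried over from the earlier examples.

        SELECT
            m.key_md5         AS mstr_key_md5,
            m.data_md5        AS mstr_data_md5,
            m.latest_rec_flg  AS mstr_latest_data_flg,
            d.key_md5         AS dtl_key_md5,
            d.data_md5        AS dtl_data_md5
            -- ... plus the remaining master and detail attributes
        FROM employee_master m               -- master / target Hive table
        FULL OUTER JOIN employee_stage d     -- detail / file data
            ON m.key_md5 = d.key_md5;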

         

        Step - 4 Compute the CRUD ( Create / Read / Update / Delete) and other ports relevant to SCD - Type - 2

        Pull in all the ports from the master and the detail, create a variable port (V_CRUD) and derive it using the logic below.

        The expression should be –

         

        CRUD Expression

         

        IIF( ISNULL(mstr_key_md5),
             'CREATE',
             IIF( NOT ISNULL(mstr_key_md5) AND NOT ISNULL(dtl_key_md5)
                  AND mstr_data_md5 <> dtl_data_md5 AND mstr_latest_data_flg = 'Y',
                  'UPDATE',
                  IIF( NOT ISNULL(mstr_key_md5) AND ISNULL(dtl_key_md5)
                       AND ISNULL(mstr_end_time) AND mstr_latest_data_flg = 'Y',
                       'DELETE',
                       'READ' ) ) )

         

        CRUD Description

        • If there is no record from Hive (master), only the file (detail) provided the data, so the record should be marked CREATE.
        • If the record is received from both Hive (master) and the file (detail) and the Data MD5 does not match for the latest master record, the record should be marked UPDATE.
        • If the master key is not null but the detail key is null, and the master record was not end-timed before (it is still flagged as the latest record), the record should be marked DELETE.
        • All other records should be marked READ.

        Generate one more variable port, V_TIME, with the expression SYSTIMESTAMP().

         

        Additionally generate the following output ports -

         

        O_CREATE_TIME => IIF(V_CRUD = 'CREATE',V_TIME,mstr_create_time)

        O_CREATED_BY => IIF(V_CRUD = 'CREATE',V_CREATED_BY,mstr_created_by)

        O_LAST_MOD_TIME => IIF(V_CRUD = 'UPDATE' OR V_CRUD = 'DELETE',V_TIME,IIF(ISNULL(mstr_last_mod_time),V_TIME,mstr_last_mod_time))

        O_LAST_MOD_BY => IIF(V_CRUD = 'UPDATE' OR V_CRUD = 'DELETE',V_MODIFIED_BY,IIF(ISNULL(mstr_last_mod_by),V_CREATED_BY,mstr_last_mod_by))

        O_START_TIME => IIF(V_CRUD = 'CREATE' OR (V_CRUD = 'UPDATE' AND mstr_latest_data_flg = 'Y'), V_TIME, mstr_start_time)

        O_END_TIME => IIF(V_CRUD = 'DELETE' OR (V_CRUD = 'UPDATE' AND mstr_latest_data_flg = 'Y'), V_TIME, mstr_end_time)

         

        We have created two more variables here, V_CREATED_BY and V_MODIFIED_BY; you may default their values to the user executing the mapping. Also make sure to create an output port (O_CRUD) for the CRUD value you derived, as the router groups below reference it.

         

        Step - 5 Router and Expression Combo to Segregate the ports

        Connect all the Master and Detail Ports and Create the following groups in the router transformation -

         

        DEFAULT_DETAIL_CREATE => O_CRUD = 'CREATE'

        DEFAULT_MASTER_DETAIL_UPDATE => O_CRUD = 'UPDATE'

        DEFAULT_MASTER_READ_DELETE => O_CRUD = 'DELETE' OR O_CRUD = 'READ'

         

        Expression for DEFAULT_DETAIL_CREATE

        Create an Expression transformation on the output of the DEFAULT_DETAIL_CREATE group of the Router. Connect only the ports coming from the detail / file, plus the ports for CREATE_TIME, CREATED_BY, LAST_MOD_TIME, LAST_MOD_BY, START_TIME, END_TIME, Data MD5 and Key MD5, and default the Latest Record Marker to 'Y'.

         

        Expression DEFAULT_MASTER_DETAIL_UPDATE

        Create two Expression transformations on the output of the DEFAULT_MASTER_DETAIL_UPDATE group of the Router.

         

        • To the first Expression transformation connect all the ports from the detail, plus the ports for CREATE_TIME, CREATED_BY, LAST_MOD_TIME, LAST_MOD_BY, START_TIME, END_TIME, Data MD5 and Key MD5, and default the Latest Record Marker to 'Y'. This is the same as in the prior step.
        • To the second Expression transformation connect all the ports from the master, plus the ports for CREATE_TIME, CREATED_BY, LAST_MOD_TIME, LAST_MOD_BY, START_TIME, Data MD5 and Key MD5; default END_TIME to SYSTIMESTAMP() and the Latest Record Marker to NULL.

        Expression DEFAULT_MASTER_READ_DELETE

        Create an Expression transformation on the output of the DEFAULT_MASTER_READ_DELETE group of the Router. Connect only the ports coming from the master / Hive table, plus the ports for CREATE_TIME, CREATED_BY, LAST_MOD_TIME, LAST_MOD_BY, START_TIME, END_TIME, Data MD5, Key MD5 and the Latest Record Marker.

         

        • In this expression transformation, if the CRUD port value is 'DELETE' then set LAST_MOD_TIME to SYSTIMESTAMP(), else retain the existing LAST_MOD_TIME value.
        • If the CRUD port value is 'DELETE' then set LAST_MOD_BY to the user name executing the mapping, else retain the existing LAST_MOD_BY value.
        • If the CRUD port value is 'DELETE' then set END_TIME to SYSTIMESTAMP(), else retain the existing END_TIME value.

         

        At the end of Step - 5 you have four expression transformations in total.

         

        Step - 6 Perform the Union

        Perform a Union of the data from all four expression transformations and connect the output to the target Hive table. Make sure the load type for the target is set to Truncate and Load, and if the table is partitioned, select the option to truncate the partition too. Also make sure that the Validation Environment you select is Hive, and specify the connection details for the Hive environment.
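
        On the Hive side, truncate-and-load effectively rewrites the whole table (or partition) with the unioned output, conceptually similar to the sketch below. The table names are assumptions for illustration only; the actual HiveQL is generated by the Data Integration Service.

        -- overwrite the master with the unioned CREATE / UPDATE / READ / DELETE rows
        INSERT OVERWRITE TABLE employee_master
        SELECT * FROM scd2_union_output;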

         

        Your mapping should look like the one below -

         

        Upsert_SCD_Type_2_001.png

         

        S = Source / File / Detail (Step - 1)

        T = Target / Base / Master / Hive Table (Step - 1)

        E1 = Expression Transformation (Step - 2)

        J = Joiner (Step - 3)

        E2 = Expression to derive CRUD (Step - 4)

        R = Router (Step - 5)

        E3...E6 = Expression Transformation (Step - 5)

        U = Union Transformation (Step - 6)

        T = Target Hive Table (Step - 6)

        • 3. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
          New Member

          Hi Shaounak,

           

          I need more clarification on the above logic, so could you please share your number or WhatsApp me on my number below?

           

          Tushar +91-9860337350

           

          Thanks,

          Tushar

          • 5. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
            New Member

            Thanks Shounak for the reply.

             

            I have loaded the data into the staging table as you described and then created two mappings, i.e. one for insert and one for update, in the Informatica Developer tool. I have also partitioned the target table by the flag / latest record marker.

             

            Now I am facing one more challenge, as below.

            I have implemented the rest of the logic as well; only the surrogate key generation is pending.

             

            Please let me know how to generate a surrogate key in Hive using Informatica Developer in Hive mode.

             

            As my source is a flat file and the target is a Hive database, please let me know if you know of any workaround for sequence generation.

             

            Thanks in advance.


            Regards,

            Tushar

            • 6. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
              Guru

              Hi Tushar,

               

              Please refer to the link below, which explains a workaround for the Sequence Generator.

               

              http://www.folkstalk.com/2012/05/avoiding-sequence-generator.html

               

              Hope this helps.

               

              Thanks and Regards,

              Sindhu

              • 7. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
                New Member

                Hi Sindhu,

                 

                We are not able to generate the sequence using the above method. Please see the error below for reference.

                 

                "[Hive Validation] The transformation [Expression] contains stateful variable ports. Stateful variable ports are not valid in the Hive environment. See the following port details: [The port [v_seq] has an expression [IIF(v_seq>0,v_seq+1,1)] that refers to the value of port [v_seq] from the previous row. The stateful variable port [v_seq] is not valid in the Hive environment.]."

                 

                Thanks,

                Tushar

                • 8. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
                  Brian Shepherd Seasoned Veteran

                  Tushar/Sindhu,

                   

                  Using a stateful variable ("v_seq") is not supported in Hadoop pushdown mode, at least with MapReduce or Tez. It should work if you are using the Blaze execution engine. The reason is how these execution engines work: since the data being processed can be split across multiple mappers, the value of v_seq would need to be tracked across those mappers, and that just isn't possible because they are designed to run independently of each other.

                  • 9. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
                    Active Member

                    I don't think Blaze supports stateful variable ports either. I see the following in the 10.0 release guide, which covers the Blaze engine:

                     

                    A transformation that contains a stateful variable port is not valid in a Hadoop environment.

                     

                    Tushar,

                    we can use a Java transformation to generate a random number; it won't be sequential, but it will be unique.

                    • 10. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
                      Seasoned Veteran

                      Does your key have to be a number? If not, you can use the UUID and UUParse functions to generate unique keys.

                      • 11. Re: How to implement SCD2 using Informatica and hive(Hadoop). Please share detail level information.
                        New Member

                        You can use the Row Number function to replace the Sequence Generator. This is what I am doing currently for my project.

                        We created a temporary table which contains the mapping name, the target name and the max sequence ID.

                        We generate a row number and fetch the max sequence ID in the source query. In an expression, we add the max sequence ID and the row number so that the sequence continues from where it left off.

                        The temporary table needs to be updated with the new max sequence ID after the full load completes.
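
                        A minimal HiveQL sketch of this idea is below. The staging and control table names (employee_stage, seq_control) and their columns are made up for illustration; in the actual mapping the offset comes from the temporary table and the addition is done in an Expression transformation.

                        SELECT
                            ctl.max_seq_id + ROW_NUMBER() OVER (ORDER BY s.emp_id) AS emp_sk,
                            s.*
                        FROM employee_stage s
                        CROSS JOIN (SELECT max_seq_id
                                    FROM seq_control
                                    WHERE target_name = 'employee_master') ctl;

                        After the load, write the new maximum back to seq_control so the next run continues the sequence from there.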