13 Replies Latest reply on Apr 1, 2020 6:03 AM by Nico Heinze

    Auditing Duplicate filenames using Unix commands and workflow variables

    Tj G Seasoned Veteran

I am planning to implement an Audit log table like the one below.

       

I have assigned the workflow parameter files for workflow name, start time, and end time, but I need to somehow assign a shell script or Unix command to a variable parameter called 'duplicate file name' in order to generate all the duplicate files in the target table.

[screenshot: audit log table]

      Below is my mapping structure

[screenshot: mapping structure]

      Below is my workflow implementation

[screenshot: workflow implementation]

I want to implement the assignment operator with a single Unix command, something like this:

[screenshot: desired variable assignment]

Please guide me to the correct solution.

       

@Nico Heinze - I would really appreciate your input.

        • 1. Re: Auditing Duplicate filenames using Unix commands and workflow variables
          user126898 Guru

Are you trying to assign the duplicate file name to the name of the shell script/cmd? Or to the value that it returns?

           

1) If it's the name of the file, you can just use a parameter file and inherit the name from there.

           

2) If you want to assign from the output, you could insert a Command task in the workflow to execute the script or cmd, and have that script/cmd write its output into the parameter file, which the downstream sessions then inherit to get the duplicate file name.
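
A minimal sketch of what such a script could write, assuming a hypothetical parameter file path, folder/workflow names, and a $$DuplicateFileName parameter (all placeholders, not taken from this thread):

    #!/bin/sh
    # Hypothetical sketch: collect duplicate file names and publish them
    # as a workflow parameter. Path, folder, workflow, and parameter
    # names are assumptions.
    PARAM_FILE=/infa/params/wf_audit.param
    {
        echo "[MyFolder.WF:wf_audit]"
        # one comma-separated value holding every duplicate name found
        echo "\$\$DuplicateFileName=$(ls -1 /path/*.xlsx | paste -sd, -)"
    } > "$PARAM_FILE"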

           

          thanks,

          Scott

          • 2. Re: Auditing Duplicate filenames using Unix commands and workflow variables
            Tj G Seasoned Veteran

            Hey Scott,

             

Right now, I am trying to assign the value of an output (duplicate filenames) produced with the 'ls' command:

             

ls /path/*.xlsx, and assign the output to a variable name.

             

So, I am following your 2nd solution. Please guide me if I am following it correctly.

             

I have assigned a variable name to a command output.

[screenshot: variable assignment]

But I am stuck at the assignment operator. Could you help me with that?

             

            • 3. Re: Auditing Duplicate filenames using Unix commands and workflow variables
              Nico Heinze Guru

              Hi Tj,

               

Honestly, I don't really understand what you're trying to achieve, so I'll repeat in my own words what I understood so far:

               

First, you want to audit workflow executions.

Second, you want to audit workflow executions with a file name you have already processed at some time in the past.

               

              Is that correct?

              If not, could you please explain some sample executions of this "framework"? In plain text.

               

              Thanks,

              Nico

              • 4. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                Tj G Seasoned Veteran

                Good Morning Nico,

                 

1) In this scenario, I am trying to audit only the duplicate filenames (wf start time, wf end time, duplicate file name), as in the screenshot below, not the entire workflow executions (like SRC_Success_Rows, SRC_Failed_Rows, TGT_Success_Rows, TGT_Failed_Rows).

[screenshot: duplicate-file audit table]

2) Yes, I have processed the filenames in the past, in such a way that whatever duplicate filenames are found will show up in 'ls /path/*.xlsx'. Hence I am trying to assign a Unix command to a variable parameter.

                • 5. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                  Nico Heinze Guru

I don't know of any modern file system which allows two files to have exactly the same name. So how do you recognise - from just looking at the directory - that a file name has already been processed in the past? That can't be achieved solely with OS methods; you will need some additional logic.

                  For example, maybe you have some archive directory where all processed files are moved to after processing. In this case you can "source" the list of files already processed (flat file source, Input Type = Command, command = "ls -1 /path/*.xlsx"), sort it, and then join with the newly arrived file names (extracted in the same way but in a different path).

                  If you don't have such an archive subdirectory, then you will need to source the existing file names (as described above) and the file names from the audit table, sort both streams (for better join performance), and join them using a Full Outer Join. If this Joiner indicates that a new file has the same name as a file already processed, then... well, you have to decide whether you want to process this file again or simply remove it or whatever.
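
For reference, the same comparison can be sketched at shell level, assuming hypothetical /archive and /incoming directories (process substitution requires bash):

    # Bash sketch (paths are assumptions): file names present in BOTH
    # directories, i.e. newly arrived files whose names were already processed.
    comm -12 <(ls -1 /archive/*.xlsx | xargs -n1 basename | sort) \
             <(ls -1 /incoming/*.xlsx | xargs -n1 basename | sort)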

                   

                  And that's the big question: finding out that a file name has already appeared in the past is easy. But what then? What shall happen to this file? Shall the "worker workflow" be invoked with this file? Shall the file be skipped? Deleted?

                   

                  I hope you get what I mean: technical details for one or the other point are one thing. The overall process and process design is a completely different story, that's what I'm trying to point out. As long as I have the impression that the overall process might not be clearly defined, I'll keep on asking questions. You know me.

                   

                  Regards,

                  Nico

                  • 6. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                    Tj G Seasoned Veteran

                    Hello Nico,

                     

I have taken care of how to find the duplicate filenames and what to do with them (I need to wait for further approval from the business to either delete or overwrite them); until then, the only task is to send the details to the audit table, like below.

                     

This task is only to audit the files in the path (ls /path/*.xlsx), because the processed files have already been moved to another path and taken care of by the ETL.

[screenshot: audit table]

                    • 7. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                      Nico Heinze Guru

                      Sorry, but if the processed files are already moved somewhere else, how do you want to identify duplicate files? From the audit table alone?

                       

                      Just want to make sure we're talking about the same things.

                       

If yes, then you simply have to check each and every file name in the source path to see whether it already exists in the audit table; if not, write its name to a file list (naming the files to be processed); otherwise skip this file name (or save it for some other later use, such as deleting the file).

Then the actual work session has to source the file list listing the files which have not been delivered twice so far.
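
A shell sketch of that check, assuming the audit-table names have been exported to a text file beforehand (all file names here are hypothetical):

    # processed.txt: names exported from the audit table (an assumption)
    sort processed.txt > /tmp/processed.sorted
    ls -1 /path/*.xlsx | xargs -n1 basename | sort > /tmp/incoming.sorted
    # comm -13 keeps only names unique to the incoming list
    comm -13 /tmp/processed.sorted /tmp/incoming.sorted > /path/filelist.txt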

                       

                      Regards,

                      Nico

                      • 8. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                        Tj G Seasoned Veteran

                        Hello Nico,

                         

I am not identifying them through the audit table. There is a pre-session script running to find that out.

                         

OK, let me explain the whole thing I am trying to execute here. There are files inside the archive folder already. Now I am trying to run an ETL which downloads new files from the FTP site; the downloaded filenames are compared to the existing filenames inside the archive folder (using a pre-session script to download the files and identify the duplicate filenames among the new files and the files inside the archive folder). Whichever filenames are new and unique proceed through the ETL and are appended to the target table, and the processed files are then moved into the archive folder.

                         

Meanwhile, a duplicate filename like Airtimeacts_2020-02008.xlsx or some .csv file is left behind with no ETL processing, awaiting business approval to overwrite or delete it. But in the meantime we need to audit all the duplicate files found (in this case Airtimeacts_2020-02008.xlsx) in an audit table created specifically for duplicate files only.


This is the task I am trying to achieve, Nico. I hope I am clear.

                        • 9. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                          Nico Heinze Guru

                          OK, now I (think I) understand. Thanks for the explanation.

                           

My suggestion is to completely separate the two distinct part processes: namely, the identification of duplicate files and the preparation of the "clean" file names on one side, and the actual processing of those "clean" files on the other.

                          The first part (identifying duplicate files) is already implemented, if I understood you correctly. You should simply insert some process (e.g. a small mapping) which somehow gets informed of all the duplicate file names (which could be delivered, for example, as a list file) and writes these file names to the audit table.

                          For all other files you can either prepare a file list so that the actual "worker" workflow can iterate through this file list, or you can prepare a shell script as a target file in the first part process: this shell script invokes the "worker" workflow once for each input file which shall be processed.
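
As an illustration of the second option, the generated shell script could look roughly like this (service, domain, folder, and workflow names are placeholders, and the pmcmd authentication options are omitted):

    #!/bin/sh
    # Hypothetical driver script: start the worker workflow once per clean file.
    while read -r FNAME; do
        pmcmd startworkflow -sv MyIntSvc -d MyDomain -f MyFolder \
              -paramfile "/infa/params/worker_${FNAME}.param" -wait wf_worker
    done < /path/filelist.txt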

                           

                          Does that make sense to you?

                           

                          Regards,

                          Nico

                          • 10. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                            Tj G Seasoned Veteran

                            Yes, you understood me correctly.

                             

So you suggest I use a small mapping that somehow gets informed of all the duplicate filenames. Yes, that's the challenge here, Nico. Is there a possibility of implementing a Unix command (ls /path/*.xlsx && ls /path/*.csv) in the mapping below?

[screenshot: mapping]

The command would be ls /path/*.xlsx && ls /path/*.csv, directed to 'Duplicate_file_column'.

                             

                            Hope you understand what I am trying to do here.

                            • 11. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                              Nico Heinze Guru

Use a flat file source definition with one string field only; set the delimiter character to \037 (yes, enter these four characters: a backslash followed by the digits 0, 3, and 7). This will set the delimiter character to the ASCII control character ASCII 31 = Unit Separator. I haven't seen this character in use at least since the early 1980s, so the danger of encountering it in real text files is - to say the least - extremely low.

Now in the session set the Input Type to Command instead of File, then press the Tab key. The fields below change, and one of the new entry fields is named Command. Here you can enter something like this:

                                  ls -1 /path/*.xlsx

                              This will make sure that the file names (and only the file names) are retrieved by the "ls" command and are available in the mapping as the one field from the Source Qualifier.
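
If both the Excel and CSV names need to be captured, as discussed earlier in the thread, the same Command field should accept a combined listing along these lines (an untested sketch; the 2>/dev/null merely hides the error when one pattern matches nothing):

    ls -1 /path/*.xlsx /path/*.csv 2>/dev/null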

                               

                              Regards,

                              Nico

                              • 12. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                                Tj G Seasoned Veteran

That worked like a charm. As usual, you are a savior!

                                • 13. Re: Auditing Duplicate filenames using Unix commands and workflow variables
                                  Nico Heinze Guru

                                  It's a pleasure for me to help when I can. Thanks a million for the kudos.

                                   

                                  Cheers,

                                  Nico