4 Replies Latest reply on Jul 19, 2021 8:19 AM by Nico Heinze

    Reg_Extract Confusion

    Jacob Sørensen New Member



      I have been playing around with the reg_extract expression in my mapping. The expression is confusing beyond comprehension. My task is to grap a sequence of numbers from a string.


      Example string: AABB1234568AABC*AB


      Normally I would just use something like \d+ to grap all numbers - however when I use such simple regular expressions in powercenter reg_extract it just returns a null value.


      The more I have been trying to solve this issue the more I am beginning to be convinced that reg_extract wants your regular expression to capture ALL of the text in the string - and you can divide this string into as many subPatterns as you want - just as long as you use ALL of the string.


      So, in order to capture the example string above i would need to use (.+)(\d{7})(.+).



      However the strings that will run through my workflow will be a complete random mess. So I will not be able to capture all of the string for each row that passes through. I could understand you could use reg_extract to get first name and last name of a field if this field had the same string format for all rows (e.g. "First-name Last-name").


      Is there some way to handle strings which changes content/format from row to row? (e.g. some will have numbers, some will not, some will have numbers in the end and visa versa).


      Thanks for your time

        • 1. Re: Reg_Extract Confusion
          JanLeendert Wijkhuijs Guru



          Given your last sentence I think you read the manual carefully.
          Basically you have a set of numbers (\d+)\w+ followed by alphanumerical characters.

          Is there always one set of numerical characters in your string or can there be multiple sets?

          If there is only one set of numerical characters then you can go for a completely different approach.


          • 2. Re: Reg_Extract Confusion
            Jacob Sørensen New Member

            Hi JanLeendert,


            Thank you for your reply. It is several billions of rows with personal written text from different people. In other words, the string/text could be anything.


            I have solved the issue. I came up with the idea of doing it backwards. Somehow informatica has decided that reg_replace can use regular expressions without needing to use all of the string (in constrast to reg_extract). So I just removed all other characters which is not a number. Finally i could use reg_extract to fetch the portion of the number i needed.


            It is beyond my comprehension why informatica has decded that reg_extract must use all of the string (with subpatterns). It simply makes no sense and frankly almost defeats the purpose of using regular expression.

            • 3. Re: Reg_Extract Confusion
              Nico Heinze Guru

              Assuming there is always exactly one group of digits in your input string, the following REG_EXTRACT() should work fine:


              REG_EXTRACT( inputstring, '\(.*\)\(\d+\)\(.*\)', 2, 0)


              This will return the second sub-pattern from a match.


              Now what does this regex mean?

              It consists of three parts:

              - '\(.*\)' = any characters at the beginning of the string.

              - '\(\d+\)' = any sequence of at least one digit 0-9.

              - '\(.*\)' = another sequence of arbitrary characters.


              Granted, in this particular example the \( and \) around the first and third pattern are superfluous, but it may help in future if you need to extract data from this starting or ending part of text.


              But, as JanLeendert explained, this is the solution for one particular case. If your use case is a little more complex, then this approach probably will not be sufficient for you. We need more details from you.




              • 4. Re: Reg_Extract Confusion
                Nico Heinze Guru

                Sorry to be pernickety again, but I have the impression some explanation is useful here.


                There is a very good reason why the complete pattern must match the whole input string and not only a part of it (as, for example, on www.regex101.com which is very useful but..). The reason is actually pretty simple:


                When you take a look at the "original" regular expressions under Unix (Basic Regular Expressions, BREs, and Extended Regular Expressions, EREs), you will find that they have been set up to match a complete input string. Not only any subpattern of it.

                And to be honest, for me this behaviour makes sense. When, for example, I want to check whether a file name fits some pattern (e.g. '.*_20210719.*.txt'), then I want a file london_20210719.txt to match this pattern, but a file london_20210719.txt.bkp simply does not fit the pattern, so it should not match.

                If, however, you expect the pattern to be compared against any part of the string, then london_20210719.txt.bkp would match. And frankly spoken this would lead to behaviour that an old-fashioned developer like me simply would not expect and which would cause me bad headaches and sleepless nights until I found the reason.


                Of course, one can argue that always having to set a .* in front of and after my actual search pattern is boring.

                On the other hand this behaviour works as expected in Unix/Linux commands (Perl and www.regex101.com are the most notable exceptions I know), and in my opinion it's not bad to prepend and append the ,* to the search pattern if I really want to have all matching substrings found.

                On the other hand I have to explicitly indicate ^ and $ for the begin and end of a search string if I know that the search pattern may match anywhere in the string. And this does not necessarily make things clearer for me.


                One way or the other, you will always find people who prefer one way to the other. So a software company simply has to make a decision which way to go. And, as mentioned above, personally I find the way how Informatica implemented it clearer than to always keep in my mind, "prepend ^ and append $ if I want to match the whole string".


                I hope by now you can comprehend why Informatica has made this decision. Whether you like it or not, that's a completely different thing I don't intend to discuss here. We all have to live with the implementations, but in this case I do understand why Informatica has decided to go the implemented route. There are many cases where this is not the case.


                One last question from my side: why do you consider this behaviour to almost defeat the purpose of regular expressions? That's something which I cannot comprehend without your explanation, please.