Given your last sentence I think you read the manual carefully.
Basically you have a set of digits followed by alphanumeric characters, i.e. (\d+)\w+.
Is there always one set of numerical characters in your string or can there be multiple sets?
If there is only one set of numerical characters then you can go for a completely different approach.
Thank you for your reply. It is several billions of rows with personal written text from different people. In other words, the string/text could be anything.
I have solved the issue. I came up with the idea of doing it backwards. Somehow Informatica has decided that reg_replace can use regular expressions without needing to match the whole string (in contrast to reg_extract). So I just removed all characters that are not digits. Finally I could use reg_extract to fetch the portion of the number I needed.
It is beyond my comprehension why Informatica has decided that reg_extract must match the whole string (with subpatterns). It simply makes no sense and frankly almost defeats the purpose of using regular expressions.
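For anyone following along, the two-step idea above can be sketched roughly as follows. This is Python rather than Informatica expression syntax, with re.sub standing in for reg_replace and re.fullmatch standing in for the whole-string behaviour of reg_extract. The helper names and the choice of "the first N digits" are assumptions for illustration only; the post does not say which portion of the number was needed.

```python
import re
from typing import Optional


def digits_only(text: str) -> str:
    """Step 1 (the reg_replace idea): drop every character that is not a digit."""
    return re.sub(r"\D", "", text)


def leading_digits(text: str, count: int) -> Optional[str]:
    """Step 2 (the reg_extract idea): match the whole cleaned string and
    return the first `count` digits, or None if there are too few digits.

    `count` is a hypothetical parameter, not something from the original post.
    """
    cleaned = digits_only(text)
    # The pattern must cover the whole cleaned string: the wanted digits,
    # followed by whatever digits remain.
    match = re.fullmatch(rf"(\d{{{count}}})(\d*)", cleaned)
    return match.group(1) if match else None
```

Note that stripping the non-digits first concatenates separate digit groups into one, so this only makes sense if that is acceptable for the data.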
Assuming there is always exactly one group of digits in your input string, the following REG_EXTRACT() should work fine:
REG_EXTRACT( inputstring, '(\D*)(\d+)(.*)', 2, 0)
This will return the second sub-pattern from a match. REG_EXTRACT uses Perl-compatible syntax, so sub-patterns are delimited by plain parentheses; an escaped \( would match a literal parenthesis.
Now what does this regex mean?
It consists of three parts:
- '(\D*)' = any non-digit characters at the beginning of the string. A greedy '(.*)' here would swallow all but the last digit, leaving only that one digit for the second sub-pattern.
- '(\d+)' = any sequence of at least one digit 0-9.
- '(.*)' = another sequence of arbitrary characters.
Granted, in this particular example the ( and ) around the first and third pattern are superfluous, but it may help in future if you need to extract data from this starting or ending part of text.
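As a rough illustration of how the three sub-patterns behave under whole-string matching, here is a small sketch in Python. Python's re module is only a stand-in here, not Informatica itself, and the function names are made up for the example. Groups are written with plain parentheses, as in Perl-compatible syntax. The sketch also compares a leading (.*) with a leading (\D*), because a greedy first group gives back only as many characters as it must, which leaves just the last digit for the second group.

```python
import re
from typing import Optional


def extract_digits(text: str) -> Optional[str]:
    """Whole-string match; the first group is restricted to non-digits."""
    match = re.fullmatch(r"(\D*)(\d+)(.*)", text)
    return match.group(2) if match else None


def extract_digits_greedy(text: str) -> Optional[str]:
    """Same idea with a greedy (.*) in front, shown for comparison only."""
    match = re.fullmatch(r"(.*)(\d+)(.*)", text)
    return match.group(2) if match else None


if __name__ == "__main__":
    print(extract_digits("abc123def"))         # full digit group
    print(extract_digits_greedy("abc123def"))  # only the last digit
    print(extract_digits("no digits"))         # no match at all
```

Whether the fourth argument of REG_EXTRACT changes any of this in a given Informatica version is something to check against the product documentation.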
But, as JanLeendert explained, this is the solution for one particular case. If your use case is a little more complex, then this approach probably will not be sufficient for you. We need more details from you.
Sorry to be pernickety again, but I have the impression some explanation is useful here.
There is a very good reason why the complete pattern must match the whole input string and not only a part of it (as, for example, on www.regex101.com which is very useful but..). The reason is actually pretty simple:
When you take a look at the "original" regular expressions under Unix (Basic Regular Expressions, BREs, and Extended Regular Expressions, EREs), you will find that they have been set up to match a complete input string. Not only any subpattern of it.
And to be honest, for me this behaviour makes sense. When, for example, I want to check whether a file name fits some pattern (e.g. '.*_20210719.*.txt'), then I want a file london_20210719.txt to match this pattern, but a file london_20210719.txt.bkp simply does not fit the pattern, so it should not match.
If, however, you expect the pattern to be compared against any part of the string, then london_20210719.txt.bkp would match. And frankly this would lead to behaviour that an old-fashioned developer like me simply would not expect and which would cause me bad headaches and sleepless nights until I found the reason.
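The file name example can be tried out with a short Python sketch. Python is used only as a convenient stand-in: re.fullmatch plays the role of whole-string matching, re.search the role of matching any part of the string, and the function names are invented for this illustration. The pattern is the one from the post, unchanged.

```python
import re

# Pattern taken verbatim from the example above.
PATTERN = r".*_20210719.*.txt"


def matches_whole(name: str) -> bool:
    """True only if the pattern covers the complete file name."""
    return re.fullmatch(PATTERN, name) is not None


def matches_anywhere(name: str) -> bool:
    """True if the pattern matches any part of the file name."""
    return re.search(PATTERN, name) is not None


def matches_anchored(name: str) -> bool:
    """Substring search made strict again by adding ^ and $ explicitly."""
    return re.search("^(?:" + PATTERN + ")$", name) is not None


if __name__ == "__main__":
    for name in ("london_20210719.txt", "london_20210719.txt.bkp"):
        print(name, matches_whole(name), matches_anywhere(name), matches_anchored(name))
```

The .bkp file is rejected by the whole-string and anchored variants but accepted by the plain substring search, which is exactly the difference being discussed.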
Of course, one can argue that always having to set a .* in front of and after my actual search pattern is boring.
On the other hand this behaviour works as expected in Unix/Linux commands (Perl and www.regex101.com are the most notable exceptions I know), and in my opinion it's not bad to prepend and append the .* to the search pattern if I really want to have all matching substrings found.
With substring matching, in contrast, I have to explicitly add ^ and $ for the beginning and end of the string whenever the search pattern could otherwise match anywhere in it. And this does not necessarily make things clearer for me.
One way or the other, you will always find people who prefer one way to the other. So a software company simply has to decide which way to go. And, as mentioned above, personally I find the way Informatica implemented it clearer than always having to keep in mind, "prepend ^ and append $ if I want to match the whole string".
I hope by now you can comprehend why Informatica has made this decision. Whether you like it or not is a completely different matter that I don't intend to discuss here. We all have to live with the implementations, but in this case I do understand why Informatica has decided to go the implemented route. There are many other cases where I do not.
One last question from my side: why do you consider this behaviour to almost defeat the purpose of regular expressions? That's something which I cannot comprehend without your explanation, please.