1 Reply Latest reply on Oct 12, 2021 6:23 AM by Nico Heinze

    Website Validation using REG_MATCH

    Sanchit Sharma New Member

      Hi Folks,

      I have a query about filtering out websites based on a few rules.

      Any combination of numbers, letters and hyphens is acceptable.

      --Any letters of the alphabet

      --Any numbers 0 to 9

      --Hyphens are acceptable; Domain names cannot begin or end with a hyphen

      --Other forms of punctuation, symbols or accent characters cannot be used

      --Does not have to begin with www, http, https

      --It's not stated but periods (.) are also acceptable


      A few examples:
      http://www.abc.com  --  accepted
      http://abc.com   -- accepted

      https://www.abc.com  --  accepted

      https://abc.com   -- accepted

      www.abc.com --  accepted

      Accented characters such as acute (é), grave (è), circumflex (â, î or ô), tilde (ñ)  --  not accepted
      htt3ps://abc.com  -- not accepted


      I am using an Expression transformation with this expression:

      IIF(NOT(ISNULL(Website)),

      IIF(REG_MATCH(Website, '[a-zA-Z0-9]*'),

      IIF(INSTR(Website,'http')=1,'',IIF(INSTR(Website,'www')=1,'','Website is invalid.'))

      ,'')

      ,'')


      Can you please enhance this rule to meet my requirement?

      Regards,
      Sanchit

        • 1. Re: Website Validation using REG_MATCH
          Nico Heinze Guru

          I would choose a different approach instead of trying to get it all in one single regular expression (not only for performance reasons but also for the sake of maintainability and readability).

           

          First I would check whether the string begins with "www." or "http://" or "https://"; in all three cases I would cut off this initial part and continue with the remainder for the following steps.

          If the string starts with none of these parts, take the whole input string as the "remainder" for the following steps.
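
          As a rough illustration of this first step, here is a minimal sketch in Python rather than in the Informatica expression language. The helper name strip_prefix and the PREFIXES tuple are made up for this example, and the comparison is case-sensitive, which is an assumption. In an Expression transformation the same logic would presumably be built from INSTR() and SUBSTR().

```python
# Hypothetical helper: the three prefixes are the ones named in the reply above.
PREFIXES = ("https://", "http://", "www.")


def strip_prefix(website: str) -> str:
    """Cut off one leading "https://", "http://" or "www." if present."""
    for prefix in PREFIXES:
        if website.startswith(prefix):
            # Only the first matching prefix is removed; the rest is the remainder.
            return website[len(prefix):]
    # None of the prefixes matched: the whole input is the remainder.
    return website
```

          Note that "http://www.abc.com" leaves "www.abc.com" as the remainder, which is harmless because "www" is then treated like any other part of the name.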

           

          Second, web site validation is a little more complex than the rules stated above suggest.

          For example, dots (=periods) are acceptable but, as far as I know, neither at the beginning nor at the end of a web site string.

          On the other hand each web site name must end in some domain suffix (such as .gov, .com, .de, or whatever), and this suffix is always introduced by a dot.

           

          So my next step would be to check that the web site string ends with a dot "." followed by at least one letter a-z, digit 0-9, or underscore. (I haven't seen any internet domain containing an underscore, but I wouldn't rule it out simply because I don't know better.) In other words, the dot "." must be followed by something that matches the pattern "[a-z0-9_]+".

          If the web site string ends with a dot followed by this pattern, fine, take everything before the dot and continue with the next step.

          If not, then the string is not acceptable anyway.
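
          The suffix check could look roughly like the following Python sketch. The name split_suffix is hypothetical, and accepting uppercase letters as well as lowercase ones is an assumption on top of the pattern given above.

```python
import re
from typing import Optional, Tuple

# Anything, then the last dot, then at least one letter, digit or underscore.
# Uppercase letters are accepted too, which is an assumption.
SUFFIX = re.compile(r"(.*)\.([A-Za-z0-9_]+)")


def split_suffix(remainder: str) -> Optional[Tuple[str, str]]:
    """Return (part before the last dot, domain suffix), or None if unacceptable."""
    match = SUFFIX.fullmatch(remainder)
    if match is None:
        # No dot, or nothing valid after the last dot.
        return None
    return match.group(1), match.group(2)
```

          Because the first group is greedy, the split always happens at the last dot, so "www.abc.com" yields "www.abc" and "com".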

           

          Now the remainder (before the dot followed by the internet domain) must look roughly like this:

          - at least one letter or digit or underscore,

          - any number of "groups" consisting of the following items:

              - a dot,

              - at least one letter or digit or underscore.
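
          The structure listed above translates fairly directly into a single pattern. This is only a sketch in Python; the names LABEL, NAME and is_valid_name are invented for the example, and uppercase letters are again accepted by assumption.

```python
import re

# "At least one letter or digit or underscore"
LABEL = r"[A-Za-z0-9_]+"
# One such part, followed by any number of groups of "a dot plus one such part"
NAME = re.compile(rf"{LABEL}(?:\.{LABEL})*")


def is_valid_name(name: str) -> bool:
    """Check the part of the web site string before the domain suffix."""
    # fullmatch forces the pattern to cover the whole string, so leading,
    # trailing or doubled dots and any other characters are rejected.
    return NAME.fullmatch(name) is not None
```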

           

          Using the functions SUBSTR(), INSTR(), and REG_MATCH() you should be able to implement all these parts.
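
          To show how the parts fit together, here is a self-contained Python sketch of the whole check. It is not the Informatica expression itself: the function name validate and the constants are hypothetical, and the return values ('' for an acceptable string, 'Website is invalid.' otherwise) are borrowed from the expression in the question.

```python
import re

PREFIXES = ("https://", "http://", "www.")
INVALID = "Website is invalid."

# Letters, digits and underscores only (uppercase accepted by assumption).
WORD = re.compile(r"[A-Za-z0-9_]+")
# One word, followed by any number of ".word" groups.
NAME = re.compile(r"[A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)*")


def validate(website: str) -> str:
    """Return '' if the string looks like a web site, else an error text."""
    # Step 1: cut off one leading prefix, if any.
    rest = website
    for prefix in PREFIXES:
        if rest.startswith(prefix):
            rest = rest[len(prefix):]
            break
    # Step 2: split at the last dot and check the domain suffix.
    name, dot, suffix = rest.rpartition(".")
    if not dot or WORD.fullmatch(suffix) is None:
        return INVALID
    # Step 3: check the part before the suffix.
    if NAME.fullmatch(name) is None:
        return INVALID
    return ""
```

          With these rules the accepted examples from the question return '' and "htt3ps://abc.com" is rejected. The sketch follows the patterns described in this reply, so it does not accept hyphens; to honour the hyphen rule from the question, the character class would have to be extended and a check added that no part begins or ends with a hyphen.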

           

          Does that help? If not, please ask again.

           

          Regards,

          Nico