String Pattern Matching

Applies to: ✅ Data Cloud SQL ✅ Tableau Hyper API

There are two approaches to pattern matching.

  • SQL LIKE operator
  • SIMILAR TO operator

The LIKE expression returns true if the <string> matches the supplied <pattern>. (As expected, the NOT LIKE expression returns false if LIKE returns true, and vice versa. An equivalent expression is NOT (string LIKE pattern).)

If <pattern> doesn’t contain percent signs or underscores, then the pattern only represents the string itself. In that case, LIKE acts like the equals operator. An underscore (_) in <pattern> matches any single character. A percent sign (%) matches any sequence of zero or more characters.

Some examples:

LIKE pattern matching always covers the entire string. To match a sequence anywhere within a string, start and end the pattern with a percent sign.

To match a literal underscore or percent sign without matching other characters, precede respective character in <pattern> by the escape character. The default escape character is the backslash, but a different one can be selected by using the ESCAPE clause. To match the escape character itself, write two escape characters.

It’s also possible to use no escape character by writing ESCAPE '', which disables the ability to escape characters.

The keyword ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale.

The operator ~~ is equivalent to LIKE, and ~~* corresponds to ILIKE. There are also !~~ and !~~* operators that represent NOT LIKE and NOT ILIKE, respectively.

Regular expressions are a more powerful language for pattern matching than that supported by the LIKE operator. They allow for describing complex patterns, such as repeated substrings, the absence of patterns, classes of characters, and more.

The available operators for pattern matching with regular expressions are:

OperatorDescriptionExample
~Matches regular expression, case sensitive'thomas' ~ '.*thomas.*'
~*Matches regular expression, case insensitive'thomas' ~* '.*Thomas.*'
!~Doesn’t match regular expression, case sensitive'thomas' !~ '.*Thomas.*'
!~*Does not match regular expression, case insensitive'thomas' !~* '.*vadim.*'

Some examples:

For information about the syntax Hyper supports, see Regular Expression Syntax.

The regexp_replace function provides substitution of new text for substrings that match POSIX regular expression patterns. It has the syntax regexp_replace(source, pattern, replacement[, flags ]). The source string is returned unchanged if there isn’t a match to the pattern. If there is a match, the source string is returned with the replacement string substituted for the matching substring.

The replacement string can contain \N, where N is 1 through 9, to show where the pattern is inserted into the source substring. Write \\ to put a literal backslash in the replacement text.

pattern is searched from the beginning of the string. By default, only the first match of the pattern is replaced. If the g flag is given, then all matches at or after the start position are replaced. The i flag enables case-insensitive matching. Flags can be combined in a single string.

Some examples:

Most characters (letters a-z, A-Z and digits 0-9) have no special meaning and match themselves. However, certain metacharacters (. \ + * ? ^ $ | ( ) [ ] { }) have special semantics:

  • A dot (.) matches any single character (except newline characters).
  • A pipe (|) matches both expressions to either side of it.

The rest of this sections describes the remaining metacharacters.

To match metacharacters literally, escape them using a backslash \. For example, \+ matches a literal plus sign, and \\ matches a literal backslash.

PatternMatches
hello"hello"
c.t"cat", "cot", "c@t"
dog|cat"dog", "cat"
\."."

Quantifiers (*, +, ?, { }) specify how many times the preceding (sub-)pattern must occur to match. By default, quantifiers are greedy, meaning they try to match as much text as possible.

QuantifierDescriptionExampleMatches
*Zero or moreab*cac, abc, abbc, abbbc, ...
+One or moreab+cabc, abbc, abbbc, ...
?Zero or oneab?cac, abc
{n}Exactly nab{2}cabbc
{n,}At least nab{2,}cabbc, abbbc
{n,m}At least n but not more than mab{1,2}cabc, abbc

The minimum and maximum number of repetitions for the counting forms ({n}, {n,}, {n,m}) is 1000.

The ? metacharacter turns a quantifier into its non-greedy ("lazy" or "reluctant") variant. Lazy quantifiers try to match as few characters as possible. For example, in the string <b>bold</b>, the (greedy) pattern <.*> matches the entire string, whereas the (lazy) pattern <.*?> only matches <b>.

QuantifierDescription
*?Zero or more, prefer fewer
+?One or more, prefer one
??Zero or one, prefer zero
{n}?Exactly n (lazy and greedy variants are identical)
{n,}?At least n, prefer fewer
{n,m}?At least n but not more than m, prefer fewer

Groups ( (...) ) combine characters and metacharacters into a single unit. You can use them to apply quantifiers to sub-patterns.

PatternMatches
(abc)abc
(a|b)cac, bc
a(b*)cac, abc, abbc, abbbc, ...

Additionally, you can use groups to modify the behavior of certain patterns via the (?flags:pattern) syntax. The following flags are supported:

FlagDescription
iCharacter matches are case-insensitive
m^ and $ match the start and end of a line, respectively (see Anchors)
sDot (.) matches newlines
UQuantifiers match lazily by default

For example, (?i:abc) sets the case-insensitive flag, so it matches abc, ABC, aBc, and more.

Character classes are a shorthand for common group patterns.

The most general form of a character class is [ ], matching any character contained in the square brackets. For example, you can abbreviate the pattern (a|e|i|o|u) to [aeiou]. Contiguous ranges of characters can be shortened by using a hyphen: [a-e] is equivalent to [abcde].

The range notation for character classes ([a-e]) expands to the sequence of characters defined by their Unicode code points, disregarding locale-specific collations.

You can negate character classes by placing a caret (^) after the opening bracket. For example, [^aeiou] matches any character that is not a vowel.

Most metacharacters have no special meaning inside a character class (e.g., [.] matches a literal dot). The main exceptions are \\, ^ (when at the start for negation), - (for ranges), and ]. To include a literal ] in a character class, put it first (e.g., []abc]) or escape it [ab\\]c]. To include a literal -, put it first or last, or escape it.

Some examples:

PatternEquivalent to
[aeiou](a|e|i|o|u)
[a-zA-Z](a|b\ ... | z | A | B | ... | Z)
[^f-zF-Z]N/A

Some common character classes are predefined and can be referred to by name and/or a shorthand. Note that the backslash-escaped shorthands don't use the enclosing [] brackets.

PatternDescription
[[:alpha:]]alphabetic ([A-Za-z])
[[:upper:]]upper case ([A-Z])
[[:lower:]]lower case ([a-z])
[[:digit:]], \ddigits ([0-9])
[[:xdigit:]]hexadecimal digits ([0-9A-Fa-f])
\Dnon-digits ([^0-9])
[[:alnum:]]alphanumeric ([0-9A-Za-z])
[[:word:]], \wword characters ([0-9A-Za-z_])
\Wnon-word characters ([^0-9A-Za-z_])
[[:space:]], \swhitespace ([\t\n\f\r ])
\Snon-whitespace ([^ \t\n\f\r])
[[:blank:]]blankspace ([\t ])
[[:punct:]]punctuation ([!-/:-@[\-\`{-~])

Anchors match a position, rather than a character.

AnchorDescription
^Start of string (or line, if m flag is set)
\AStart of string
$End of string (or line, if m flag is set)
\zEnd of string
\bASCII word boundary (\w on one side; \W, \A, or \z on the other)
\BNot an ASCII word boundary

These patterns can be used to anchor a match to a specific position in a string. For example, you can ensure that a given string fully matches a pattern by surrounding it with ^ and $ anchors.