String Pattern Matching
Applies to: ✅ Data 360 SQL ✅ Tableau Hyper API
There are two approaches to pattern matching.
- SQL
LIKEoperator - Regular expressions
The LIKE expression returns true if the <string> matches the supplied <pattern>. (As expected, the NOT LIKE expression returns false if LIKE returns true, and vice versa. An equivalent expression is NOT (string LIKE pattern).)
If <pattern> doesn’t contain percent signs or underscores, then the pattern only represents the string itself. In that case, LIKE acts like the equals operator. An underscore (_) in <pattern> matches any single character. A percent sign (%) matches any sequence of zero or more characters.
Some examples:
LIKE pattern matching always covers the entire string. To match a sequence anywhere within a string, start and end the pattern with a percent sign.
To match a literal underscore or percent sign without matching other characters, precede respective character in <pattern> by the escape character. The default escape character is the backslash, but a different one can be selected by using the ESCAPE clause. To match the escape character itself, write two escape characters.
It’s also possible to use no escape character by writing ESCAPE '', which disables the ability to escape characters.
The keyword ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale.
The operator ~~ is equivalent to LIKE, and ~~* corresponds to ILIKE. There are also !~~ and !~~* operators that represent NOT LIKE and NOT ILIKE, respectively.
Regular expressions are a more powerful language for pattern matching than that supported by the LIKE operator.
They allow for describing complex patterns, such as repeated substrings, the absence of patterns, classes of characters, and more.
The available operators for pattern matching with regular expressions are:
| Operator | Description | Example |
|---|---|---|
~ | Matches regular expression, case sensitive | 'thomas' ~ '.*thomas.*' |
~* | Matches regular expression, case insensitive | 'thomas' ~* '.*Thomas.*' |
!~ | Doesn’t match regular expression, case sensitive | 'thomas' !~ '.*Thomas.*' |
!~* | Does not match regular expression, case insensitive | 'thomas' !~* '.*vadim.*' |
Some examples:
For information about the syntax Hyper supports, see Regular Expression Syntax.
With regular expression functions you can search for, extract, replace, and count substrings in text using regular expression patterns. This table summarizes the available functions, their purposes, and example usage.
| Function | Return Type | Description | Example |
|---|---|---|---|
regexp_like | boolean | Returns true if the string matches the regular expression pattern. | regexp_like('Data Cloud', 'Cloud') → true |
regexp_replace | text | Replaces substrings matching the pattern with a replacement string. The replacement string can contain \N, where N is 1 through 9, to show where the pattern is inserted into the source substring. | regexp_replace('foo123bar', '\d+', 'X') → 'fooXbar' |
regexp_instr | int | Returns the position of the substring matching the regular expression. Returns 0 if there’s no substring match. | regexp_instr('abc123def', '\d+') → 4 |
regexp_substr | text | Returns the substring matching the regular expression pattern. Returns NULL if there is no match. | regexp_substr('abc123def', '\d+') → '123' |
regexp_count | int | Returns the number of times the pattern occurs in the string. | regexp_count('a1b2c3', '\d') → 3 |
Most characters (letters a-z, A-Z and digits 0-9) have no special meaning and match themselves. However, certain metacharacters (. \ + * ? ^ $ | ( ) [ ] { }) have special semantics:
- A dot (
.) matches any single character (except newline characters). - A pipe (
|) matches both expressions to either side of it.
The rest of this sections describes the remaining metacharacters.
To match metacharacters literally, escape them using a backslash \.
For example, \+ matches a literal plus sign, and \\ matches a literal backslash.
| Pattern | Matches |
|---|---|
hello | "hello" |
c.t | "cat", "cot", "c@t" |
dog|cat | "dog", "cat" |
\. | "." |
Quantifiers (*, +, ?, { }) specify how many times the preceding (sub-)pattern must occur to match. By default, quantifiers are greedy, meaning they try to match as much text as possible.
| Quantifier | Description | Example | Matches |
|---|---|---|---|
* | Zero or more | ab*c | ac, abc, abbc, abbbc, ... |
+ | One or more | ab+c | abc, abbc, abbbc, ... |
? | Zero or one | ab?c | ac, abc |
{n} | Exactly n | ab{2}c | abbc |
{n,} | At least n | ab{2,}c | abbc, abbbc |
{n,m} | At least n but not more than m | ab{1,2}c | abc, abbc |
The minimum and maximum number of repetitions for the counting forms ({n}, {n,}, {n,m}) is 1000.
The ? metacharacter turns a quantifier into its non-greedy ("lazy" or "reluctant") variant.
Lazy quantifiers try to match as few characters as possible.
For example, in the string <b>bold</b>, the (greedy) pattern <.*> matches the entire string, whereas the (lazy) pattern <.*?> only matches <b>.
| Quantifier | Description |
|---|---|
*? | Zero or more, prefer fewer |
+? | One or more, prefer one |
?? | Zero or one, prefer zero |
{n}? | Exactly n (lazy and greedy variants are identical) |
{n,}? | At least n, prefer fewer |
{n,m}? | At least n but not more than m, prefer fewer |
Groups ( (...) ) combine characters and metacharacters into a single unit. You can use them to apply quantifiers to sub-patterns.
| Pattern | Matches |
|---|---|
(abc) | abc |
(a|b)c | ac, bc |
a(b*)c | ac, abc, abbc, abbbc, ... |
Additionally, you can use flags to modify the behavior of certain pattern groups via the (?flags:pattern) syntax.
The following flags are supported:
| Flag | Description |
|---|---|
i | Character matches are case-insensitive |
c | Character matches are case-sensitive |
q | Treats the characters as string literal characters |
g | All matches at or after the start position are replaced |
For example, (?i:abc) sets the case-insensitive flag, so it matches abc, ABC, aBc, and more.
Character classes are a shorthand for common group patterns.
The most general form of a character class is [ ], matching any character contained in the square brackets.
For example, you can abbreviate the pattern (a|e|i|o|u) to [aeiou].
Contiguous ranges of characters can be shortened by using a hyphen: [a-e] is equivalent to [abcde].
The range notation for character classes ([a-e]) expands to the sequence of characters defined by their Unicode code points, disregarding locale-specific collations.
You can negate character classes by placing a caret (^) after the opening bracket.
For example, [^aeiou] matches any character that is not a vowel.
Most metacharacters have no special meaning inside a character class (e.g., [.] matches a literal dot). The main exceptions are \\, ^ (when at the start for negation), - (for ranges), and ]. To include a literal ] in a character class, put it first (e.g., []abc]) or escape it [ab\\]c]. To include a literal -, put it first or last, or escape it.
Some examples:
| Pattern | Equivalent to |
|---|---|
[aeiou] | (a|e|i|o|u) |
[a-zA-Z] | (a|b\ ... | z | A | B | ... | Z) |
[^f-zF-Z] | N/A |
Some common character classes are predefined and can be referred to by name and/or a shorthand.
Note that the backslash-escaped shorthands don't use the enclosing [] brackets.
| Pattern | Description |
|---|---|
[[:alpha:]] | alphabetic ([A-Za-z]) |
[[:upper:]] | upper case ([A-Z]) |
[[:lower:]] | lower case ([a-z]) |
[[:digit:]], \d | digits ([0-9]) |
[[:xdigit:]] | hexadecimal digits ([0-9A-Fa-f]) |
\D | non-digits ([^0-9]) |
[[:alnum:]] | alphanumeric ([0-9A-Za-z]) |
[[:word:]], \w | word characters ([0-9A-Za-z_]) |
\W | non-word characters ([^0-9A-Za-z_]) |
[[:space:]], \s | whitespace ([\t\n\f\r ]) |
\S | non-whitespace ([^ \t\n\f\r]) |
[[:blank:]] | blankspace ([\t ]) |
[[:punct:]] | punctuation ([!-/:-@[\-\`{-~]) |
Anchors match a position, rather than a character.
| Anchor | Description |
|---|---|
^ | Start of string (or line, if m flag is set) |
\A | Start of string |
$ | End of string (or line, if m flag is set) |
\z | End of string |
\b | ASCII word boundary (\w on one side; \W, \A, or \z on the other) |
\B | Not an ASCII word boundary |
These patterns can be used to anchor a match to a specific position in a string. For example, you can ensure that a given string fully matches a pattern by surrounding it with ^ and $ anchors.