Regular-expression constructs

The regular expressions used in searches and segmentation rules are those supported by Java. If you need more specific information, please consult http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.

You can find simple tutorials on the web (for example, http://www.regular-expressions.info/quickstart.html).


The construct...

...matches the following:


Flags

(?i)

Enables case-insensitive matching (by default, the pattern is case-sensitive).
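
As a rough illustration of the flag, the following Java sketch runs the same search with and without (?i). The class name CaseFlagSketch and the helper found are made up for this example and are not part of any real API.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CaseFlagSketch {
    // True if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("hello", "Hello world"));     // false: case-sensitive by default
        System.out.println(found("(?i)hello", "Hello world")); // true: the flag ignores case
    }
}
```

Note that in Java string literals every backslash of a regular expression has to be doubled; this applies only to source code, not to expressions typed into a search field.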


Characters

x

The character x, except the following...

\uhhhh

The character with hexadecimal value 0xhhhh

\t

The tab character ('\u0009')

\n

The newline (line feed) character ('\u000A')

\r

The carriage-return character ('\u000D')

\f

The form-feed character ('\u000C')

\a

The alert (bell) character ('\u0007')

\e

The escape character ('\u001B')

\cx

The control character corresponding to x

\0n

The character with octal value 0n (0 <= n <= 7)

\0nn

The character with octal value 0nn (0 <= n <= 7)

\0mnn

The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)

\xhh

The character with hexadecimal value 0xhh
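
The escapes above can be tried out with a small Java sketch such as the one below. It assumes a made-up helper class EscapeSketch; the sample strings are arbitrary.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class EscapeSketch {
    // True if the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("a\\tb", "a\tb"));   // tab escape
        System.out.println(matches("\\x41", "A"));      // hexadecimal 41 is the letter A
        System.out.println(matches("\\0101", "A"));     // octal 101 is also the letter A
        System.out.println(matches("caf\\u00e9", "caf\u00e9")); // four-digit hexadecimal escape
    }
}
```

All four lines should print true, since each escape denotes exactly the character on the right-hand side.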


Quotation

\

Nothing, but quotes the following character. This is required if you would like to enter one of the metacharacters !$()*+.<>?[\]^{|} so that it matches as itself.

\\

For example, this matches a single backslash character

\Q

Nothing, but quotes all characters until \E

\E

Nothing, but ends quoting started by \Q
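
To show why quoting matters, the following Java sketch searches for the literal text 1+1=2. Unquoted, the + is read as a quantifier and the search fails. QuoteSketch and found are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class QuoteSketch {
    // True if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "1+1=2";
        System.out.println(found("1+1=2", text));        // false: + means "one or more 1"
        System.out.println(found("1\\+1=2", text));      // true: \+ is a literal plus sign
        System.out.println(found("\\Q1+1=2\\E", text));  // true: everything between \Q and \E is literal
        System.out.println(found("\\\\", "C:\\temp"));   // true: \\ matches one backslash
    }
}
```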


Classes for Unicode blocks and categories

\p{InGreek}

A character in the Greek block (simple block)

\p{Lu}

An uppercase letter (simple category)

\p{Sc}

A currency symbol

\P{InGreek}

Any character except one in the Greek block (negation)

[\p{L}&&[^\p{Lu}]]

Any letter except an uppercase letter (subtraction)
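
A minimal Java sketch of these classes follows. The test characters (Greek alpha, the euro sign) are arbitrary picks, and UnicodeClassSketch is a made-up name.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class UnicodeClassSketch {
    // True if the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1"; // Greek small letter alpha
        String euro = "\u20ac";  // euro sign
        System.out.println(matches("\\p{InGreek}", alpha));        // true
        System.out.println(matches("\\P{InGreek}", alpha));        // false
        System.out.println(matches("\\p{Lu}", "A"));               // true
        System.out.println(matches("\\p{Sc}", euro));              // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));  // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));  // false
    }
}
```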


Character classes

[abc]

a, b, or c (simple class)

[^abc]

Any character except a, b, or c (negation)

[a-zA-Z]

a through z or A through Z, inclusive (range)
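
As one possible use of a negated range, the Java sketch below strips everything that is not an ASCII letter from a string. The class CharClassSketch and its methods are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CharClassSketch {
    // Removes every character that is not in the range a-z or A-Z.
    static String keepLetters(String text) {
        return text.replaceAll("[^a-zA-Z]", "");
    }

    // True if the single-character text belongs to the class [abc].
    static boolean isAbc(String text) {
        return Pattern.matches("[abc]", text);
    }

    public static void main(String[] args) {
        System.out.println(keepLetters("R2-D2!")); // RD
        System.out.println(isAbc("b"));            // true
        System.out.println(isAbc("d"));            // false
    }
}
```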


Predefined character classes

.

Any character (except for line terminators)

\d

A digit: [0-9]

\D

A non-digit: [^0-9]

\s

A whitespace character: [ \t\n\x0B\f\r]

\S

A non-whitespace character: [^\s]

\w

A word character: [a-zA-Z_0-9]

\W

A non-word character: [^\w]
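
The predefined classes can be explored with a sketch like the following, which lists every single-character match in a string. PredefinedClassSketch and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class PredefinedClassSketch {
    // Collects every match of the regex in the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d", "Room 42B")); // [4, 2]
        System.out.println(findAll("\\w", "a_1!"));     // [a, _, 1]
        System.out.println(findAll("\\W", "a-b"));      // [-]
        System.out.println(findAll(".", "a\nb"));       // [a, b] (the line feed is skipped)
    }
}
```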


Boundary matchers

^

The beginning of a line

$

The end of a line

\b

A word boundary

\B

A non-word boundary
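
The following Java sketch illustrates the boundary matchers on a few sample strings. One caveat worth labelling: in plain Java, ^ and $ anchor to the start and end of the whole input unless the multiline flag (?m) is set; whether that flag is active in a given search field is an assumption you would need to check. BoundarySketch is a made-up name.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class BoundarySketch {
    // True if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("\\bcat\\b", "a cat sat"));   // true: cat is a whole word
        System.out.println(found("\\bcat\\b", "concatenate")); // false: cat is inside a word
        System.out.println(found("\\Bcat\\B", "concatenate")); // true
        System.out.println(found("^cat", "a cat"));            // false: not at the beginning
        System.out.println(found("sat$", "a cat sat"));        // true: at the end
        System.out.println(found("^cat", "dog\ncat"));         // false without (?m)
        System.out.println(found("(?m)^cat", "dog\ncat"));     // true with (?m)
    }
}
```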


Greedy quantifiers

These will match as much as they can. For example, a+ will match aaa in aaabbb.

X?

X, once or not at all

X*

X, zero or more times

X+

X, one or more times
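
A small Java sketch of greedy matching, using the aaabbb example from above plus a tag-like string. GreedySketch and firstMatch are names invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class GreedySketch {
    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("a+", "aaabbb"));        // aaa
        System.out.println(firstMatch("a?", "aaabbb"));        // a
        System.out.println(firstMatch("a*", "aaabbb"));        // aaa
        System.out.println(firstMatch("<.+>", "<b>bold</b>")); // <b>bold</b>
    }
}
```

The last line shows a common surprise: the greedy .+ runs to the final > in the string, not the nearest one.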


Reluctant (non-greedy) quantifiers

These will match as little as they can. For example, a+? will match the first a in aaabbb.

X??

X, once or not at all

X*?

X, zero or more times

X+?

X, one or more times
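
The same kind of sketch with reluctant quantifiers shows the difference. ReluctantSketch and firstMatch are, again, hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ReluctantSketch {
    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("a+?", "aaabbb"));        // a
        System.out.println(firstMatch("<.+?>", "<b>bold</b>")); // <b>
        // a*? and a?? are satisfied by zero occurrences, so they match the empty string.
        System.out.println(firstMatch("a*?", "aaabbb").isEmpty()); // true
        System.out.println(firstMatch("a??", "aaabbb").isEmpty()); // true
    }
}
```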


Logical operators

XY

X followed by Y

X|Y

Either X or Y

(XY)

XY as a single group
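
To illustrate how grouping changes what a quantifier or an alternative applies to, here is a brief Java sketch. LogicalSketch is a made-up name and the sample words are arbitrary.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class LogicalSketch {
    // True if the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("cat|dog", "dog"));    // true: either alternative
        System.out.println(matches("gr(a|e)y", "grey"));  // true: the choice is limited to the group
        System.out.println(matches("gr(a|e)y", "groy"));  // false
        System.out.println(matches("(ab)+", "ababab"));   // true: + repeats the whole group
        System.out.println(matches("ab+", "ababab"));     // false: + repeats only b
        System.out.println(matches("ab+", "abbb"));       // true
    }
}
```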


