Borza Industries: RegexDroid help

Regular expressions

In computing, regular expressions, also referred to as regex or regexp, provide a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

The following examples illustrate a few specifications that could be expressed in a regular expression:

The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
The word "car" when it appears as an isolated word
The word "car" when preceded by the word "blue" or "red"
The word "car" when not preceded by the word "motor"
A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$10" or "$245.99").

Regular expressions can be much more complex than these examples.
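
As a rough illustration, the specifications listed above could be written as the following Java patterns. This is only a sketch: the class name CarExamples and its field names are made up for this example, a single space is assumed between words, and the behavior shown is that of the standard java.util.regex package, which may differ from ICU in rare cases.

```java
import java.util.regex.Pattern;

final class CarExamples {
    // "car" appearing consecutively in any context.
    static final Pattern CONSECUTIVE = Pattern.compile("car");
    // 'c', 'a', 'r' in that order; ".*" allows any characters in between (including none).
    static final Pattern IN_ORDER = Pattern.compile("c.*a.*r");
    // "car" as an isolated word, delimited by word boundaries.
    static final Pattern ISOLATED = Pattern.compile("\\bcar\\b");
    // "car" preceded by the word "blue" or "red" and one space (assumption).
    static final Pattern AFTER_COLOR = Pattern.compile("\\b(?:blue|red) car\\b");
    // "car" not preceded by "motor " (negative look-behind).
    static final Pattern NOT_AFTER_MOTOR = Pattern.compile("(?<!motor )\\bcar\\b");
    // Dollar sign, one or more digits, optionally a period and exactly two digits.
    static final Pattern DOLLARS = Pattern.compile("\\$[0-9]+(?:\\.[0-9]{2})?");

    // True if the pattern occurs anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(CONSECUTIVE, "bicarbonate"));      // true
        System.out.println(found(IN_ORDER, "Icelander"));           // true
        System.out.println(found(ISOLATED, "cartoon"));             // false
        System.out.println(found(AFTER_COLOR, "a red car"));        // true
        System.out.println(found(NOT_AFTER_MOTOR, "motor car"));    // false
        System.out.println(DOLLARS.matcher("$245.99").matches());   // true
    }
}
```

Note that every backslash of the regular expression is doubled in the Java string literals.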

A regular expression consists of literal text, meta characters, character
sets, and operators. The latter three have a special meaning when
encountered during the processing of a pattern.

Meta characters are a special means of describing single characters in the
input text. A common example of a meta character is the dot '.', which,
when used in a regular expression, matches any character.

Character sets are a convenient means of describing different characters
that match a single character in the input. Character sets are enclosed in
square brackets '[' and ']' and use the dash '-' to form ranges. A typical
example is "[0-9a-fA-F]", which describes the set of all hexadecimal digits.

Operators modify or combine whole regular expressions, with the result
being a regular expression again. An example of an operator is the asterisk
'*', which, together with the regular expression preceding it, matches zero
or more repetitions of that regular expression. The plus sign '+' is
similar, but requires at least one occurrence.
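
The three building blocks described above can be tried out with a few one-line checks. The sketch below uses a hypothetical class BuildingBlocks and the standard Pattern.matches method, which tests whether the whole input matches the expression.

```java
import java.util.regex.Pattern;

final class BuildingBlocks {
    // Meta character: '.' stands for any single character.
    static boolean dot(String s) {
        return Pattern.matches("c.t", s);
    }

    // Character set with ranges: one or more hexadecimal digits.
    static boolean hex(String s) {
        return Pattern.matches("[0-9a-fA-F]+", s);
    }

    // Operator '*': zero or more repetitions of the preceding 'b'.
    static boolean star(String s) {
        return Pattern.matches("ab*c", s);
    }

    // Operator '+': at least one repetition of the preceding 'b'.
    static boolean plus(String s) {
        return Pattern.matches("ab+c", s);
    }

    public static void main(String[] args) {
        System.out.println(dot("cat"));    // true
        System.out.println(dot("ct"));     // false, '.' needs one character
        System.out.println(hex("7fA0"));   // true
        System.out.println(star("ac"));    // true, zero 'b' characters are fine
        System.out.println(plus("ac"));    // false, '+' needs at least one 'b'
    }
}
```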

Meta characters, the '[' and ']' that form a character set, and operators
normally lose their special meaning when preceded by a backslash '\'. To get
a backslash by itself, use a double backslash. Note that when using regular
expressions in Java source code, some care has to be taken to get the
backslashes right (due to yet another level of escaping being necessary for
Java).
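
The extra level of escaping in Java source code can be illustrated as follows. This is a minimal sketch with a hypothetical class EscapingDemo. Pattern.quote is a standard helper that turns arbitrary text into a literal pattern, which avoids escaping by hand.

```java
import java.util.regex.Pattern;

final class EscapingDemo {
    // Regular expression 3\.14 (a literal dot); the backslash is doubled in Java source.
    static final Pattern LITERAL_DOT = Pattern.compile("3\\.14");

    // Regular expression \\ (one literal backslash); four backslashes in Java source.
    static final Pattern LITERAL_BACKSLASH = Pattern.compile("\\\\");

    // Builds a pattern that matches the given text literally, whatever it contains.
    static Pattern literal(String text) {
        return Pattern.compile(Pattern.quote(text));
    }

    public static void main(String[] args) {
        System.out.println(LITERAL_DOT.matcher("3.14").matches());        // true
        System.out.println(LITERAL_DOT.matcher("3x14").matches());        // false
        System.out.println(LITERAL_BACKSLASH.matcher("C:\\temp").find()); // true
        System.out.println(literal("a.b*").matcher("a.b*").matches());    // true
    }
}
```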

The following table gives some basic examples of regular expressions and
input strings that match them:

Regular expression	Matched string(s)
"Hello, World!"	"Hello, World!"
"Hello, World."	"Hello, World!", "Hello, World?"
"Hello, .*d!"	"Hello, World!", "Hello, Android!", "Hello, Dad!"
"[0-9]+ green bottles"	"0 green bottles", "25 green bottles", "1234 green bottles"

The following sections describe the various features in detail. There are
also some implementation notes at the end.
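
The rows of the table of basic examples above can be checked with a short program. The sketch below (hypothetical class TableExamples) also shows a distinction the table leaves implicit: matches() requires the whole input to match, while find() looks for a match anywhere in the input.

```java
import java.util.regex.Pattern;

final class TableExamples {
    // True if the entire input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if some part of the input matches the regular expression.
    static boolean foundWithin(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("Hello, World.", "Hello, World?"));        // true
        System.out.println(matchesWhole("Hello, .*d!", "Hello, Android!"));        // true
        System.out.println(matchesWhole("[0-9]+ green bottles", "25 green bottles")); // true
        String sentence = "I see 25 green bottles here";
        System.out.println(matchesWhole("[0-9]+ green bottles", sentence));        // false
        System.out.println(foundWithin("[0-9]+ green bottles", sentence));         // true
    }
}
```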

Meta characters

The following table lists the meta characters understood in regular
expressions.

Meta character	Description
\a	Match a BELL, \u0007.
\A	Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input.
\b, outside of a character set	Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored.
\b, within a character set	Match a BACKSPACE, \u0008.
\B	Match if the current position is not a word boundary.
\cX	Match a control-X character (replace X with actual character).
\e	Match an ESCAPE, \u001B.
\E	Ends quoting started by \Q. Meta characters, character classes, and operators become active again.
\f	Match a FORM FEED, \u000C.
\G	Match if the current position is at the end of the previous match.
\n	Match a LINE FEED, \u000A.
\N{UNICODE CHARACTER NAME}	Match the named Unicode character.
\Q	Quotes all following characters until \E. The following text is treated as literal.
\r	Match a CARRIAGE RETURN, \u000D.
\t	Match a HORIZONTAL TABULATION, \u0009.
\uhhhh	Match the character with the hex value hhhh.
\Uhhhhhhhh	Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\x{hhhh}	Match the character with the hex value hhhh. From one to six hex digits may be supplied.
\xhh	Match the character with the hex value hh.
\Z	Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z	Match if the current position is at the end of input.
\0n, \0nn, \0nnn	Match the character with the octal value n, nn, or nnn. Maximum value is 0377.
\n, where n is a number	Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ the total number of capture groups in the pattern. Note: Octal escapes, such as \012, are not supported in ICU regular expressions.
[character set]	Match any one character from the character set. See character sets for a full description of what may appear between the square brackets.
.	Match any character.
^	Match at the beginning of a line.
$	Match at the end of a line.
\	Quotes the following character, so that it loses any special meaning it might have.
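
A few of the meta characters from the table above are shown in action below. This is a sketch only: the class MetaCharDemo and its members are hypothetical, and the results are those of the standard java.util.regex package, where '^' and '$' match at line boundaries only when the multi-line flag (?m) is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharDemo {
    // Back reference: \1 repeats whatever group 1 matched, so this finds doubled words.
    static final Pattern DOUBLED_WORD = Pattern.compile("\\b(\\w+) \\1\\b");

    // \Q ... \E: everything in between is literal, so '+' is not an operator here.
    static final Pattern QUOTED = Pattern.compile("\\Q1+1=2\\E");

    // \A and \z anchor to the very beginning and end of the input.
    static final Pattern WHOLE_INPUT_DIGITS = Pattern.compile("\\A[0-9]+\\z");

    // ^ and $ anchor to line boundaries when the multi-line flag is set.
    static final Pattern LINE_OF_DIGITS = Pattern.compile("(?m)^[0-9]+$");

    // Returns the first word that is immediately repeated, or null if there is none.
    static String firstDoubledWord(String input) {
        Matcher m = DOUBLED_WORD.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("this is is a test"));           // is
        System.out.println(QUOTED.matcher("1+1=2").matches());               // true
        System.out.println(QUOTED.matcher("11=2").matches());                // false
        System.out.println(WHOLE_INPUT_DIGITS.matcher("abc\n456").find());   // false
        System.out.println(LINE_OF_DIGITS.matcher("abc\n456").find());       // true
    }
}
```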

Character sets

The following table lists the syntax elements allowed inside a character
set:

Element	Description
[a]	The character set consisting of the letter 'a' only.
[xyz]	The character set consisting of the letters 'x', 'y', and 'z', described by explicit enumeration.
[x-z]	The character set consisting of the letters 'x', 'y', and 'z', described by means of a range.
[^xyz]	The character set consisting of everything but the letters 'x', 'y', and 'z'.
[[a-f][0-9]]	The character set formed by building the union of the two character sets [a-f] and [0-9].
[[a-z]&&[jkl]]	The character set formed by building the intersection of the two character sets [a-z] and [jkl]. You can also use a single '&', but this regular expression might not be portable.
[[a-z]--[jkl]]	The character set formed by building the difference of the two character sets [a-z] and [jkl]. You can also use a single '-'. This operator is generally not portable.
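
The set operations from the table above might be used as in the following sketch. The class name CharSetDemo is hypothetical. Because the difference operator is described as not portable, the example expresses a difference as an intersection with a negated set, which is an assumption about what works across implementations rather than a statement about ICU.

```java
import java.util.regex.Pattern;

final class CharSetDemo {
    // Union of [a-f] and [0-9].
    static final Pattern UNION = Pattern.compile("[[a-f][0-9]]+");

    // Intersection of [a-z] and [jkl], which leaves only 'j', 'k', and 'l'.
    static final Pattern INTERSECTION = Pattern.compile("[[a-z]&&[jkl]]+");

    // Everything except 'x', 'y', and 'z'.
    static final Pattern NEGATION = Pattern.compile("[^xyz]+");

    // [a-z] without 'j', 'k', 'l', written as an intersection with a negated set.
    static final Pattern DIFFERENCE = Pattern.compile("[a-z&&[^jkl]]+");

    public static void main(String[] args) {
        System.out.println(UNION.matcher("cafe42").matches());     // true
        System.out.println(INTERSECTION.matcher("abc").matches()); // false
        System.out.println(NEGATION.matcher("axc").matches());     // false
        System.out.println(DIFFERENCE.matcher("abc").matches());   // true
        System.out.println(DIFFERENCE.matcher("jab").matches());   // false
    }
}
```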

A couple of frequently used character sets are predefined and named.
These can be referenced by name, but otherwise behave like explicit
character sets. The following table lists them:

Character set	Description
\d, \D	The set consisting of all digit characters (\d) or the opposite of it (\D).
\s, \S	The set consisting of all space characters (\s) or the opposite of it (\S).
\w, \W	The set consisting of all word characters (\w) or the opposite of it (\W).
\X	The set of all grapheme clusters.
\p{NAME}, \P{NAME}	The POSIX set with the specified NAME (\p{}) or the opposite of it (\P{}). Legal values for NAME are 'Alnum', 'Alpha', 'ASCII', 'Blank', 'Cntrl', 'Digit', 'Graph', 'Lower', 'Print', 'Punct', 'Upper', 'XDigit'.
\p{inBLOCK}, \P{inBLOCK}	The character set equivalent to the given Unicode BLOCK (\p{}) or the opposite of it (\P{}). An example of a legal BLOCK name is 'Hebrew', meaning, unsurprisingly, all Hebrew characters.
\p{CATEGORY}, \P{CATEGORY}	The character set equivalent to the Unicode CATEGORY (\p{}) or the opposite of it (\P{}). An example of a legal CATEGORY name is 'Lu', meaning all uppercase letters.
\p{javaMETHOD}, \P{javaMETHOD}	The character set equivalent to the isMETHOD() operation of the Java Character class (\p{}) or the opposite of it (\P{}).
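
The predefined sets listed above can be used wherever an explicit character set would fit. The following sketch uses a hypothetical class NamedSetDemo; the set names themselves (\d, \s, \p{Alpha}, \p{Lu}, \p{javaLowerCase}) are taken from the table and are also accepted by the standard java.util.regex package.

```java
import java.util.regex.Pattern;

final class NamedSetDemo {
    // One or more digit characters.
    static final Pattern DIGITS = Pattern.compile("\\d+");
    // One or more characters from the POSIX set 'Alpha'.
    static final Pattern POSIX_ALPHA = Pattern.compile("\\p{Alpha}+");
    // One or more characters from the Unicode category 'Lu' (uppercase letters).
    static final Pattern UPPERCASE = Pattern.compile("\\p{Lu}+");
    // One or more characters for which Character.isLowerCase() returns true.
    static final Pattern JAVA_LOWER = Pattern.compile("\\p{javaLowerCase}+");

    // Splits the input at runs of space characters.
    static String[] words(String input) {
        return input.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(DIGITS.matcher("2010").matches());        // true
        System.out.println(POSIX_ALPHA.matcher("ab1").matches());    // false
        System.out.println(UPPERCASE.matcher("ABC").matches());      // true
        System.out.println(JAVA_LOWER.matcher("aBc").matches());     // false
        System.out.println(String.join(",", words("a  b\tc")));      // a,b,c
    }
}
```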

Operators

The following table lists the operators understood inside regular
expressions:

Operator	Description
|	Alternation. A|B matches either A or B.
*	Match 0 or more times. Match as many times as possible.
+	Match 1 or more times. Match as many times as possible.
?	Match zero or one times. Prefer one.
{n}	Match exactly n times.
{n,}	Match at least n times. Match as many times as possible.
{n,m}	Match between n and m times. Match as many times as possible, but not more than m.
*?	Match 0 or more times. Match as few times as possible.
+?	Match 1 or more times. Match as few times as possible.
??	Match zero or one times. Prefer zero.
{n}?	Match exactly n times.
{n,}?	Match at least n times, but no more than required for an overall pattern match.
{n,m}?	Match between n and m times. Match as few times as possible, but not fewer than n.
*+	Match 0 or more times. Match as many times as possible when first encountered; do not retry with fewer even if the overall match fails (possessive match).
++	Match 1 or more times. Possessive match.
?+	Match zero or one times. Possessive match.
{n}+	Match exactly n times.
{n,}+	Match at least n times. Possessive Match.
{n,m}+	Match between n and m times. Possessive Match.
( ... )	Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
(?: ... )	Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
(?> ... )	Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>".
(?# ... )	Free-format comment (?# comment ).
(?= ... )	Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?! ... )	Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<= ... )	Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?<! ... )	Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?ismwx-ismwx: ... )	Flag settings. Evaluate the parenthesized expression with the specified flags enabled or disabled; flags listed after the '-' are disabled.
(?ismwx-ismwx)	Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.
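
The difference between the greedy, reluctant, and possessive forms of an operator, as well as the look-around assertions and flag settings, is easiest to see on small inputs. The following is a sketch under the assumption of standard java.util.regex behavior; the class OperatorDemo and its helper firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OperatorDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns what the given capturing group matched, or null if there is no match.
    static String group(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        // Greedy: takes as much as possible, then gives back as needed.
        System.out.println(firstMatch("\\[.+\\]", "[a][b]"));        // [a][b]
        // Reluctant: takes as little as possible.
        System.out.println(firstMatch("\\[.+?\\]", "[a][b]"));       // [a]
        // Possessive: takes everything and never gives back, so ']' cannot match.
        System.out.println(firstMatch("\\[.++\\]", "[a][b]"));       // null
        // Alternation and capturing parentheses.
        System.out.println(firstMatch("cat|dog", "hot dog"));         // dog
        System.out.println(group("(\\d+)-(\\d+)", "10-20", 2));       // 20
        // Look-ahead and look-behind do not consume input.
        System.out.println(firstMatch("\\d+(?= USD)", "pay 100 USD")); // 100
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost $42"));    // 42
        // Flag setting: case-insensitive match.
        System.out.println(firstMatch("(?i)hello", "Say HELLO"));     // HELLO
    }
}
```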

Implementation notes

The regular expression implementation used in Android is provided by ICU.
The notation for the regular expressions is mostly a superset of those used
in other Java language implementations. This means that existing
applications will normally work as expected, but in rare cases some regular
expression content that is meant to be literal might be interpreted with a
special meaning. The most notable examples are the single '&', which can
also be used as the intersection operator for character sets, and the
difference operators '-' and '--'. Also, some of the flags are handled in a
slightly different way.

November 28, 2010
