November 28, 2010

RegexDroid help

Regular expressions

In computing, regular expressions, also referred to as regex or regexp, provide a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.


The following examples illustrate a few specifications that could be expressed in a regular expression:
  • The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
  • The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
  • The word "car" when it appears as an isolated word
  • The word "car" when preceded by the word "blue" or "red"
  • The word "car" when not preceded by the word "motor"
  • A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$10" or "$245.99").
Regular expressions can be much more complex than these examples.
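The specifications listed above could be written roughly as follows. This is a minimal sketch that assumes the standard java.util.regex API; the class name CarPatterns, its constants, and the helper methods are hypothetical names chosen for illustration, and other regex dialects may need slightly different syntax.

```java
import java.util.regex.Pattern;

final class CarPatterns {
    // "car" appearing consecutively in any context.
    static final Pattern ANYWHERE = Pattern.compile("car");
    // 'c', 'a', 'r' in that order, with any characters in between.
    static final Pattern SCATTERED = Pattern.compile("c.*a.*r");
    // "car" as an isolated word.
    static final Pattern WORD = Pattern.compile("\\bcar\\b");
    // "car" preceded by the word "blue" or "red".
    static final Pattern BLUE_OR_RED = Pattern.compile("\\b(blue|red) car\\b");
    // "car" not preceded by the word "motor" (negative look-behind).
    static final Pattern NOT_MOTOR = Pattern.compile("(?<!\\bmotor )\\bcar\\b");
    // Dollar sign, one or more digits, optionally a period and exactly two digits.
    static final Pattern DOLLARS = Pattern.compile("\\$[0-9]+(\\.[0-9]{2})?");

    /** True if the pattern matches somewhere inside the text. */
    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    /** True if the pattern matches the entire text. */
    static boolean wholeMatch(Pattern p, String text) {
        return p.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(found(ANYWHERE, "bicarbonate"));      // true
        System.out.println(found(SCATTERED, "Icelander"));       // true
        System.out.println(found(WORD, "cartoon"));              // false
        System.out.println(found(BLUE_OR_RED, "the red car"));   // true
        System.out.println(found(NOT_MOTOR, "a motor car"));     // false
        System.out.println(wholeMatch(DOLLARS, "$245.99"));      // true
    }
}
```

Note that find() looks for a match anywhere in the input, while matches() requires the whole input to match, which is why the dollar example uses the latter.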


A regular expression consists of literal text, meta characters, character
sets, and operators. The latter three have a special meaning when
encountered during the processing of a pattern.


  • Meta characters are a special means to describe
    single characters in the input text. A common example for a meta
    character is the dot '.', which, when used in a regular expression,
    matches any character.

  • Character sets are a convenient means to
    describe different characters that match a single character in the
    input. Character sets are enclosed in square brackets '[' and ']'
    and use the dash '-' for forming ranges. A typical example is
    "[0-9a-fA-F]", which describes the set of all hexadecimal digits.

  • Operators modify or combine whole regular
    expressions, with the result being a regular expression again. An
    example for an operator is the asterisk '*', which, together with the
    regular expression preceding it, matches zero or more repetitions of
    that regular expression. The plus sign '+' is similar, but requires at
    least one occurrence.
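The three kinds of building blocks described above can be tried out with a few one-line checks. The sketch below assumes the standard java.util.regex API; BuildingBlocks and its matches helper are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

final class BuildingBlocks {
    /** True if the regular expression matches the entire input. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Meta character: the dot stands for any single character.
        System.out.println(matches("c.t", "cat"));          // true
        System.out.println(matches("c.t", "ct"));           // false

        // Character set: exactly one hexadecimal digit.
        System.out.println(matches("[0-9a-fA-F]", "b"));    // true
        System.out.println(matches("[0-9a-fA-F]", "g"));    // false

        // Operators: '*' allows zero repetitions, '+' needs at least one.
        System.out.println(matches("ab*", "a"));            // true
        System.out.println(matches("ab+", "a"));            // false
        System.out.println(matches("ab+", "abbb"));         // true
    }
}
```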

Meta characters, the '[' and ']' that form a character set, and operators
normally lose their special meaning when preceded by a backslash '\'. To get
a backslash by itself, use a double backslash. Note that when using regular
expressions in Java source code, some care has to be taken to get the
backslashes right (due to yet another level of escaping being necessary for
Java).
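The double level of escaping is easiest to see side by side. In the sketch below (assuming the standard java.util.regex API; the class and method names are hypothetical), each regex backslash is written as two backslashes in the Java string literal, and Pattern.quote is shown as one way to avoid manual escaping when a whole string is meant literally.

```java
import java.util.regex.Pattern;

final class Escaping {
    /** True if the regular expression matches the entire input. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** Matches the input against the given text taken literally. */
    static boolean matchesLiterally(String literal, String input) {
        // Pattern.quote wraps the text so that no character keeps a special meaning.
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unescaped dot: matches any character between "3" and "14".
        System.out.println(matches("3.14", "3x14"));        // true

        // Regex 3\.14 is written "3\\.14" in Java source: the dot is literal.
        System.out.println(matches("3\\.14", "3x14"));      // false
        System.out.println(matches("3\\.14", "3.14"));      // true

        // Regex a\\b (a literal backslash) needs four backslashes in Java source.
        System.out.println(matches("a\\\\b", "a\\b"));      // true

        // Quoting a whole string instead of escaping each character.
        System.out.println(matchesLiterally("1+1", "1+1")); // true
    }
}
```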



The following table gives some basic examples of regular expressions and
input strings that match them:





Regular expression        Matched string(s)

"Hello, World!"           "Hello, World!"
"Hello, World."           "Hello, World!", "Hello, World?"
"Hello, .*d!"             "Hello, World!", "Hello, Android!", "Hello, Dad!"
"[0-9]+ green bottles"    "0 green bottles", "25 green bottles", "1234 green bottles"

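The basic examples listed above can be checked mechanically. The following sketch assumes the standard java.util.regex API; TableCheck and matchesAll are hypothetical names introduced for this illustration.

```java
import java.util.regex.Pattern;

final class TableCheck {
    /** True if every input is matched in full by the regular expression. */
    static boolean matchesAll(String regex, String... inputs) {
        Pattern p = Pattern.compile(regex);
        for (String input : inputs) {
            if (!p.matcher(input).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesAll("Hello, World!", "Hello, World!"));
        System.out.println(matchesAll("Hello, World.", "Hello, World!", "Hello, World?"));
        System.out.println(matchesAll("Hello, .*d!",
                "Hello, World!", "Hello, Android!", "Hello, Dad!"));
        System.out.println(matchesAll("[0-9]+ green bottles",
                "0 green bottles", "25 green bottles", "1234 green bottles"));
        // A counter-example: '+' requires at least one digit.
        System.out.println(matchesAll("[0-9]+ green bottles", " green bottles")); // false
    }
}
```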

The following sections describe the various features in detail. There are also
some implementation notes at the end.





Meta characters



The following table lists the meta characters understood in regular
expressions.



Meta character

Description

\a

Match a BELL, \u0007.

\A

Match at the beginning of the input. Differs from ^ in that
\A will not match after a new line within the input.

\b, outside of a character set


Match if the current position is a word boundary. Boundaries
occur at the transitions between word (\w) and non-word (\W)
characters, with combining marks ignored.

\b, within a character set


Match a BACKSPACE, \u0008.

\B


Match if the current position is not a word boundary.

\cX

Match a control-X character (replace X with actual character).

\e

Match an ESCAPE, \u001B.

\E

Ends quoting started by \Q. Meta characters, character classes, and
operators become active again.

\f

Match a FORM FEED, \u000C.

\G

Match if the current position is at the end of the previous
match.

\n

Match a LINE FEED, \u000A.

\N{UNICODE CHARACTER NAME}

Match the named Unicode character.

\Q

Quotes all following characters until \E. The following text is
treated as literal.

\r

Match a CARRIAGE RETURN, \u000D.

\t

Match a HORIZONTAL TABULATION, \u0009.

\uhhhh

Match the character with the hex value hhhh.

\Uhhhhhhhh

Match the character with the hex value hhhhhhhh. Exactly
eight hex digits must be provided, even though the largest Unicode
code point is \U0010ffff.

\x{hhhh}

Match the character with the hex value hhhh. From one to six hex
digits may be supplied.

\xhh

Match the character with the hex value hh.

\Z

Match if the current position is at the end of input, but
before the final line terminator, if one exists.

\z

Match if the current position is at the end of input.

\0n, \0nn, \0nnn

Match the character with the octal value n, nn, or nnn. Maximum
value is 0377.

\n (where n is a number)

Back reference. Match whatever the nth capturing group
matched. n must be a number >= 1 and <= the total number of capture
groups in the pattern. Note: Octal escapes, such as \012, are not
supported in ICU regular expressions.

[character set]

Match any one character from the character set. See
character sets for a full description of what
may appear between the square brackets.

.

Match any character.

^

Match at the beginning of a line.

$

Match at the end of a line.

\

Quotes the following character, so that it loses any special
meaning it might have.
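A few of the meta characters above are easier to tell apart with concrete inputs. The sketch below assumes the standard java.util.regex API and uses only constructs from the table; MetaCharacterExamples and its helpers are hypothetical names. The (?m) prefix switches on multi-line mode so that ^ also matches after a line break.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharacterExamples {
    /** Counts how many times the regular expression is found in the input. */
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** True if the regular expression is found somewhere in the input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // ^ matches at every line start in multi-line mode, \A only at the input start.
        System.out.println(count("(?m)^a", "a\na"));        // 2
        System.out.println(count("(?m)\\Aa", "a\na"));      // 1

        // \Z tolerates a final line terminator, \z does not.
        System.out.println(found("end\\Z", "end\n"));       // true
        System.out.println(found("end\\z", "end\n"));       // false

        // \G ties each match to the end of the previous one.
        System.out.println(count("\\G\\d", "12a3"));        // 2

        // \Q ... \E quotes the enclosed text, so '+' is literal here.
        System.out.println(Pattern.matches("\\Q1+1\\E=2", "1+1=2")); // true

        // Hex escapes, and a back reference to the first capturing group.
        System.out.println(Pattern.matches("\\x41\\u0041", "AA"));   // true
        System.out.println(found("(\\w)\\1", "hello"));              // true ("ll")
    }
}
```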






Character sets


The following table lists the syntax elements allowed inside a character
set:




Element

Description

[a]

The character set consisting of the letter 'a' only.


[xyz]

The character set consisting of the letters 'x', 'y', and 'z',
described by explicit enumeration.

[x-z]

The character set consisting of the letters 'x', 'y', and 'z',
described by means of a range.

[^xyz]

The character set consisting of everything but the letters 'x', 'y',
and 'z'.

[[a-f][0-9]]

The character set formed by building the union of the two character
sets [a-f] and [0-9].

[[a-z]&&[jkl]]

The character set formed by building the intersection of the two
character sets [a-z] and [jkl]. You can also use a single '&', but
this regular expression might not be portable.

[[a-z]--[jkl]]

The character set formed by building the difference of the two
character sets [a-z] and [jkl]. You can also use a single '-'. This
operator is generally not portable.
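Union, intersection, and negation can be checked as in the sketch below, which assumes the standard java.util.regex API (CharacterSetExamples and matches are hypothetical names). Because the text notes that the difference operator is generally not portable, the last example expresses the same difference as an intersection with a negated set instead; that substitution is a choice made for this illustration, not something the table prescribes.

```java
import java.util.regex.Pattern;

final class CharacterSetExamples {
    /** True if the regular expression matches the entire input. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Negation: anything except 'x', 'y', and 'z'.
        System.out.println(matches("[^xyz]", "a"));             // true
        System.out.println(matches("[^xyz]", "x"));             // false

        // Union of [a-f] and [0-9].
        System.out.println(matches("[[a-f][0-9]]", "7"));       // true
        System.out.println(matches("[[a-f][0-9]]", "z"));       // false

        // Intersection of [a-z] and [jkl].
        System.out.println(matches("[[a-z]&&[jkl]]", "k"));     // true
        System.out.println(matches("[[a-z]&&[jkl]]", "a"));     // false

        // Difference [a-z] minus [jkl], written as an intersection with [^jkl].
        System.out.println(matches("[a-z&&[^jkl]]", "a"));      // true
        System.out.println(matches("[a-z&&[^jkl]]", "k"));      // false
    }
}
```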


Several frequently used character sets are predefined and named.
They are referenced by name but otherwise behave like explicit
character sets. The following table lists them:





Character set

Description

\d, \D

The set consisting of all digit characters (\d) or the opposite of
it (\D).

\s, \S

The set consisting of all space characters (\s) or the opposite of
it (\S).

\w, \W

The set consisting of all word characters (\w) or the opposite
of it (\W).

\X

The set of all grapheme clusters.

\p{NAME}, \P{NAME}

The POSIX set with the specified NAME (\p{}) or the opposite
of it (\P{}). Legal values for NAME are 'Alnum', 'Alpha', 'ASCII',
'Blank', 'Cntrl', 'Digit', 'Graph', 'Lower', 'Print', 'Punct',
'Upper', and 'XDigit'.

\p{inBLOCK}, \P{inBLOCK}

The character set equivalent to the given Unicode BLOCK (\p{}) or
the opposite of it (\P{}). An example for a legal BLOCK name is
'Hebrew', meaning, unsurprisingly, all Hebrew characters.

\p{CATEGORY}, \P{CATEGORY}

The character set equivalent to the Unicode CATEGORY (\p{}) or the
opposite of it (\P{}). An example for a legal CATEGORY name is 'Lu',
meaning all uppercase letters.

\p{javaMETHOD}, \P{javaMETHOD}

The character set equivalent to the isMETHOD() operation of the Java Character class (\p{}) or the opposite of it (\P{}).
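The predefined sets can be used wherever an explicit character set could appear. The sketch below assumes the standard java.util.regex API and sticks to ASCII input, since the exact membership of sets such as \w may differ between implementations; PredefinedSetExamples and matches are hypothetical names.

```java
import java.util.regex.Pattern;

final class PredefinedSetExamples {
    /** True if the regular expression matches the entire input. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Digits and their opposite.
        System.out.println(matches("\\d+", "2010"));                 // true
        System.out.println(matches("\\D+", "2010"));                 // false

        // Space and word characters.
        System.out.println(matches("\\w+\\s\\w+", "regex help"));    // true
        System.out.println(matches("\\W", "!"));                     // true

        // POSIX set by name.
        System.out.println(matches("\\p{Alpha}+", "abc"));           // true

        // Unicode category 'Lu' (uppercase letters) and its opposite.
        System.out.println(matches("\\p{Lu}", "A"));                 // true
        System.out.println(matches("\\P{Lu}", "A"));                 // false

        // Set backed by Character.isLowerCase().
        System.out.println(matches("\\p{javaLowerCase}+", "abc"));   // true
        System.out.println(matches("\\p{javaLowerCase}+", "ABC"));   // false
    }
}
```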




Operators


The following table lists the operators understood inside regular
expressions:







Operator

Description

|

Alternation. A|B matches either A or B.

*

Match 0 or more times. Match as many times as possible.

+

Match 1 or more times. Match as many times as possible.

?

Match zero or one times. Prefer one.

{n}

Match exactly n times.

{n,}

Match at least n times. Match as many times as possible.

{n,m}

Match between n and m times. Match as many times as possible,
but not more than m.

*?

Match 0 or more times. Match as few times as possible.

+?

Match 1 or more times. Match as few times as possible.

??

Match zero or one times. Prefer zero.

{n}?

Match exactly n times.

{n,}?

Match at least n times, but no more than required for an
overall pattern match.

{n,m}?

Match between n and m times. Match as few times as possible,
but not less than n.

*+

Match 0 or more times. Match as many times as possible when
first encountered; do not retry with fewer even if the overall match
fails (possessive match).

++

Match 1 or more times. Possessive match.

?+

Match zero or one times. Possessive match.

{n}+

Match exactly n times.

{n,}+

Match at least n times. Possessive Match.

{n,m}+

Match between n and m times. Possessive Match.

( ... )

Capturing parentheses. Range of input that matched the
parenthesized subexpression is available after the match.

(?: ... )

Non-capturing parentheses. Groups the included pattern, but
does not provide capturing of matching text. Somewhat more efficient
than capturing parentheses.

(?> ... )

Atomic-match parentheses. The first match of the parenthesized
subexpression is the only one tried; if it does not lead to an
overall pattern match, back up the search for a match to a position
before the "(?>".

(?# ... )


Free-format comment (?# comment ).

(?= ... )

Look-ahead assertion. True if the parenthesized pattern
matches at the current input position, but does not advance the
input position.

(?! ... )

Negative look-ahead assertion. True if the parenthesized
pattern does not match at the current input position. Does not
advance the input position.

(?<= ... )

Look-behind assertion. True if the parenthesized pattern
matches text preceding the current input position, with the last
character of the match being the input character just before the
current position. Does not alter the input position. The length of
possible strings matched by the look-behind pattern must not be
unbounded (no * or + operators).


(?<! ... )

Negative Look-behind assertion. True if the parenthesized
pattern does not match text preceding the current input position,
with the last character of the match being the input character just
before the current position. Does not alter the input position. The
length of possible strings matched by the look-behind pattern must
not be unbounded (no * or + operators).

(?ismwx-ismwx: ... )

Flag settings. Evaluate the parenthesized expression with the
specified flags enabled or, for flags listed after the '-', disabled.

(?ismwx-ismwx)

Flag settings. Change the flag settings. Changes apply to the
portion of the pattern following the setting. For example, (?i)
switches to case-insensitive matching.
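The difference between greedy, reluctant, and possessive operators, and the behavior of the various parentheses, is easiest to see on small inputs. The following sketch assumes the standard java.util.regex API; OperatorExamples, firstMatch, and capture are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OperatorExamples {
    /** Returns the first substring matched by the regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    /** Returns the given capture group if the regex matches the whole input, else null. */
    static String capture(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Greedy '+' takes as much as possible, reluctant '+?' as little as possible.
        System.out.println(firstMatch("<.+>", "<a><b>"));             // <a><b>
        System.out.println(firstMatch("<.+?>", "<a><b>"));            // <a>

        // Possessive '*+' keeps all the a's, so the trailing 'a' can never match.
        System.out.println(Pattern.matches("a*a", "aaa"));            // true
        System.out.println(Pattern.matches("a*+a", "aaa"));           // false

        // Atomic group behaves the same way; the non-capturing group backtracks.
        System.out.println(Pattern.matches("(?>a+)ab", "aaab"));      // false
        System.out.println(Pattern.matches("(?:a+)ab", "aaab"));      // true

        // Alternation and capturing parentheses.
        System.out.println(Pattern.matches("cat|dog", "dog"));        // true
        System.out.println(capture("(\\d+)-(\\d+)", "10-20", 2));     // 20

        // Look-ahead and look-behind do not consume input.
        System.out.println(firstMatch("\\w+(?=!)", "Hello, World!")); // World
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42"));  // 42

        // Flag setting for the rest of the pattern.
        System.out.println(Pattern.matches("(?i)hello", "HELLO"));    // true
    }
}
```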






Implementation notes


The regular expression implementation used in Android is provided by
ICU. The notation for the regular
expressions is mostly a superset of those used in other Java language
implementations. This means that existing applications will normally work as
expected, but in rare cases some regular expression content that is meant to
be literal might be interpreted with a special meaning. The most notable
examples are the single '&', which can also be used as the intersection
operator for character sets, and the difference
operators '-' and '--'. Also, some of the flags are handled in a
slightly different way:
