Difference between revisions of "Regular expression"
Latest revision as of 21:46, 9 March 2009
A regular expression (abbreviated as regexp, regex, or regxp, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns. Many programming languages support regular expressions for string manipulation. For example, Perl and Tcl have a powerful regular expression engine built directly into their syntax. The set of utilities (including the editor sed and the filter grep) provided by Unix distributions was the first to popularize the concept of regular expressions.
see also: Evolved regular expressions (EREs)
Character classes (with GNU/POSIX extensions)
[[:alnum:]]  -> [A-Za-z0-9]       # Alphanumeric characters
[[:alpha:]]  -> [A-Za-z]          # Alphabetic characters
[[:blank:]]  -> [ \x09]           # Space or tab characters only
[[:cntrl:]]  -> [\x00-\x1F\x7F]   # Control characters
[[:digit:]]  -> [0-9]             # Numeric characters
[[:graph:]]  -> [!-~]             # Printable and visible characters
[[:lower:]]  -> [a-z]             # Lower-case alphabetic characters
[[:print:]]  -> [ -~]             # Printable (non-Control) characters
[[:punct:]]  -> [!-/:-@[-`{-~]    # Punctuation characters
[[:space:]]  -> [ \t\n\r\v\f]     # All whitespace characters
[[:upper:]]  -> [A-Z]             # Upper-case alphabetic characters
[[:xdigit:]] -> [0-9a-fA-F]       # Hexadecimal digit characters (/[\dA-Fa-f]+/)
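Python's built-in re module does not recognise POSIX bracket classes such as [[:alpha:]], so one way to experiment with the table above is to expand each class name into its explicit range first. The following is a minimal sketch that assumes ASCII text (the C/POSIX locale); POSIX_CLASSES and expand_posix_classes are hypothetical names introduced only for this illustration.

```python
import re

# ASCII-only expansions of the POSIX classes (assumes the C/POSIX locale).
POSIX_CLASSES = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "blank": r" \t",
    "cntrl": r"\x00-\x1F\x7F",
    "digit": "0-9",
    "graph": "!-~",
    "lower": "a-z",
    "print": " -~",
    "punct": r"!-/:-@\[-`{-~",
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] inside a bracket expression with its ASCII range."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)

hex_word = re.compile(expand_posix_classes("[[:xdigit:]]+"))
print(expand_posix_classes("[[:xdigit:]]+"))           # [0-9A-Fa-f]+
print(hex_word.fullmatch("DEADbeef42") is not None)    # True
print(hex_word.fullmatch("xyz") is not None)           # False
```

Engines with native POSIX class support (grep, sed, awk) need no such expansion, and in non-ASCII locales the classes may cover more characters than these ranges.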
Metacharacters
General quantifiers
*  -> {0,}   # match preceding item zero or more times
+  -> {1,}   # match preceding item one or more times
?  -> {0,1}  # match preceding item zero or one time (i.e. optional)
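The equivalences above can be checked mechanically. This small sketch uses Python's re module and made-up sample strings to confirm that each shorthand quantifier and its explicit {min,max} spelling accept the same inputs.

```python
import re

# Each shorthand quantifier paired with its explicit {min,max} form.
PAIRS = [("ab*c", "ab{0,}c"), ("ab+c", "ab{1,}c"), ("ab?c", "ab{0,1}c")]
SAMPLES = ["ac", "abc", "abbc", "abbbc"]

accepted = {}
for short, explicit in PAIRS:
    for s in SAMPLES:
        # Both spellings must agree on every sample string.
        assert bool(re.fullmatch(short, s)) == bool(re.fullmatch(explicit, s))
    accepted[short] = [s for s in SAMPLES if re.fullmatch(short, s)]
    print(short, accepted[short])
```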
Examples
Character literals
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
regex  : /Mary/
matches: Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
"Escaped" character literals
regex  : /.*/
matches: Special characters must be escaped.*
regex  : /\.\*/
matches: Special characters must be escaped.*
Positional special characters
regex  : /^Mary/
matches: Mary had a little lamb.
         And everywhere that Mary
         went, the lamb was sure
         to go.
regex  : /Mary$/
matches: Mary had a little lamb.
         And everywhere that Mary
         went, the lamb was sure
         to go.
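The positional anchors only make sense relative to line boundaries. The sketch below assumes the rhyme is laid out on separate lines, with a line break right after the second "Mary", and uses Python's re.MULTILINE flag so that ^ and $ anchor at every line rather than only at the ends of the whole string.

```python
import re

# Assumed layout: one line break after the second "Mary".
RHYME = (
    "Mary had a little lamb.\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go."
)

# With re.MULTILINE, ^ and $ match at the start and end of each line.
starts = [m.start() for m in re.finditer(r"^Mary", RHYME, re.MULTILINE)]
ends = [m.start() for m in re.finditer(r"Mary$", RHYME, re.MULTILINE)]

print(starts)                        # only the "Mary" that begins line 1
print(ends)                          # only the "Mary" that ends line 2
print(re.findall(r"Mary$", RHYME))   # without the flag, $ only matches at the very end
```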
The "wildcard" character
/.a/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
Grouping regular expressions
/(Mary)( )(had)/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
Character classes
/[a-z]a/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
Complement operator
/[^a-z]a/
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
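The wildcard, character-class, and complement patterns above differ only in what they allow before the letter "a". A quick way to see the difference, sketched here with Python's re.findall on the same rhyme:

```python
import re

TEXT = "Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go."

any_then_a = re.findall(r".a", TEXT)          # any character followed by 'a'
lower_then_a = re.findall(r"[a-z]a", TEXT)    # a lower-case letter followed by 'a'
other_then_a = re.findall(r"[^a-z]a", TEXT)   # anything except a lower-case letter, then 'a'

print(any_then_a)
print(lower_then_a)
print(other_then_a)
```

Every match of /.a/ falls into exactly one of the other two lists, since a character either is or is not a lower-case letter.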
Alternation of patterns
/cat|dog|bird/
The pet store sold cats, dogs, and birds.
/=first|second=/
=first first= # =second second= # =first= # =second=
/(=)(first)|(second)(=)/
=first first= # =second second= # =first= # =second=
/=(first|second)=/
=first first= # =second second= # =first= # =second=
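The three patterns above differ only in how far the | operator reaches. As a rough illustration in Python, ungrouped alternation splits the whole pattern into "=first" or "second=", whereas grouping confines the choice to the word between the two equals signs.

```python
import re

TEXT = "=first first= # =second second= # =first= # =second="

# Ungrouped: the alternatives are "=first" and "second=".
ungrouped = [m.group(0) for m in re.finditer(r"=first|second=", TEXT)]

# Grouped: "=" then either word then "=".
grouped = [m.group(0) for m in re.finditer(r"=(first|second)=", TEXT)]

print(ungrouped)
print(grouped)
```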
The basic abstract quantifier
/@(=\+=)*@/
Match with zero in the middle: @@
Subexpression occurs, but...: @=+=ABC@
Lots of occurrences: @=+==+==+==+==+=@
Must repeat entire pattern: @=+==+=+==+=@
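In engines where + is itself a quantifier (Perl, Python, egrep), the plus sign inside the repeated group has to be escaped to be taken literally. A minimal Python sketch of the four cases listed above:

```python
import re

# The '+' is escaped so it matches a literal plus sign.
PATTERN = re.compile(r"@(=\+=)*@")

SAMPLES = ["@@", "@=+=ABC@", "@=+==+==+==+==+=@", "@=+==+=+==+=@"]
results = {s: bool(PATTERN.search(s)) for s in SAMPLES}

for sample, matched in results.items():
    print(f"{sample!r:22} {matched}")
```

Only the first and third samples match: the group may repeat zero or more times, but each repetition must be the complete sequence "=+=".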
Matching Patterns in Text: Intermediate
More abstract quantifiers
/A+B*C?D/
AAAD
ABBBBCD
BBBCD
ABCCD
AAABBBC
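To see which of the sample strings satisfy "one or more A, zero or more B, an optional C, then D", the pattern can be run over them directly. A short Python sketch:

```python
import re

SAMPLES = ["AAAD", "ABBBBCD", "BBBCD", "ABCCD", "AAABBBC"]

# Keep only the strings in which the pattern is found.
matched = [s for s in SAMPLES if re.search(r"A+B*C?D", s)]
print(matched)  # ['AAAD', 'ABBBBCD']
```

"BBBCD" has no A, "ABCCD" has two Cs where at most one is allowed, and "AAABBBC" has no D.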
Numeric quantifiers
/a{5} b{,6} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
/a+ b{3,} c?/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
/a{5} b{6,} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
Backreferences
/(abc|xyz) \1/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
/(abc|xyz) (abc|xyz)/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
Don't match more than you want to
/th.*s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this thus thistle
this line matches too much
Tricks for restraining matches
/th[^s]*./
-- I want to match the words that start
-- with 'th' and end with 's'.
this thus thistle
this line matches too much
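The effect of swapping .* for [^s]* can be sketched in Python on one of the sample lines. The greedy version runs to the last "s" on the line, while the restrained version cannot pass an "s" and therefore stops at the first one.

```python
import re

LINE = "this thus thistle"

greedy = re.findall(r"th.*s", LINE)          # .* grabs as much as possible
restrained = re.findall(r"th[^s]*.", LINE)   # [^s]* cannot run past an 's'

print(greedy)      # ['this thus this']
print(restrained)  # ['this', 'thus', 'this']
```

Note that the restrained pattern still picks up the "this" inside "thistle"; it limits the length of each match, not where a match may occur.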
A literal-string modification example
s/cat/dog/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild dogs, bobdogs, lions, and other wild dogs.
A pattern-match modification example
s/cat|dog/snake/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild snakes, bobsnakes, lions, and other wild snakes.
s/[a-z]+i[a-z]*/nice/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had nice dogs, bobcats, nice, and other nice cats.
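The s/.../.../g substitutions above correspond to re.sub in Python, which replaces every match by default. A brief sketch reproducing the three examples:

```python
import re

ZOO = "The zoo had wild dogs, bobcats, lions, and other wild cats."

literal = re.sub(r"cat", "dog", ZOO)                 # s/cat/dog/g
alternation = re.sub(r"cat|dog", "snake", ZOO)       # s/cat|dog/snake/g
pattern = re.sub(r"[a-z]+i[a-z]*", "nice", ZOO)      # s/[a-z]+i[a-z]*/nice/g

print(literal)
print(alternation)
print(pattern)
```

The literal replacement has no notion of word boundaries, which is why "bobcats" becomes "bobdogs".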
Modification using backreferences
sed -r 's/([A-Z])([0-9]{2,4}) /\2:\1 /g' INPUT
INPUT : A37 B4 C107 D54112 E1103 XXX
OUTPUT: 37:A B4 107:C D54112 1103:E XXX
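The same rearrangement can be sketched with Python's re.sub, where \1 and \2 in the replacement refer to the first and second captured groups just as they do in sed.

```python
import re

INPUT = "A37 B4 C107 D54112 E1103 XXX"

# Swap the letter and the 2-4 digit number, but only when a space follows.
OUTPUT = re.sub(r"([A-Z])([0-9]{2,4}) ", r"\2:\1 ", INPUT)
print(OUTPUT)  # 37:A B4 107:C D54112 1103:E XXX
```

"B4" is skipped because it has only one digit, and "D54112" because no run of two to four digits after the letter is immediately followed by a space.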
Match a US telephone number
((\([2-9][0-9]{2}\))?\ ?|[2-9][0-9]{2}(?:\-?|\ ?))[2-9][0-9]{2}[- ]?[0-9]{4}
This regexp matches US telephone numbers in any of 15 formats:
(NPA) PRE-SUFF
(NPA) PRE SUFF
(NPA) PRESUFF
(NPA)PRE-SUFF
(NPA)PRE SUFF
(NPA)PRESUFF
NPA PRE-SUFF
NPA PRE SUFF
NPA PRESUFF
NPAPRE-SUFF
NPAPRE SUFF
NPAPRESUFF
PRE-SUFF
PRE SUFF
PRESUFF
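One way to try the telephone pattern is to run it against a handful of made-up numbers with re.fullmatch, so that the whole string has to fit. The sample numbers below are illustrative assumptions, not real data.

```python
import re

US_PHONE = re.compile(
    r"((\([2-9][0-9]{2}\))?\ ?|[2-9][0-9]{2}(?:\-?|\ ?))"  # optional area code (NPA)
    r"[2-9][0-9]{2}[- ]?[0-9]{4}"                          # prefix and suffix
)

SAMPLES = [
    "(415) 555-1234",   # (NPA) PRE-SUFF
    "(415)5551234",     # (NPA)PRESUFF
    "415 555 1234",     # NPA PRE SUFF
    "4155551234",       # NPAPRESUFF
    "555-1234",         # PRE-SUFF
    "155-1234",         # invalid: prefix may not start with 0 or 1
    "(415 555-1234",    # invalid: unbalanced parenthesis
]
results = {s: US_PHONE.fullmatch(s) is not None for s in SAMPLES}

for number, ok in results.items():
    print(f"{number:16} {ok}")
```

The pattern is somewhat more permissive than the list of 15 formats suggests; for example it also accepts a hyphen after an unparenthesised area code, as in "415-555-1234".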
Advanced Regular Expression Extensions
Non-greedy quantifiers
/th.*s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this line matches just right
this # thus # thistle
/th.*?s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this # thus # thistle
this line matches just right
/th.*?s /
-- I want to match the words that start
-- with 'th' and end with 's'. (FINALLY!)
this # thus # thistle
this line matches just right
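The three patterns above can be compared on one sample line. In this Python sketch the greedy quantifier takes everything up to the last "s", the non-greedy one stops at the first "s", and requiring a trailing space finally rules out the "this" inside "thistle".

```python
import re

LINE = "this # thus # thistle"

greedy = re.findall(r"th.*s", LINE)         # longest possible match
lazy = re.findall(r"th.*?s", LINE)          # shortest possible match
lazy_space = re.findall(r"th.*?s ", LINE)   # shortest match that ends in "s "

print(greedy)      # ['this # thus # this']
print(lazy)        # ['this', 'thus', 'this']
print(lazy_space)  # ['this ', 'thus ']
```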
Pattern-match modifiers
/M.*[ise] /
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
/M.*[ise] /i
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
/M.*[ise] /gis
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
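In Python the i and s modifiers correspond to the flags re.IGNORECASE and re.DOTALL, and the g modifier roughly corresponds to using re.findall or re.finditer instead of re.search. The sketch below applies the pattern first line by line and then to both lines joined by a newline.

```python
import re

LINE1 = "MAINE # Massachusetts # Colorado #"
LINE2 = "mississippi # Missouri # Minnesota #"
PATTERN = r"M.*[ise] "

plain1 = re.search(PATTERN, LINE1).group()
plain2 = re.search(PATTERN, LINE2).group()
nocase2 = re.search(PATTERN, LINE2, re.IGNORECASE).group()

# With DOTALL the dot also matches the newline, so one match can span both lines.
both = re.search(PATTERN, LINE1 + "\n" + LINE2, re.IGNORECASE | re.DOTALL).group()

print(repr(plain1))   # 'MAINE # Massachusetts '
print(repr(plain2))   # 'Missouri '
print(repr(nocase2))  # 'mississippi # Missouri '
print(repr(both))
```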
Changing back-reference behaviour
s/([A-Z])(?:-[a-z]{3}-)([0-9]*)/\1\2/g
< A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93
> A37 # B:abcd:142 # C66 # D93
Naming back-references
import re
txt = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
# Named groups (?P<name>...) are referenced as \g<name> in the replacement.
print(re.sub(r"(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)",
             r"\g<prefix>\g<id>", txt))
A37 # B:abcd:142 # C66 # D93
Lookahead assertions
s/([A-Z]-)(?=[a-z]{3})([a-z0-9]*)/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> xyz37A- # B-ab6142 # C-Wxy66 # qrs93D-
s/([A-Z]-)(?![a-z]{3})([a-z0-9]*)/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> A-xyz37 # ab6142B- # C-Wxy66 # D-qrs93
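A lookahead checks what follows without consuming it, so the text it inspects is still available to the next group. The Python sketch below makes one assumption of its own: the second group is written without a trailing space, so that the last field, which is not followed by a space, can also match.

```python
import re

TEXT = "A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93"

# Positive lookahead: swap only when three lower-case letters follow the prefix.
positive = re.sub(r"([A-Z]-)(?=[a-z]{3})([a-z0-9]*)", r"\2\1", TEXT)

# Negative lookahead: swap only when three lower-case letters do NOT follow.
negative = re.sub(r"([A-Z]-)(?![a-z]{3})([a-z0-9]*)", r"\2\1", TEXT)

print(positive)  # xyz37A- # B-ab6142 # C-Wxy66 # qrs93D-
print(negative)  # A-xyz37 # ab6142B- # C-Wxy66 # D-qrs93
```

In the negative case "C-" does pass the assertion, but the second group matches nothing because "W" is upper-case, so the field comes out unchanged.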
Making regular expressions more readable
/                   # identify URLs within a text file
[^="]               # do not match URLs in IMG tags like:
                    # <img src="http://mysite.com/mypic.png">
(http|ftp|gopher)   # make sure we find a resource type
:\/\/               # ...needs to be followed by colon-slash-slash
[^ \n\r]+           # anything other than space, newline, carriage return is in the URL
(?=[\s\.,])         # assert: followed by whitespace/period/comma
/x
The URL for my site is: http://mysite.com/mydoc.html. You might also enjoy ftp://yoursite.com/index.html for a good place to download files.
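Python offers the same facility through the re.VERBOSE flag, which ignores whitespace and #-comments inside the pattern. The following is a sketch rather than a robust URL matcher; it assumes the resource-type alternatives are grouped so that the colon-slash-slash applies to all three, and URL_RE is a name chosen only for this example.

```python
import re

URL_RE = re.compile(r"""
    [^="]                 # guard character: not '=' or a double quote (skips IMG tags)
    (?:http|ftp|gopher)   # resource type, grouped so '://' applies to every alternative
    ://                   # colon-slash-slash
    [^ \n\r]+             # run of characters other than space, newline, carriage return
    (?=[\s.,])            # assert: followed by whitespace, period, or comma
""", re.VERBOSE)

TEXT = ("The URL for my site is: http://mysite.com/mydoc.html. You might also enjoy "
        "ftp://yoursite.com/index.html for a good place to download files.")
IMG = '<img src="http://mysite.com/mypic.png">'

# Drop the guard character, and any trailing period or comma the greedy run picked up.
urls = [m.group(0)[1:].rstrip(".,") for m in URL_RE.finditer(TEXT)]
print(urls)
print(URL_RE.search(IMG))  # None: the URL is preceded by a double quote
```

Because the character run is greedy, a sentence-ending period directly after a URL is swallowed into the match; the rstrip call is a pragmatic workaround for that.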
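See also
- Pyparsing (http://pyparsing.wikispaces.com/)
External links
- Regular Expression Library (http://www.regexlib.com/): currently contains over 1000 expressions from contributors around the world.
- Regular-Expressions.info (http://www.regular-expressions.info/): one of the most comprehensive free regular expression tutorials on the net.
- PCRE - Perl Compatible Regular Expressions (http://www.pcre.org/)
  - perlre (http://perldoc.perl.org/perlre.html): Perl regular expressions
- Regular expressions and commands that use them (http://www.csm.astate.edu/~rossa/regular.html)
- Regular expressions (http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html): The Open Group Base Specifications Issue 6
- reWork: a regular expression workbench (http://osteele.com/tools/rework/)
- RegexPal (http://regexpal.com/): a JavaScript regular expression tester
- wikipedia:Regular expression
- RegexAdvice - forum (http://regexadvice.com/forums/) ("cc")
Other
- txt2re.com (http://www.txt2re.com/)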