Difference between revisions of "Regular expression"
Revision as of 07:13, 16 August 2006
A regular expression (abbreviated as regexp, regex, or regxp, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings according to certain syntax rules. Many text editors and utilities use regular expressions to search and manipulate bodies of text based on patterns, and many programming languages support them for string manipulation. For example, Perl and Tcl have powerful regular expression engines built directly into their syntax. The utilities provided by Unix distributions (including the editor sed and the filter grep) were the first to popularize the concept of regular expressions.
Contents
- 1 Examples
- 1.1 GNU/POSIX extensions to regular expressions
- 1.2 Character literals
- 1.3 "Escaped" character literals
- 1.4 Positional special characters
- 1.5 The "wildcard" character
- 1.6 Grouping regular expressions
- 1.7 Character classes
- 1.8 Complement operator
- 1.9 Alternation of patterns
- 1.10 The basic abstract quantifier
- 2 Matching Patterns in Text: Intermediate
- 3 Advanced Regular Expression Extensions
- 4 References
- 5 External links
Examples
GNU/POSIX extensions to regular expressions
[[:alnum:]]  - [A-Za-z0-9]      Alphanumeric characters
[[:alpha:]]  - [A-Za-z]         Alphabetic characters
[[:blank:]]  - [ \x09]          Space or tab characters only
[[:cntrl:]]  - [\x00-\x1F\x7F]  Control characters
[[:digit:]]  - [0-9]            Numeric characters
[[:graph:]]  - [!-~]            Printable and visible characters
[[:lower:]]  - [a-z]            Lower-case alphabetic characters
[[:print:]]  - [ -~]            Printable (non-control) characters
[[:punct:]]  - [!-/:-@[-`{-~]   Punctuation characters
[[:space:]]  - [ \t\n\r\f\v]    All whitespace characters
[[:upper:]]  - [A-Z]            Upper-case alphabetic characters
[[:xdigit:]] - [0-9a-fA-F]      Hexadecimal digit characters
Character literals
Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
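regex : /Mary/
matches: Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.
(Both occurrences of "Mary" match.)
"Escaped" character literals
regex : /.*/
matches: Special characters must be escaped.*
(Unescaped, the pattern matches the entire line.)
regex : /\.\*/
matches: Special characters must be escaped.*
(Escaped, the pattern matches only the literal ".*" at the end of the line.)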
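The difference between the two patterns above can be checked with a short sketch. It assumes Python's re module as the regex engine (the examples in this article are engine-neutral), and the variable names are illustrative only.

```python
import re

text = "Special characters must be escaped.*"

# Unescaped: '.' is any character and '*' repeats it, so the whole line matches.
greedy = re.search(r".*", text).group()

# Escaped: only the literal two-character sequence ".*" matches.
literal = re.search(r"\.\*", text).group()

# re.escape builds an escaped pattern from a literal string.
escaped_pattern = re.escape(".*")

print(repr(greedy))
print(repr(literal))
print(repr(re.search(escaped_pattern, text).group()))
```

Using re.escape is one way to avoid escaping each special character by hand when the text to find is known only at run time.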
Positional special characters
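regex : /^Mary/
matches: Mary had a little lamb.
         And everywhere that Mary
         went, the lamb was sure
         to go.
(Only the "Mary" that begins the first line matches.)
regex : /Mary$/
matches: Mary had a little lamb.
         And everywhere that Mary
         went, the lamb was sure
         to go.
(Only the "Mary" that ends the second line matches.)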
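A minimal sketch of the two anchors, assuming Python's re module and assuming the sample text is split over four lines. In Python the anchors apply per line only when the re.MULTILINE flag is given; other tools such as grep work line by line by default.

```python
import re

text = (
    "Mary had a little lamb.\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go."
)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
at_line_start = [m.start() for m in re.finditer(r"^Mary", text, re.MULTILINE)]
at_line_end = [m.start() for m in re.finditer(r"Mary$", text, re.MULTILINE)]

# Without the flag, $ matches only at the end of the whole text
# (or just before a final newline), so nothing is found here.
no_flag = re.search(r"Mary$", text)

print(at_line_start, at_line_end, no_flag)
```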
The "wildcard" character
/.a/
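Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
(Matches: "Ma", "ha", " a", "la" on the first line; "ha", "Ma" on the second; "la", "wa" on the third.)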
Grouping regular expressions
/(Mary)( )(had)/
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
(Matches "Mary had" at the start of the first line.)
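Grouping does not change what the pattern matches, but each parenthesized part can be retrieved separately. A small sketch, assuming Python's re module; the variable names are illustrative.

```python
import re

text = "Mary had a little lamb."

match = re.search(r"(Mary)( )(had)", text)

whole = match.group(0)   # the full match
parts = match.groups()   # one entry per parenthesized group

print(repr(whole))
print(parts)
```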
Character classes
/[a-z]a/
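Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
(Matches: "ha", "la" on the first line; "ha" on the second; "la", "wa" on the third.)
Complement operator
/[^a-z]a/
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
(Matches: "Ma", " a" on the first line; "Ma" on the second.)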
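The wildcard, character class, and complemented class can be compared side by side on the same text. This is a sketch assuming Python's re module and a four-line version of the sample text.

```python
import re

text = (
    "Mary had a little lamb.\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go."
)

# '.' accepts any character except a newline before the 'a'.
wildcard = re.findall(r".a", text)

# [a-z] accepts only a lower-case letter before the 'a'.
lower_then_a = re.findall(r"[a-z]a", text)

# [^a-z] accepts anything that is NOT a lower-case letter before the 'a'.
other_then_a = re.findall(r"[^a-z]a", text)

print(wildcard)
print(lower_then_a)
print(other_then_a)
```

The last two result lists together contain exactly the matches of the wildcard pattern, since every character is either inside or outside the class.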
Alternation of patterns
/cat|dog|bird/
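The pet store sold cats, dogs, and birds.
(Matches: "cat", "dog", "bird".)
/=first|second=/
=first first= # =second second= # =first= # =second=
(Matches: "=first", "second=", "=first", "second="; the alternatives are "=first" and "second=".)
/(=)(first)|(second)(=)/
=first first= # =second second= # =first= # =second=
(Matches the same four substrings; these parentheses group within each alternative only.)
/=(first|second)=/
=first first= # =second second= # =first= # =second=
(Matches only "=first=" and "=second=".)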
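How far an alternation reaches depends on grouping, which the following sketch illustrates. It assumes Python's re module; finditer is used so that the full match is reported even when the pattern contains a group.

```python
import re

text = "=first first= # =second second= # =first= # =second="

# Ungrouped: the alternatives are "=first" and "second=".
ungrouped = [m.group(0) for m in re.finditer(r"=first|second=", text)]

# Grouped: the alternatives are "first" and "second", each wrapped in "=".
grouped = [m.group(0) for m in re.finditer(r"=(first|second)=", text)]

print(ungrouped)
print(grouped)
```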
The basic abstract quantifier
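/@(=\+=)*@/
Match with zero in the middle: @@
Subexpression occurs, but...: @=+=ABC@
Lots of occurrences: @=+==+==+==+==+=@
Must repeat entire pattern: @=+==+=+==+=@
(Only the first and third lines match. The plus sign is escaped so that it is read as a literal character rather than as a quantifier.)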
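The four cases above can be tested mechanically. This sketch assumes Python's re module, where an unescaped "+" is a quantifier, so the literal plus sign in the subexpression is written as "\+".

```python
import re

# Zero or more repetitions of the literal subexpression "=+=" between two "@".
pattern = re.compile(r"@(=\+=)*@")

samples = [
    "@@",
    "@=+=ABC@",
    "@=+==+==+==+==+=@",
    "@=+==+=+==+=@",
]

results = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```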
Matching Patterns in Text: Intermediate
More abstract quantifiers
/A+B*C?D/
AAAD
ABBBBCD
BBBCD
ABCCD
AAABBBC
(Only AAAD and ABBBBCD match.)
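A short sketch of the three abstract quantifiers, assuming Python's re module: "+" means one or more, "*" zero or more, and "?" zero or one.

```python
import re

pattern = re.compile(r"A+B*C?D")

samples = ["AAAD", "ABBBBCD", "BBBCD", "ABCCD", "AAABBBC"]

# Keep only the samples in which the pattern is found.
matched = [s for s in samples if pattern.search(s)]

print(matched)
```

BBBCD fails because at least one A is required, ABCCD because at most one C is allowed, and AAABBBC because the final D is missing.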
Numeric quantifiers
/a{5} b{,6} c{4,8}/
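aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
(Only the first line matches.)
/a+ b{3,} c?/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
(Every line matches, up to and including the first "c".)
/a{5} b{6,} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
(Only the third line matches.)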
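The numeric quantifiers can be tried against the same three lines. This is a sketch assuming Python's re module, in which "{,6}" means "from zero to six"; other engines may not accept an omitted lower bound. The long run of "b" characters is built with string repetition, and its exact length of 14 is an assumption.

```python
import re

lines = [
    "aaaaa bbbbb ccccc",
    "aaa bbb ccc",
    "aaaaa " + "b" * 14 + " ccccc",
]

patterns = [
    r"a{5} b{,6} c{4,8}",
    r"a+ b{3,} c?",
    r"a{5} b{6,} c{4,8}",
]

# For each pattern, record the matched text per line (None if no match).
results = {}
for p in patterns:
    found = []
    for line in lines:
        m = re.search(p, line)
        found.append(m.group() if m else None)
    results[p] = found

for p, found in results.items():
    print(p, found)
```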
Backreferences
/(abc|xyz) \1/
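jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
(Matches only "abc abc" and "xyz xyz": \1 must repeat exactly what the group matched.)
/(abc|xyz) (abc|xyz)/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
(Matches on all four lines: each group chooses its alternative independently.)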
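The contrast between a backreference and a repeated group can be sketched as follows, assuming Python's re module, where \1 refers to the text matched by the first group.

```python
import re

lines = ["jkl abc xyz", "jkl xyz abc", "jkl abc abc", "jkl xyz xyz"]

# \1 must repeat exactly the text that group 1 matched.
with_backref = [l for l in lines if re.search(r"(abc|xyz) \1", l)]

# Repeating the group lets each occurrence pick its own alternative.
with_repeat = [l for l in lines if re.search(r"(abc|xyz) (abc|xyz)", l)]

print(with_backref)
print(with_repeat)
```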
Don't match more than you want to
/th.*s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much
Tricks for restraining matches
/th[^s]*./
-- I want to match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much
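The effect of replacing ".*" with a complemented class can be seen on the last sample line. A sketch assuming Python's re module; the variable names are illustrative.

```python
import re

line = "this line matches too much"

# Greedy '.*' runs to the LAST 's' on the line.
greedy = re.search(r"th.*s", line).group()

# '[^s]*' cannot cross an 's', so the match stops right after the first one.
restrained = re.search(r"th[^s]*.", line).group()

# The restrained pattern still finds "this" inside "thistle".
words = [re.search(r"th[^s]*.", w).group() for w in ["this", "thus", "thistle"]]

print(repr(greedy))
print(repr(restrained))
print(words)
```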
A literal-string modification example
s/cat/dog/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild dogs, bobdogs, lions, and other wild dogs.
A pattern-match modification example
s/cat|dog/snake/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild snakes, bobsnakes, lions, and other wild snakes.
s/[a-z]+i[a-z]*/nice/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had nice dogs, bobcats, nice, and other nice cats.
Modification using backreferences
sed -r 's/([A-Z])([0-9]{2,4}) /\2:\1 /g' INPUT
INPUT : A37 B4 C107 D54112 E1103 XXX
OUTPUT: 37:A B4 107:C D54112 1103:E XXX
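The substitutions in the last three sections can be reproduced outside sed. This sketch assumes Python's re module, where re.sub replaces every match by default (the equivalent of the "g" flag) and \1, \2 in the replacement refer to captured groups.

```python
import re

zoo = "The zoo had wild dogs, bobcats, lions, and other wild cats."

literal = re.sub(r"cat", "dog", zoo)              # s/cat/dog/g
alternation = re.sub(r"cat|dog", "snake", zoo)    # s/cat|dog/snake/g
pattern = re.sub(r"[a-z]+i[a-z]*", "nice", zoo)   # s/[a-z]+i[a-z]*/nice/g

# Swap a letter and the 2-to-4-digit number that follows it.
codes = "A37 B4 C107 D54112 E1103 XXX"
swapped = re.sub(r"([A-Z])([0-9]{2,4}) ", r"\2:\1 ", codes)

print(literal)
print(alternation)
print(pattern)
print(swapped)
```

B4 and D54112 are left alone because they have too few and too many digits, respectively, for the {2,4} quantifier followed by a space.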
Advanced Regular Expression Extensions
Non-greedy quantifiers
/th.*s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this line matches just right
this # thus # thistle
/th.*?s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this # thus # thistle
this line matches just right
/th.*?s /
-- I want to match the words that start
-- with 'th' and end with 's'. (FINALLY!)
this # thus # thistle
this line matches just right
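The three patterns above behave as follows on the "this # thus # thistle" line. A sketch assuming Python's re module, which supports the non-greedy "*?" quantifier.

```python
import re

text = "this # thus # thistle"

# Greedy: one long match that ends at the last 's' on the line.
greedy = re.findall(r"th.*s", text)

# Non-greedy: each match stops at the first 's' it can reach.
lazy = re.findall(r"th.*?s", text)

# Requiring a space after the 's' rejects the "this" inside "thistle".
lazy_with_space = re.findall(r"th.*?s ", text)

print(greedy)
print(lazy)
print(lazy_with_space)
```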
Pattern-match modifiers
/M.*[ise] /
MAINE # Massachusetts # Colorado # mississippi # Missouri # Minnesota #
/M.*[ise] /i
MAINE # Massachusetts # Colorado # mississippi # Missouri # Minnesota #
/M.*[ise] /gis
MAINE # Massachusetts # Colorado # mississippi # Missouri # Minnesota #
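The modifiers can be sketched with flags in Python's re module: re.IGNORECASE corresponds to "i" and re.DOTALL to "s", while findall already returns every match, as "g" does. The sample text is assumed here to be split over two lines after "Colorado #"; that line break is an assumption made so that the "s" modifier has a visible effect.

```python
import re

text = (
    "MAINE # Massachusetts # Colorado #\n"
    "mississippi # Missouri # Minnesota #"
)

pattern = r"M.*[ise] "

# Case-sensitive: a match must start with a capital 'M'.
plain = re.findall(pattern, text)

# Case-insensitive: "mississippi" can start a match too.
ignore_case = re.findall(pattern, text, re.IGNORECASE)

# With DOTALL, '.' also matches the newline, so one match spans both lines.
dot_all = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

print(plain)
print(ignore_case)
print(dot_all)
```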
Changing backreference behavior
s/([A-Z])(?:-[a-z]{3}-)([0-9]*)/\1\2/g
< A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93
> A37 # B:abcd:142 # C66 # D93
Naming backreferences
import re
txt = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
# Named groups are referenced in the replacement as \g<name>.
print(re.sub(r"(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)", r"\g<prefix>\g<id>", txt))
A37 # B:abcd:142 # C66 # D93
Lookahead assertions
s/([A-Z]-)(?=[a-z]{3})([a-z0-9]*)/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> xyz37A- # B-ab6142 # C-Wxy66 # qrs93D-
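s/([A-Z]-)(?![a-z]{3})([a-z0-9]*)/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> A-xyz37 # ab6142B- # C-Wxy66 # D-qrs93
(C-Wxy66 comes out unchanged: the negative lookahead succeeds, but [a-z0-9]* matches nothing before the capital "W", so there is nothing to swap.)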
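Positive and negative lookahead can be compared with the following sketch. It assumes Python's re module, and it writes the second group without a trailing space so that the last item on the line is also considered; that variation is an assumption of this example.

```python
import re

text = "A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93"

# (?=...) requires three lower-case letters right after the prefix.
positive = re.sub(r"([A-Z]-)(?=[a-z]{3})([a-z0-9]*)", r"\2\1", text)

# (?!...) requires that three lower-case letters do NOT follow the prefix.
negative = re.sub(r"([A-Z]-)(?![a-z]{3})([a-z0-9]*)", r"\2\1", text)

print(positive)
print(negative)
```

In the negative case "C-Wxy66" is textually unchanged under this engine: the lookahead succeeds, but the second group matches the empty string before the capital "W", so swapping the groups has no visible effect.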
Making regular expressions more readable
/                      # identify URLs within a text file
  [^="]                # do not match URLs in IMG tags like:
                       # <img src="http://mysite.com/mypic.png">
  (http|ftp|gopher)    # make sure we find a resource type
  :\/\/                # ...needs to be followed by colon-slash-slash
  [^ \n\r]+            # some stuff other than space, newline, return is in URL
  (?=[\s\.,])          # assert: followed by whitespace/period/comma
/
The URL for my site is: http://mysite.com/mydoc.html. You
might also enjoy ftp://yoursite.com/index.html for a good
place to download files.
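A commented pattern of this kind can be written in Python with the re.VERBOSE flag, which ignores whitespace outside character classes and treats "#" as the start of a comment. The sketch below is an adaptation, not the pattern above verbatim: the alternation is grouped, the URL is captured in a group, and the URL body is not allowed to end in a period or comma. All three changes are assumptions made for this example.

```python
import re

url_pattern = re.compile(r"""
    (?:^|[^="])            # start of text, or one character other than = and quote
    (                      # group 1: the URL itself
      (?:http|ftp|gopher)  # resource type
      ://                  # followed by colon-slash-slash
      [^ \n\r]*            # URL body: no space, newline, or carriage return
      [^ \n\r.,]           # last URL character is not a period or comma
    )
    (?=[\s.,])             # assert: followed by whitespace/period/comma
""", re.VERBOSE)

text = (
    '<img src="http://mysite.com/mypic.png"> '
    "The URL for my site is: http://mysite.com/mydoc.html. You\n"
    "might also enjoy ftp://yoursite.com/index.html for a good\n"
    "place to download files."
)

# With a single capturing group, findall returns the captured URLs.
urls = url_pattern.findall(text)
print(urls)
```

The URL inside the IMG tag is skipped because it is directly preceded by a quotation mark.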
References
- TCL/TK in a Nutshell (1999), Paul Raines & Jeff Tranter, O'Reilly, Cambridge, MA.
- Python Pocket Reference (1998), Mark Lutz, O'Reilly, Cambridge, MA.
- Mastering Regular Expressions (1997), Friedl, Jeffrey E. F., O'Reilly, Cambridge, MA.
- sed & awk (1997), Dale Dougherty & Arnold Robbins, O'Reilly, Cambridge, MA.
- A Practical Guide to Linux (1997), Mark G. Sobell, Addison Wesley, Reading, MA.
- Programming Perl (1996), Larry Wall, Tom Christiansen & Randal L. Schwartz, O'Reilly, Cambridge, MA.
External links
- Wikipedia article on Regular expression
- Regular Expression Library — currently contains over 1000 expressions from contributors around the world.
- Regular-Expressions.info — one of the most comprehensive, free regular expression tutorials on the net.