Difference between revisions of "Regular expression"

From Christoph's Personal Wiki
(Making regular expressions more readable)
(External links)
 
(24 intermediate revisions by the same user not shown)
Line 1: Line 1:
A '''regular expression''' (abbreviated as regexp, regex, or regxp, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns. Many programming languages support regular expressions for string manipulation. For example, Perl and Tcl have a powerful regular expression engine built directly into their syntax. The set of utilities (including the editor sed and the filter grep) provided by Unix distributions were the first to popularize the concept of regular expressions.
+
A '''regular expression''' (abbreviated as '''regexp''', '''regex''', or '''regxp''', with plural forms '''regexps''', '''regexes''', or '''regexen''') is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns. Many programming languages support regular expressions for string manipulation. For example, [[Perl]] and Tcl have a powerful regular expression engine built directly into their syntax. The set of utilities (including the editor [[Sed|sed]] and the filter [[Grep (command)|grep]]) provided by [[Linux|Unix]] distributions were the first to popularize the concept of regular expressions.
  
== Examples ==
+
see also: [[Evolved regular expressions]] (EREs)
  
 +
==Character classes (with GNU/POSIX extensions)==
 +
<pre>
 +
[[:alnum:]]  -> [A-Za-z0-9]    # Alphanumeric characters
 +
[[:alpha:]]  -> [A-Za-z]        # Alphabetic characters
 +
[[:blank:]]  -> [ \x09]        # Space or tab characters only
 +
[[:cntrl:]]  -> [\x00-\x1F\x7F] # Control characters
 +
[[:digit:]]  -> [0-9]          # Numeric characters
 +
[[:graph:]]  -> [!-~]          # Printable and visible characters
 +
[[:lower:]]  -> [a-z]          # Lower-case alphabetic characters
 +
[[:print:]]  -> [ -~]          # Printable (non-Control) characters
 +
[[:punct:]]  -> [!-/:-@[-`{-~]  # Punctuation characters
 +
[[:space:]]  -> [ \t\n\v\f\r]  # All whitespace characters
 +
[[:upper:]]  -> [A-Z]          # Upper-case alphabetic characters
 +
[[:xdigit:]] -> [0-9a-fA-F]    # Hexadecimal digit characters (/[\dA-Fa-f]+/)
 +
</pre>
 +
 +
==Metacharacters==
 +
===General quantifiers===
 +
* -> {0,}  # match preceding item zero or more times
 +
+ -> {1,}  # match preceding item one or more times
 +
? -> {0,1}  # match preceding item zero or one time (i.e. optional)
 +
 +
==Examples==
 
=== Character literals ===
 
=== Character literals ===
  
  M<font color="red"><b>a</b></font>ry h<font color="red"><b>a</b></font>d <font color="red"><b>a</b></font> little l<font color="red"><b>a</b></font>mb.
+
  M<font color="red">a</font>ry h<font color="red">a</font>d <font color="red">a</font> little l<font color="red">a</font>mb.
  And everywhere th<font color="red"><b>a</b></font>t M<font color="red"><b>a</b></font>ry went, the l<font color="red"><b>a</b></font>mb w<font color="red"><b>a</b></font>s sure to go.
+
  And everywhere th<font color="red">a</font>t M<font color="red">a</font>ry went, the l<font color="red">a</font>mb w<font color="red">a</font>s sure to go.
  
 
  regex  : <font color="blue">/Mary/</font>
 
  regex  : <font color="blue">/Mary/</font>
  matches: <font color="red"><b>Mary</b></font> had a little lamb. And everywhere that <font color="red"><b>Mary</b></font> went, the lamb was sure to go.
+
  matches: <font color="red">Mary</font> had a little lamb. And everywhere that <font color="red">Mary</font> went, the lamb was sure to go.
  
 
=== "Escaped" characters literals ===
 
=== "Escaped" characters literals ===
  
 
  regex  : <font color="blue">/.*/</font>
 
  regex  : <font color="blue">/.*/</font>
  matches: <font color="red"><b>Special characters must be escaped.*</b></font>
+
  matches: <font color="red">Special characters must be escaped.*</font>
  
 
  regex  : <font color="blue">/\.\*/</font>
 
  regex  : <font color="blue">/\.\*/</font>
  matches: Special characters must be escaped<font color="red"><b>.*</b></font>
+
  matches: Special characters must be escaped<font color="red">.*</font>
  
 
=== Positional special characters ===  
 
=== Positional special characters ===  
  
 
  regex  : <font color="blue">/^Mary/</font>
 
  regex  : <font color="blue">/^Mary/</font>
  matches: <font color="red"><b>Mary</b></font> had a little lamb.
+
  matches: <font color="red">Mary</font> had a little lamb.
 
           And everywhere that Mary
 
           And everywhere that Mary
 
           went, the lamb was sure
 
           went, the lamb was sure
Line 29: Line 52:
 
  regex  : <font color="blue">/Mary$/</font>
 
  regex  : <font color="blue">/Mary$/</font>
 
  matches: Mary had a little lamb.
 
  matches: Mary had a little lamb.
           And everywhere that <font color="red"><b>Mary</b></font>
+
           And everywhere that <font color="red">Mary</font>
 
           went, the lamb was sure
 
           went, the lamb was sure
 
           to go.
 
           to go.
Line 37: Line 60:
 
  <font color="blue">/.a/ </font>
 
  <font color="blue">/.a/ </font>
  
  <font color="red"><b>Ma</b></font>ry <font color="red"><b>ha</b></font>d<font color="red"><b> a</b></font> little <font color="red"><b>la</b></font>mb.
+
  <font color="red">Ma</font>ry <font color="red">ha</font>d <font color="red">a</font> little <font color="red">la</font>mb.
  And everywhere t<font color="red"><b>ha</b></font>t <font color="red"><b>Ma</b></font>ry
+
  And everywhere t<font color="red">ha</font>t <font color="red">Ma</font>ry
  went, the <font color="red"><b>la</b></font>mb <font color="red"><b>wa</b></font>s sure
+
  went, the <font color="red">la</font>mb <font color="red">wa</font>s sure
 
  to go.
 
  to go.
  
Line 46: Line 69:
 
  <font color="blue">/(Mary)( )(had)/ </font>
 
  <font color="blue">/(Mary)( )(had)/ </font>
  
  <font color="red"><b>Mary had</b></font> a little lamb.
+
  <font color="red">Mary had</font> a little lamb.
 
  And everywhere that Mary
 
  And everywhere that Mary
 
  went, the lamb was sure
 
  went, the lamb was sure
Line 55: Line 78:
 
  <font color="blue">/[a-z]a/ </font>
 
  <font color="blue">/[a-z]a/ </font>
  
  Mary <font color="red"><b>ha</b></font>d a little <font color="red"><b>la</b></font>mb.
+
  Mary <font color="red">ha</font>d a little <font color="red">la</font>mb.
  And everywhere t<font color="red"><b>ha</b></font>t Mary
+
  And everywhere t<font color="red">ha</font>t Mary
  went, the <font color="red"><b>la</b></font>mb <font color="red"><b>wa</b></font>s sure
+
  went, the <font color="red">la</font>mb <font color="red">wa</font>s sure
 
  to go.
 
  to go.
  
Line 64: Line 87:
 
  <font color="blue">/[^a-z]a/ </font>
 
  <font color="blue">/[^a-z]a/ </font>
  
  <font color="red"><b>Ma</b></font>ry had<font color="red"><b> a</b></font> little lamb.
+
  <font color="red">Ma</font>ry had <font color="red">a</font> little lamb.
  And everywhere that <font color="red"><b>Ma</b></font>ry
+
  And everywhere that <font color="red">Ma</font>ry
 
  went, the lamb was sure
 
  went, the lamb was sure
 
  to go.
 
  to go.
Line 73: Line 96:
 
  <font color="blue">/cat|dog|bird/</font>
 
  <font color="blue">/cat|dog|bird/</font>
  
  The pet store sold <font color="red"><b>cat</b></font>s, <font color="red"><b>dog</b></font>s, and <font color="red"><b>bird</b></font>s.
+
  The pet store sold <font color="red">cat</font>s, <font color="red">dog</font>s, and <font color="red">bird</font>s.
  
 
  <font color="blue">/=first|second=/</font>
 
  <font color="blue">/=first|second=/</font>
  
  <font color="red"><b>=first</b></font> first= # =second <font color="red"><b>second=</b></font> # <font color="red"><b>=first</b></font>= # =<font color="red"><b>second=</b></font>
+
  <font color="red">=first</font> first= # =second <font color="red">second=</font> # <font color="red">=first</font>= # =<font color="red">second=</font>
  
 
  <font color="blue">/(=)(first)|(second)(=)/</font>
 
  <font color="blue">/(=)(first)|(second)(=)/</font>
  
  <font color="red"><b>=first</b></font> first= # =second <font color="red"><b>second=</b></font> # <font color="red"><b>=first</b></font>= # =<font color="red"><b>second=</b></font>
+
  <font color="red">=first</font> first= # =second <font color="red">second=</font> # <font color="red">=first</font>= # =<font color="red">second=</font>
  
 
  <font color="blue">/=(first|second)=/</font>
 
  <font color="blue">/=(first|second)=/</font>
  
  =first first= # =second second= # <font color="red"><b>=first=</b></font> # <font color="red"><b>=second=</b></font>
+
  =first first= # =second second= # <font color="red">=first=</font> # <font color="red">=second=</font>
  
 
=== The basic abstract quantifier ===
 
=== The basic abstract quantifier ===
Line 91: Line 114:
 
  <font color="blue">/@(=+=)*@/ </font>
 
  <font color="blue">/@(=+=)*@/ </font>
  
Match with zero in the middle: <font color="red"><b>@@</b></font>
+
Match with zero in the middle: <font color="red">@@</font>
 
Subexpresion occurs, but...: @=+=ABC@
 
Subexpresion occurs, but...: @=+=ABC@
Lots of occurrences: <font color="red"><b>@=+==+==+==+==+=@</b></font>
+
Lots of occurrences: <font color="red">@=+==+==+==+==+=@</font>
 
Must repeat entire pattern: @=+==+=+==+=@
 
Must repeat entire pattern: @=+==+=+==+=@
  
Line 102: Line 125:
 
  <font color="blue">/A+B*C?D</font>
 
  <font color="blue">/A+B*C?D</font>
  
  <font color="red"><b>AAAD</b></font>
+
  <font color="red">AAAD</font>
  <font color="red"><b>ABBBBCD</b></font>
+
  <font color="red">ABBBBCD</font>
 
  BBBCD
 
  BBBCD
 
  ABCCD
 
  ABCCD
Line 112: Line 135:
 
  <font color="blue">/a{5} b{,6} c{4,8}/</font>
 
  <font color="blue">/a{5} b{,6} c{4,8}/</font>
  
  <font color="red"><b>aaaaa bbbbb ccccc</b></font>
+
  <font color="red">aaaaa bbbbb ccccc</font>
 
  aaa bbb ccc
 
  aaa bbb ccc
 
  aaaaa bbbbbbbbbbbbbb ccccc
 
  aaaaa bbbbbbbbbbbbbb ccccc
Line 118: Line 141:
 
  <font color="blue">/a+ b{3,} c?/</font>
 
  <font color="blue">/a+ b{3,} c?/</font>
  
  <font color="red"><b>aaaaa bbbbb c</b></font>cccc
+
  <font color="red">aaaaa bbbbb c</font>cccc
  <font color="red"><b>aaa bbb c</b></font>cc
+
  <font color="red">aaa bbb c</font>cc
  <font color="red"><b>aaaaa bbbbbbbbbbbbbb c</b></font>cccc
+
  <font color="red">aaaaa bbbbbbbbbbbbbb c</font>cccc
  
 
  <font color="blue">/a{5} b{6,} c{4,8}/</font>
 
  <font color="blue">/a{5} b{6,} c{4,8}/</font>
Line 126: Line 149:
 
  aaaaa bbbbb ccccc
 
  aaaaa bbbbb ccccc
 
  aaa bbb ccc
 
  aaa bbb ccc
  <font color="red"><b>aaaaa bbbbbbbbbbbbbb ccccc</b></font>
+
  <font color="red">aaaaa bbbbbbbbbbbbbb ccccc</font>
  
 
=== Backreferences ===
 
=== Backreferences ===
Line 134: Line 157:
 
  jkl abc xyz
 
  jkl abc xyz
 
  jkl xyz abc
 
  jkl xyz abc
  jkl <font color="red"><b>abc abc</b></font>
+
  jkl <font color="red">abc abc</font>
  jkl <font color="red"><b>xyz xyz</b></font>
+
  jkl <font color="red">xyz xyz</font>
  
 
  <font color="blue">/(abc|xyz) (abc|xyz)/</font>
 
  <font color="blue">/(abc|xyz) (abc|xyz)/</font>
  
  jkl <font color="red"><b>abc xyz</b></font>
+
  jkl <font color="red">abc xyz</font>
  jkl <font color="red"><b>xyz abc</b></font>
+
  jkl <font color="red">xyz abc</font>
  jkl <font color="red"><b>abc abc</b></font>
+
  jkl <font color="red">abc abc</font>
  jkl <font color="red"><b>xyz xyz</b></font>
+
  jkl <font color="red">xyz xyz</font>
  
 
=== Don't match more than you want to ===
 
=== Don't match more than you want to ===
Line 148: Line 171:
 
  <font color="blue">/th.*s/</font>
 
  <font color="blue">/th.*s/</font>
  
  -- I want to match <font color="red"><b>the words that s</b></font>tart
+
  -- I want to match <font color="red">the words that s</font>tart
  -- wi<font color="red"><b>th 'th' and end with 's</b></font>'.
+
  -- wi<font color="red">th 'th' and end with 's</font>'.
  <font color="red"><b>this</b></font>
+
  <font color="red">this</font>
  <font color="red"><b>thus</b></font>
+
  <font color="red">thus</font>
  <font color="red"><b>this</b></font>tle
+
  <font color="red">this</font>tle
  <font color="red"><b>this line matches</b></font> too much
+
  <font color="red">this line matches</font> too much
  
 
=== Tricks for restraining matches ===
 
=== Tricks for restraining matches ===
Line 159: Line 182:
 
  <font color="blue">/th[^s]*./</font>
 
  <font color="blue">/th[^s]*./</font>
  
  -- I want to match <font color="red"><b>the words</b></font> <font color="red"><b>that s</b></font>tart
+
  -- I want to match <font color="red">the words</font> <font color="red">that s</font>tart
  -- wi<font color="red"><b>th 'th' and end with 's</b></font>'.
+
  -- wi<font color="red">th 'th' and end with 's</font>'.
  <font color="red"><b>this</b></font>
+
  <font color="red">this</font>
  <font color="red"><b>thus</b></font>
+
  <font color="red">thus</font>
  <font color="red"><b>this</b></font>tle
+
  <font color="red">this</font>tle
  <font color="red"><b>this</b></font> line matches too much
+
  <font color="red">this</font> line matches too much
  
 
=== A literal-string modification example ===
 
=== A literal-string modification example ===
  
  <font color="blue">s/cat/dog/g </font>
+
  {{Regex/Replace|cat|dog|g}}
  
  &lt; The zoo had wild dogs, bobcats, lions, and other wild cats.
+
  &lt; The zoo had wild dogs, bob<font color="red">cat</font>s, lions, and other wild <font color="red">cat</font>s.
  &gt; The zoo had wild dogs, bob<font color="red"><b>dog</b></font>s, lions, and other wild <font color="red"><b>dog</b></font>s.
+
  &gt; The zoo had wild dogs, bob<font color="green">dog</font>s, lions, and other wild <font color="green">dog</font>s.
  
 
=== A pattern-match modification example ===
 
=== A pattern-match modification example ===
  
  <font color="blue">s/cat|dog/snake/g </font>
+
  {{Regex/Replace|cat<nowiki>|</nowiki>dog|snake|g}}
  
  &lt; The zoo had wild dogs, bobcats, lions, and other wild cats.
+
  &lt; The zoo had wild <font color="red">dog</font>s, bob<font color="red">cat</font>s, lions, and other wild <font color="red">cat</font>s.
  &gt; The zoo had wild <font color="red"><b>snake</b></font>s, bob<font color="red"><b>snake</b></font>s, lions, and other wild <font color="red"><b>snake</b></font>s.
+
  &gt; The zoo had wild <font color="green">snake</font>s, bob<font color="green">snake</font>s, lions, and other wild <font color="green">snake</font>s.
  
  <font color="blue">s/[a-z]+i[a-z]*/nice/g </font>
+
  {{Regex/Replace|[a-z]+i[a-z]*|nice|g}}
  
  &lt; The zoo had wild dogs, bobcats, lions, and other wild cats.
+
  &lt; The zoo had <font color="red">wild</font> dogs, bobcats, <font color="red">lions</font>, and other <font color="red">wild</font> cats.
  &gt; The zoo had <font color="red"><b>nice</b></font> dogs, bobcats, <font color="red"><b>nice</b></font>, and other <font color="red"><b>nice</b></font> cats.
+
  &gt; The zoo had <font color="green">nice</font> dogs, bobcats, <font color="green">nice</font>, and other <font color="green">nice</font> cats.
  
 
=== Modification using backreferences ===
 
=== Modification using backreferences ===
  
  sed -r ''''s/'''<font color="red">([A-Z])([0-9]{2,4}) </font>'''/'''<font color="green">\2:\1 </font>'''/g'''' ''INPUT''
+
  sed -r '{{Regex/Replace|([A-Z])([0-9]{2,4}) |\2:\1 |g}}' INPUT
  
 
  INPUT : <font color="red">A37</font> B4 <font color="red">C107</font> D54112 <font color="red">E1103</font> XXX
 
  INPUT : <font color="red">A37</font> B4 <font color="red">C107</font> D54112 <font color="red">E1103</font> XXX
 
  OUTPUT: <font color="green">37:A</font> B4 <font color="green">107:C</font> D54112 <font color="green">1103:E</font> XXX
 
  OUTPUT: <font color="green">37:A</font> B4 <font color="green">107:C</font> D54112 <font color="green">1103:E</font> XXX
 +
 +
=== Match a US telephone number ===
 +
((\([2-9][0-9]{2}\))?\ ?|[2-9][0-9]{2}(?:\-?|\ ?))[2-9][0-9]{2}[- ]?[0-9]{4}
 +
 +
This regexp matches US telephone numbers in any of 15 formats:
 +
(NPA) PRE-SUFF
 +
(NPA) PRE SUFF
 +
(NPA) PRESUFF
 +
(NPA)PRE-SUFF
 +
(NPA)PRE SUFF
 +
(NPA)PRESUFF
 +
NPA PRE-SUFF
 +
NPA PRE SUFF
 +
NPA PRESUFF
 +
NPAPRE-SUFF
 +
NPAPRE SUFF
 +
NPAPRESUFF
 +
PRE-SUFF
 +
PRE SUFF
 +
PRESUFF
  
 
== Advanced Regular Expression Extensions ==
 
== Advanced Regular Expression Extensions ==
Line 198: Line 241:
 
  <font color="blue">/th.*s/</font>
 
  <font color="blue">/th.*s/</font>
  
  -- I want to match <font color="red"><b>the words that s</b></font>tart
+
  -- I want to match <font color="red">the words that s</font>tart
  -- wi<font color="red"><b>th 'th' and end with 's</b></font>'.
+
  -- wi<font color="red">th 'th' and end with 's</font>'.
  <font color="red"><b>this line matches jus</b></font>t right
+
  <font color="red">this line matches jus</font>t right
  <font color="red"><b>this # thus # this</b></font>tle
+
  <font color="red">this # thus # this</font>tle
  
 
  <font color="blue">/th.*?s/</font>
 
  <font color="blue">/th.*?s/</font>
  
  -- I want to match <font color="red"><b>the words</b></font> that start
+
  -- I want to match <font color="red">the words</font> that start
  -- with '<font color="red"><b>th' and end with 's</b></font>'.
+
  -- with '<font color="red">th' and end with 's</font>'.
  <font color="red"><b>this</b></font> # <font color="red"><b>thus</b></font> # <font color="red"><b>this</b></font>tle
+
  <font color="red">this</font> # <font color="red">thus</font> # <font color="red">this</font>tle
  <font color="red"><b>this</b></font> line matches just right
+
  <font color="red">this</font> line matches just right
  
 
  <font color="blue">/th.*?s /</font>
 
  <font color="blue">/th.*?s /</font>
Line 214: Line 257:
 
  -- I want to match the words that start
 
  -- I want to match the words that start
 
  -- with 'th' and end with 's'. (FINALLY!)
 
  -- with 'th' and end with 's'. (FINALLY!)
  <font color="red"><b>this</b></font> # <font color="red"><b>thus</b></font> # thistle
+
  <font color="red">this</font> # <font color="red">thus</font> # thistle
  <font color="red"><b>this</b></font> line matches just right
+
  <font color="red">this</font> line matches just right
  
 
=== Pattern-match modifiers ===
 
=== Pattern-match modifiers ===
Line 221: Line 264:
 
  <font color="blue">/M.*[ise] /</font>
 
  <font color="blue">/M.*[ise] /</font>
  
  <font color="red"><b>MAINE # Massachusetts </b></font># Colorado #
+
  <font color="red">MAINE # Massachusetts </font># Colorado #
  mississippi # <font color="red"><b>Missouri </b></font># Minnesota #
+
  mississippi # <font color="red">Missouri </font># Minnesota #
  
 
  <font color="blue">/M.*[ise] /i</font>
 
  <font color="blue">/M.*[ise] /i</font>
  
  <font color="red"><b>MAINE # Massachusetts </b></font># Colorado #
+
  <font color="red">MAINE # Massachusetts </font># Colorado #
  <font color="red"><b>mississippi # Missouri </b></font># Minnesota #
+
  <font color="red">mississippi # Missouri </font># Minnesota #
  
 
  <font color="blue">/M.*[ise] /gis</font>
 
  <font color="blue">/M.*[ise] /gis</font>
  
  <font color="red"><b>MAINE # Massachusetts # Colorado #
+
  <font color="red">MAINE # Massachusetts # Colorado #
  mississippi # Missouri </b></font># Minnesota #
+
  mississippi # Missouri </font># Minnesota #
  
=== Changing backreference behavior ===
+
=== Changing back-reference behaviour ===
  
  <font color="blue">s/([A-Z])(?:-[a-z]{3}-)([0-9]*)/\1\2/g</font>
+
  {{Regex/Replace|([A-Z])(?:-[a-z]{3}-)([0-9]*)|\1\2|g}}
  
 
  &lt; A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93
 
  &lt; A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93
  &gt; <font color="red"><b>A37</b></font> # B:abcd:42 # <font color="red"><b>C66</b></font> # <font color="red"><b>D93</b></font>
+
  &gt; <font color="red">A37</font> # B:abcd:142 # <font color="red">C66</font> # <font color="red">D93</font>
  
=== Naming backreferences ===
+
=== Naming back-references ===
  
 
  <font color="blue">import re
 
  <font color="blue">import re
Line 248: Line 291:
 
             "\g&lt;prefix&gt;\g&lt;id&gt;", txt) </font>
 
             "\g&lt;prefix&gt;\g&lt;id&gt;", txt) </font>
  
  <font color="red"><b>A37</b></font> # B:abcd:42 # <font color="red"><b>C66</b></font> # <font color="red"><b>D93</b></font>
+
  <font color="red">A37</font> # B:abcd:142 # <font color="red">C66</font> # <font color="red">D93</font>
  
 
=== Lookahead assertions ===
 
=== Lookahead assertions ===
  
  <font color="blue">s/([A-Z]-)(?=[a-z]{3})([a-z0-9]* )/\2\1/g</font>
+
  {{Regex/Replace|([A-Z]-)(?<nowiki>=</nowiki>[a-z]{3})([a-z0-9]* )|\2\1|g}}
  
 
  &lt; A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
 
  &lt; A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
  &gt; <font color="red"><b>xyz37A-</b></font> # B-ab6142 # C-Wxy66 # <font color="red"><b>qrs93D-</b></font>
+
  &gt; <font color="red">xyz37A-</font> # B-ab6142 # C-Wxy66 # <font color="red">qrs93D-</font>
  
  <font color="blue">s/([A-Z]-)(?![a-z]{3})([a-z0-9]* )/\2\1/g</font>
+
  {{Regex/Replace|([A-Z]-)(?![a-z]{3})([a-z0-9]* )|\2\1|g}}
  
 
  &lt; A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
 
  &lt; A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
  &gt; A-xyz37 # <font color="red"><b>ab6142B-</b></font> # <font color="red"><b>Wxy66C-</b></font> # D-qrs93
+
  &gt; A-xyz37 # <font color="red">ab6142B-</font> # <font color="red">Wxy66C-</font> # D-qrs93
  
 
=== Making regular expressions more readable ===
 
=== Making regular expressions more readable ===
  
 
  <font color="blue">
 
  <font color="blue">
  /              # identify URLs within a text file
+
  /              </font># identify URLs within a text file<font color="blue">
           [^="] # do not match URLs in IMG tags like:
+
           [^="] </font># do not match URLs in IMG tags like:<font color="blue">
                 # &lt;img <nowiki>src="http://mysite.com/mypic.png"</nowiki>&gt;
+
                 </font># &lt;img <nowiki>src="http://mysite.com/mypic.png"</nowiki>&gt;<font color="blue">
  http|ftp|gopher # make sure we find a resource type
+
  http|ftp|gopher </font># make sure we find a resource type<font color="blue">
           :\/\/ # ...needs to be followed by colon-slash-slash
+
           :\/\/ </font># ...needs to be followed by colon-slash-slash<font color="blue">
       [^ \n\r]+ # some stuff than space, newline, tab is in URL
+
       [^ \n\r]+ </font># some stuff than space, newline, tab is in URL<font color="blue">
     (?=[\s\.,]) # assert: followed by whitespace/period/comma
+
     (?=[\s\.,]) </font># assert: followed by whitespace/period/comma<font color="blue">
 
  /
 
  /
 
  </font>
 
  </font>
  The URL for my site is: <font color="red"><b><nowiki>http://mysite.com/mydoc.html</nowiki></b></font>. You
+
  The URL for my site is: <font color="red"><nowiki>http://mysite.com/mydoc.html</nowiki></font>. You
  might also enjoy <font color="red"><b><nowiki>ftp://yoursite.com/index.html</nowiki></b></font> for a good
+
  might also enjoy <font color="red"><nowiki>ftp://yoursite.com/index.html</nowiki></font> for a good
 
  place to download files.
 
  place to download files.
  
Line 286: Line 329:
 
* ''Programming Perl'' (1996), Larry Wall, Tom Christiansen & Randal L. Schwartz, O'Reilly, Cambridge, MA.
 
* ''Programming Perl'' (1996), Larry Wall, Tom Christiansen & Randal L. Schwartz, O'Reilly, Cambridge, MA.
  
== External links ==
+
==See also==
* [http://en.wikipedia.org/wiki/Regular_Expressions Wikipedia article on '''Regular Expressions''']
+
*[http://pyparsing.wikispaces.com/ Pyparsing]
* [http://www.regexlib.com/ Regular Expression Library] &mdash; currently contains over 1000 expressions from contributors around the world.
+
 
* [http://www.regular-expressions.info/ Regular-Expressions.info] &mdash; one of the most comprehensive, free regular expression tutorials on the net.
+
==External links==
 +
*[http://www.regexlib.com/ Regular Expression Library] &mdash; currently contains over 1000 expressions from contributors around the world.
 +
*[http://www.regular-expressions.info/ Regular-Expressions.info] &mdash; one of the most comprehensive, free regular expression tutorials on the net.
 +
*[http://www.pcre.org/ PCRE - Perl Compatible Regular Expressions]
 +
**[http://perldoc.perl.org/perlre.html perlre] &mdash; Perl regular expressions
 +
*[http://www.csm.astate.edu/~rossa/regular.html Regular expressions and commands that use them]
 +
*[http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html Regular expressions] &mdash; by The Open Group Base Specifications Issue 6
 +
*[http://osteele.com/tools/rework/ reWork: a regular expression workbench]
 +
*[http://regexpal.com/ RegexPal] &mdash; a JavaScript regular expression tester
 +
*[[wikipedia:Regular expression]]
 +
*[http://regexadvice.com/forums/ RegexAdvice - forum] &mdash; ("''cc''")
 +
===Other===
 +
*[http://www.txt2re.com/ txt2re.com]
  
 
[[Category:Technical and Specialized Skills]]
 
[[Category:Technical and Specialized Skills]]

Latest revision as of 21:46, 9 March 2009

A regular expression (abbreviated as regexp, regex, or regxp, with plural forms regexps, regexes, or regexen) is a string that describes or matches a set of strings according to certain syntax rules. Many text editors and utilities use regular expressions to search and manipulate bodies of text based on patterns, and many programming languages support them for string manipulation. For example, Perl and Tcl have powerful regular expression engines built directly into their syntax. The utilities provided by Unix distributions, including the editor sed and the filter grep, were the first to popularize the concept of regular expressions.

see also: Evolved regular expressions (EREs)

Character classes (with GNU/POSIX extensions)

 [[:alnum:]]  -> [A-Za-z0-9]     # Alphanumeric characters
 [[:alpha:]]  -> [A-Za-z]        # Alphabetic characters
 [[:blank:]]  -> [ \x09]         # Space or tab characters only
 [[:cntrl:]]  -> [\x00-\x1F\x7F] # Control characters
 [[:digit:]]  -> [0-9]           # Numeric characters
 [[:graph:]]  -> [!-~]           # Printable and visible characters
 [[:lower:]]  -> [a-z]           # Lower-case alphabetic characters
 [[:print:]]  -> [ -~]           # Printable (non-Control) characters
 [[:punct:]]  -> [!-/:-@[-`{-~]  # Punctuation characters
 [[:space:]]  -> [ \t\n\v\f\r]   # All whitespace characters
 [[:upper:]]  -> [A-Z]           # Upper-case alphabetic characters
 [[:xdigit:]] -> [0-9a-fA-F]     # Hexadecimal digit characters (/[\dA-Fa-f]+/)
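
Python's re module does not understand POSIX bracket classes such as [[:punct:]], so the explicit ranges in the right-hand column are what one would write there. The following is a minimal sketch, assuming plain ASCII input, that checks a few of those ranges against the constants in Python's string module. The names ascii_chars, ranges, and members are made up for this illustration.

```python
import re
import string

# All 128 ASCII characters, in code-point order.
ascii_chars = "".join(chr(code) for code in range(128))

# Explicit ranges copied from the table ('[' is escaped inside the class).
ranges = {
    "alnum": r"[A-Za-z0-9]",
    "graph": r"[!-~]",
    "punct": r"[!-/:-@\[-`{-~]",
    "xdigit": r"[0-9a-fA-F]",
}

# For each class, collect the set of ASCII characters its range matches.
members = {name: set(re.findall(rng, ascii_chars)) for name, rng in ranges.items()}

# Compare against the standard library's own idea of these character sets.
print(members["punct"] == set(string.punctuation))
print(members["xdigit"] == set(string.hexdigits))
print(members["alnum"] == set(string.ascii_letters + string.digits))
```

In locales other than C/POSIX the named classes can match more characters than these ASCII ranges, so the equivalences should be read as ASCII-only.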

Metacharacters

General quantifiers

* -> {0,}   # match preceding item zero or more times
+ -> {1,}   # match preceding item one or more times
? -> {0,1}  # match preceding item zero or one time (i.e. optional)
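
The equivalences above can be checked mechanically. This is a small sketch in Python, where the sample strings and the helper name accepted are invented for the illustration. Each shorthand quantifier is compared with its brace form on the same inputs.

```python
import re

# Each shorthand quantifier paired with its brace form.
pairs = [("ab*c", "ab{0,}c"), ("ab+c", "ab{1,}c"), ("ab?c", "ab{0,1}c")]
samples = ["ac", "abc", "abbc", "abbbc"]

def accepted(pattern, texts):
    """Return the texts that the pattern matches from start to end."""
    return [t for t in texts if re.fullmatch(pattern, t)]

for short, braces in pairs:
    # Both spellings should accept exactly the same samples.
    print(short, accepted(short, samples), accepted(braces, samples))
```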

Examples

Character literals

Mary had a little lamb.
And everywhere that Mary went, the lamb was sure to go.
regex  : /Mary/
matches: Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

"Escaped" character literals

regex  : /.*/
matches: Special characters must be escaped.*
regex  : /\.\*/
matches: Special characters must be escaped.*
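
The difference between the two patterns is easier to see when the matched text is printed. The sketch below uses Python's re module (an assumption, since the examples above are written in Perl-style /.../ notation) and also shows re.escape, which adds the backslashes automatically.

```python
import re

text = "Special characters must be escaped.*"

# Unescaped: '.' matches any character and '*' repeats it, so the
# pattern swallows the whole line.
unescaped = re.search(r".*", text).group()

# Escaped: only the literal two characters ".*" at the end are matched.
escaped = re.search(r"\.\*", text)

# re.escape inserts the backslashes for any metacharacters in a string.
auto_pattern = re.escape(".*")

print(repr(unescaped))
print(escaped.group(), escaped.span())
print(auto_pattern)
```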

Positional special characters

regex  : /^Mary/
matches: Mary had a little lamb.
         And everywhere that Mary
         went, the lamb was sure
         to go.
regex  : /Mary$/
matches: Mary had a little lamb.
         And everywhere that Mary
         went, the lamb was sure
         to go.
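
Whether ^ and $ anchor at every line or only at the ends of the whole text depends on the tool. Line-oriented utilities such as grep test each line separately, while Python needs the re.MULTILINE flag to get the per-line behaviour shown above. A minimal sketch, with the rhyme stored in one multi-line string:

```python
import re

rhyme = ("Mary had a little lamb.\n"
         "And everywhere that Mary\n"
         "went, the lamb was sure\n"
         "to go.")

# Without a flag, ^ and $ anchor only at the ends of the whole string.
whole_start = re.findall(r"^Mary", rhyme)
whole_end = re.findall(r"Mary$", rhyme)

# With re.MULTILINE they also anchor at every line break.
line_start = re.findall(r"^Mary", rhyme, re.MULTILINE)
line_end = re.findall(r"Mary$", rhyme, re.MULTILINE)

print(whole_start, whole_end, line_start, line_end)
```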

The "wildcard" character

/.a/ 
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

Grouping regular expressions

/(Mary)( )(had)/ 
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

Character classes

/[a-z]a/ 
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

Complement operator

/[^a-z]a/ 
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
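
The wildcard, character class, and complemented class examples differ only in which character they allow before the "a". The following Python sketch (variable names are illustrative) runs all three patterns over the rhyme and lists what each one finds.

```python
import re

rhyme = ("Mary had a little lamb.\n"
         "And everywhere that Mary\n"
         "went, the lamb was sure\n"
         "to go.")

# '.' accepts any character except a newline before the 'a'.
any_then_a = re.findall(r".a", rhyme)

# '[a-z]' accepts only a lowercase letter before the 'a'.
lower_then_a = re.findall(r"[a-z]a", rhyme)

# '[^a-z]' accepts anything that is NOT a lowercase letter before the 'a'.
other_then_a = re.findall(r"[^a-z]a", rhyme)

print(any_then_a)
print(lower_then_a)
print(other_then_a)
```

Every wildcard match falls into exactly one of the other two lists, because a character either is or is not a lowercase letter.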

Alternation of patterns

/cat|dog|bird/
The pet store sold cats, dogs, and birds.
/=first|second=/
=first first= # =second second= # =first= # =second=
/(=)(first)|(second)(=)/
=first first= # =second second= # =first= # =second=
/=(first|second)=/
=first first= # =second second= # =first= # =second=
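
Alternation has the lowest precedence of the regex operators, which is why the ungrouped pattern behaves like "=first" or "second=". A short Python sketch of the difference follows. It uses a non-capturing group (?:...) in place of the plain parentheses so that re.findall returns whole matches.

```python
import re

line = "=first first= # =second second= # =first= # =second="

# Ungrouped: the alternatives are '=first' and 'second='.
loose = re.findall(r"=first|second=", line)

# Grouped: the '=' signs on both sides apply to either word.
tight = re.findall(r"=(?:first|second)=", line)

print(loose)
print(tight)
```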

The basic abstract quantifier

/@(=+=)*@/ 

Match with zero in the middle: @@
Subexpression occurs, but...: @=+=ABC@
Lots of occurrences: @=+==+==+==+==+=@
Must repeat entire pattern: @=+==+=+==+=@

Matching Patterns in Text: Intermediate

More abstract quantifiers

/A+B*C?D/
AAAD
ABBBBCD
BBBCD
ABCCD
AAABBBC

Numeric quantifiers

/a{5} b{,6} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
/a+ b{3,} c?/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
/a{5} b{6,} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
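
The three numeric-quantifier patterns can be run against the three sample lines to see which lines each one accepts. This sketch assumes Python's re module, which reads {,6} as "at most six"; some other engines treat a missing lower bound differently.

```python
import re

lines = ["aaaaa bbbbb ccccc", "aaa bbb ccc", "aaaaa bbbbbbbbbbbbbb ccccc"]
patterns = [r"a{5} b{,6} c{4,8}", r"a+ b{3,} c?", r"a{5} b{6,} c{4,8}"]

# Map each pattern to the lines in which it finds a match.
hits = {p: [s for s in lines if re.search(p, s)] for p in patterns}

for pattern, matched in hits.items():
    print(pattern, "->", matched)
```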

Backreferences

/(abc|xyz) \1/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
/(abc|xyz) (abc|xyz)/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
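
A backreference repeats the text that a group matched, not the group's pattern. The Python sketch below (list names are illustrative) shows that \1 accepts only the lines where the same word appears twice, while repeating the group accepts any combination.

```python
import re

lines = ["jkl abc xyz", "jkl xyz abc", "jkl abc abc", "jkl xyz xyz"]

# \1 must equal whatever the first group actually matched.
same_word = [s for s in lines if re.search(r"(abc|xyz) \1", s)]

# Repeating the group lets each side choose independently.
either_word = [s for s in lines if re.search(r"(abc|xyz) (abc|xyz)", s)]

print(same_word)
print(either_word)
```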

Don't match more than you want to

/th.*s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much

Tricks for restraining matches

/th[^s]*./
-- I want to match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much

A literal-string modification example

s/cat/dog/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild dogs, bobdogs, lions, and other wild dogs.

A pattern-match modification example

s/cat|dog/snake/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild snakes, bobsnakes, lions, and other wild snakes.
s/[a-z]+i[a-z]*/nice/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had nice dogs, bobcats, nice, and other nice cats.

Modification using backreferences

sed -r 's/([A-Z])([0-9]{2,4}) /\2:\1 /g' INPUT
INPUT : A37 B4 C107 D54112 E1103 XXX
OUTPUT: 37:A B4 107:C D54112 1103:E XXX
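
The same substitution can be written with Python's re.sub, as a rough equivalent of the sed command rather than a replacement for it. Here \1 and \2 in the replacement string refer to the letter and the digits captured by the two groups.

```python
import re

line = "A37 B4 C107 D54112 E1103 XXX"

# Group 1 is the capital letter, group 2 is a run of two to four digits;
# the trailing space keeps longer digit runs such as D54112 from matching.
swapped = re.sub(r"([A-Z])([0-9]{2,4}) ", r"\2:\1 ", line)

print(swapped)
```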

Match a US telephone number

((\([2-9][0-9]{2}\))?\ ?|[2-9][0-9]{2}(?:\-?|\ ?))[2-9][0-9]{2}[- ]?[0-9]{4}

This regexp matches US telephone numbers in any of 15 formats:

(NPA) PRE-SUFF
(NPA) PRE SUFF
(NPA) PRESUFF
(NPA)PRE-SUFF
(NPA)PRE SUFF
(NPA)PRESUFF
NPA PRE-SUFF
NPA PRE SUFF
NPA PRESUFF
NPAPRE-SUFF
NPAPRE SUFF
NPAPRESUFF
PRE-SUFF
PRE SUFF
PRESUFF
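
A quick way to try the pattern is to compile it and test whole strings with fullmatch. The sketch below is in Python. The helper name is_us_phone and the sample numbers are made up for this illustration, and the pattern itself is copied unchanged.

```python
import re

PHONE = re.compile(
    r"((\([2-9][0-9]{2}\))?\ ?|[2-9][0-9]{2}(?:\-?|\ ?))"
    r"[2-9][0-9]{2}[- ]?[0-9]{4}"
)

def is_us_phone(text):
    """True when the whole string is a number the pattern accepts."""
    return PHONE.fullmatch(text) is not None

samples = ["(212) 555-1234", "(212)5551234", "212 555 1234", "2125551234",
           "555-1234", "155-1234", "(212 555-1234"]

for number in samples:
    print(number, is_us_phone(number))
```

The last two samples are rejected: the first because a prefix may not start with 0 or 1, the second because the opening parenthesis is never closed.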

Advanced Regular Expression Extensions

Non-greedy quantifiers

/th.*s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this line matches just right
this # thus # thistle
/th.*?s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this # thus # thistle
this line matches just right
/th.*?s /
-- I want to match the words that start
-- with 'th' and end with 's'. (FINALLY!)
this # thus # thistle
this line matches just right
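
The effect of the ? after .* is clearest when all matches in a line are listed. A minimal Python sketch, using one of the sample lines:

```python
import re

text = "this # thus # thistle"

# Greedy: .* runs as far as it can, then backs up to the last 's'.
greedy = re.findall(r"th.*s", text)

# Non-greedy: .*? stops at the first 's' it can reach.
lazy = re.findall(r"th.*?s", text)

# Requiring a space after the 's' excludes the 'this' inside 'thistle'.
lazy_word = re.findall(r"th.*?s ", text)

print(greedy)
print(lazy)
print(lazy_word)
```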

Pattern-match modifiers

/M.*[ise] /
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
/M.*[ise] /i
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
/M.*[ise] /gis
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
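
In Python the modifiers are passed as flags rather than written after the closing slash. In this sketch re.IGNORECASE stands in for /i, re.DOTALL for /s, and re.findall plays the role of /g by returning every match.

```python
import re

states = ("MAINE # Massachusetts # Colorado #\n"
          "mississippi # Missouri # Minnesota #")

pattern = r"M.*[ise] "

# No flags: 'M' is case-sensitive and '.' stops at the line break.
plain = re.findall(pattern, states)

# /i: 'M' also matches 'm', so the second line matches from its start.
ignore_case = re.findall(pattern, states, re.IGNORECASE)

# /is: '.' now crosses the line break, giving one long match.
dot_all = re.findall(pattern, states, re.IGNORECASE | re.DOTALL)

print(plain)
print(ignore_case)
print(dot_all)
```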

Changing back-reference behaviour

s/([A-Z])(?:-[a-z]{3}-)([0-9]*)/\1\2/g
< A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93
> A37 # B:abcd:142 # C66 # D93

Naming back-references

import re
txt = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
print(re.sub(r"(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)",
             r"\g<prefix>\g<id>", txt))
A37 # B:abcd:142 # C66 # D93

Lookahead assertions

s/([A-Z]-)(?=[a-z]{3})([A-Za-z0-9]*)/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> xyz37A- # B-ab6142 # C-Wxy66 # qrs93D-
s/([A-Z]-)(?![a-z]{3})([A-Za-z0-9]*)/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> A-xyz37 # ab6142B- # Wxy66C- # D-qrs93
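
A lookahead checks what follows without consuming it, so the text it inspects is still available to the next group. The Python sketch below reproduces both substitutions. The pattern used here captures the body with ([A-Za-z0-9]*), so the body may contain capitals and the token is captured without a trailing space.

```python
import re

text = "A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93"

# Swap prefix and body only when three lowercase letters follow the prefix.
positive = re.sub(r"([A-Z]-)(?=[a-z]{3})([A-Za-z0-9]*)", r"\2\1", text)

# Swap only when three lowercase letters do NOT follow the prefix.
negative = re.sub(r"([A-Z]-)(?![a-z]{3})([A-Za-z0-9]*)", r"\2\1", text)

print(positive)
print(negative)
```

Between them the two substitutions touch every token exactly once, since the assertions are exact opposites.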

Making regular expressions more readable


/                   # identify URLs within a text file
              [^="] # do not match URLs in IMG tags like:
                    # <img src="http://mysite.com/mypic.png">
(?:http|ftp|gopher) # make sure we find a resource type
              :\/\/ # ...needs to be followed by colon-slash-slash
          [^ \n\r]+ # something other than space, newline, or carriage return is in the URL
        (?=[\s\.,]) # assert: followed by whitespace/period/comma
/x

The URL for my site is: http://mysite.com/mydoc.html. You
might also enjoy ftp://yoursite.com/index.html for a good
place to download files.
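
Python offers the same commented layout through the re.VERBOSE flag. The following is a sketch under stated assumptions, not a transcription of the pattern above. The name URL_RE is invented, the alternation is wrapped in (?:...) so that | applies only to the scheme, a lookbehind replaces the leading [^="] so the preceding character is not consumed, and the URL body is required to end on a character other than a period or comma so trailing punctuation stays outside the match.

```python
import re

URL_RE = re.compile(r"""
    (?<![="])            # not directly preceded by = or " (skips src="..." values)
    (?:http|ftp|gopher)  # resource type, grouped so '|' covers only the scheme
    ://                  # colon-slash-slash
    [^\s]*               # any run of non-whitespace characters...
    [^\s.,]              # ...ending on something other than a period or comma
    (?=[\s.,]|$)         # assert: followed by whitespace, period, comma, or end
""", re.VERBOSE)

text = ("The URL for my site is: http://mysite.com/mydoc.html. You\n"
        "might also enjoy ftp://yoursite.com/index.html for a good\n"
        "place to download files.")

urls = URL_RE.findall(text)
in_tag = URL_RE.findall('<img src="http://mysite.com/mypic.png">')

print(urls)
print(in_tag)
```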

References

  • TCL/TK in a Nutshell (1999), Paul Raines & Jeff Tranter, O'Reilly, Cambridge, MA.
  • Python Pocket Reference (1998), Mark Lutz, O'Reilly, Cambridge, MA.
  • Mastering Regular Expressions (1997), Friedl, Jeffrey E. F., O'Reilly, Cambridge, MA.
  • sed & awk (1997), Dale Dougherty & Arnold Robbins, O'Reilly, Cambridge, MA.
  • A Practical Guide to Linux (1997), Mark G. Sobell, Addison Wesley, Reading, MA.
  • Programming Perl (1996), Larry Wall, Tom Christiansen & Randal L. Schwartz, O'Reilly, Cambridge, MA.

See also

External links

Other