Regular expression

From Christoph's Personal Wiki
Revision as of 08:18, 23 April 2007 by Christoph (Talk | contribs)


A regular expression (abbreviated regexp, regex, or regxp; plural regexps, regexes, or regexen) is a string that describes or matches a set of strings according to certain syntax rules. Many text editors and utilities use regular expressions to search and manipulate text based on patterns, and many programming languages support them for string manipulation. For example, Perl and Tcl have powerful regular expression engines built directly into their syntax. The utilities provided by Unix distributions (including the editor sed and the filter grep) were the first to popularize the concept of regular expressions.

see also: Extended regular expressions (EREs)

Character classes (with GNU/POSIX extensions)

 [[:alnum:]]  -> [A-Za-z0-9]     # Alphanumeric characters
 [[:alpha:]]  -> [A-Za-z]        # Alphabetic characters
 [[:blank:]]  -> [ \x09]         # Space or tab characters only
 [[:cntrl:]]  -> [\x00-\x1F\x7F] # Control characters
 [[:digit:]]  -> [0-9]           # Numeric characters
 [[:graph:]]  -> [!-~]           # Printable and visible characters
 [[:lower:]]  -> [a-z]           # Lower-case alphabetic characters
 [[:print:]]  -> [ -~]           # Printable (non-Control) characters
 [[:punct:]]  -> [!-/:-@[-`{-~]  # Punctuation characters
 [[:space:]]  -> [ \t\n\r\v\f]   # All whitespace characters
 [[:upper:]]  -> [A-Z]           # Upper-case alphabetic characters
 [[:xdigit:]] -> [0-9a-fA-F]     # Hexadecimal digit characters (/[\dA-Fa-f]+/)
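
Python's standard re module does not recognize the [[:name:]] notation, so the ASCII expansions from the table have to be written out. The following is a minimal sketch of that substitution; POSIX_CLASSES and expand_posix are hypothetical names introduced only for illustration, and only a handful of classes are covered.

```python
import re

# ASCII expansions for a few of the classes in the table above.
# (Illustrative helper table, not part of any library.)
POSIX_CLASSES = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Replace each [:name:] inside a bracket expression with its ASCII range."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)

# "[[:xdigit:]]+" becomes "[0-9A-Fa-f]+", which re can compile.
hex_word = re.compile(expand_posix("[[:xdigit:]]+"))
```

With this helper, hex_word.fullmatch("DeadBeef42") succeeds while hex_word.fullmatch("xyz") returns None.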

Metacharacters

General quantifiers

* -> {0,}   # match preceding item zero or more times
+ -> {1,}   # match preceding item one or more times
? -> {0,1}  # match preceding item zero or one time (i.e. optional)
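
The equivalences above can be checked directly. This is a small sketch using Python's re module; the sample strings and the helper name matches are made up for the demonstration.

```python
import re

# Each shorthand quantifier paired with its brace form.
PAIRS = [("ab*c", "ab{0,}c"), ("ab+c", "ab{1,}c"), ("ab?c", "ab{0,1}c")]
SAMPLES = ["ac", "abc", "abbc", "abbbc"]

def matches(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]

# True when every shorthand accepts exactly the same samples as its brace form.
equivalent = all(matches(short) == matches(braces) for short, braces in PAIRS)
```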

Examples

Character literals

Mary had a little lamb.
And everywhere that Mary went, the lamb was sure to go.
regex  : /Mary/
matches: Mary had a little lamb. And everywhere that Mary went, the lamb was sure to go.

"Escaped" character literals

regex  : /.*/
matches: Special characters must be escaped.*
regex  : /\.\*/
matches: Special characters must be escaped.*
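
The difference between the two patterns is easier to see when the matched text is printed. A minimal sketch in Python follows; re.escape is the standard-library helper that produces the escaped form of a literal string.

```python
import re

text = "Special characters must be escaped.*"

# Unescaped: '.' is a wildcard and '*' repeats it, so the whole line matches.
greedy = re.search(r".*", text).group()

# Escaped: only the literal two-character sequence ".*" matches.
literal = re.search(r"\.\*", text).group()

# re.escape builds the escaped pattern from a literal string.
escaped_pattern = re.escape(".*")
```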

Positional special characters

regex  : /^Mary/
matches: Mary had a little lamb.
         And everywhere that Mary
         went, the lamb was sure
         to go.
regex  : /Mary$/
matches: Mary had a little lamb.
         And everywhere that Mary
         went, the lamb was sure
         to go.
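
Line-oriented tools such as grep test each line separately, so ^ and $ apply per line. In Python the same behaviour needs the re.MULTILINE flag; without it the anchors refer to the whole string. A short sketch, assuming the rhyme is held in one multi-line string:

```python
import re

text = ("Mary had a little lamb.\n"
        "And everywhere that Mary\n"
        "went, the lamb was sure\n"
        "to go.")

# With re.MULTILINE, ^ and $ match at the start and end of every line.
starts = [m.start() for m in re.finditer(r"^Mary", text, re.MULTILINE)]
ends = [m.start() for m in re.finditer(r"Mary$", text, re.MULTILINE)]

# Without the flag, $ only matches at the very end of the string.
whole_string = re.search(r"Mary$", text)
```

Here starts holds the offset of the first "Mary", ends holds the offset of the second one, and whole_string is None.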

The "wildcard" character

/.a/ 
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

Grouping regular expressions

/(Mary)( )(had)/ 
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
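
Each parenthesized group can be retrieved separately after a match. A minimal Python sketch:

```python
import re

m = re.search(r"(Mary)( )(had)", "Mary had a little lamb.")

whole = m.group(0)   # the entire match
parts = m.groups()   # one entry per parenthesized group, in order
```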

Character classes

/[a-z]a/ 
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

Complement operator

/[^a-z]a/ 
Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
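
To see what the character class and its complement each pick out of the rhyme, the matches can be listed. A sketch in Python, assuming the four lines are joined into one string:

```python
import re

text = ("Mary had a little lamb.\n"
        "And everywhere that Mary\n"
        "went, the lamb was sure\n"
        "to go.")

# A lower-case letter followed by 'a'.
in_class = re.findall(r"[a-z]a", text)

# Any character that is NOT a lower-case letter, followed by 'a'.
complement = re.findall(r"[^a-z]a", text)
```

The first list contains "ha", "la", "ha", "la", "wa"; the second contains "Ma", " a", "Ma".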

Alternation of patterns

/cat|dog|bird/
The pet store sold cats, dogs, and birds.
/=first|second=/
=first first= # =second second= # =first= # =second=
/(=)(first)|(second)(=)/
=first first= # =second second= # =first= # =second=
/=(first|second)=/
=first first= # =second second= # =first= # =second=
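
Alternation binds more loosely than concatenation, which is why the ungrouped and grouped patterns differ. A short Python sketch of the first and last patterns above (the helper name found is illustrative):

```python
import re

text = "=first first= # =second second= # =first= # =second="

def found(pattern):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Read as "=first" OR "second=".
loose = found(r"=first|second=")

# Read as "=" then (first OR second) then "=".
grouped = found(r"=(first|second)=")
```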

The basic abstract quantifier

/@(=\+=)*@/

Match with zero in the middle: @@
Subexpression occurs, but...: @=+=ABC@
Lots of occurrences: @=+==+==+==+==+=@
Must repeat entire pattern: @=+==+=+==+=@
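
The four cases can be checked mechanically. In this Python sketch the plus sign inside the group is written as \+, because an unescaped + is itself a quantifier and would not match the literal "+" in the sample strings.

```python
import re

# The group "=+=" (literal plus) may repeat zero or more times between the @ signs.
pattern = re.compile(r"@(=\+=)*@")

samples = [
    "@@",                  # zero repetitions
    "@=+=ABC@",            # extra text after the group
    "@=+==+==+==+==+=@",   # five repetitions
    "@=+==+=+==+=@",       # broken repetition in the middle
]

results = [bool(pattern.search(s)) for s in samples]
```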

Matching Patterns in Text: Intermediate

More abstract quantifiers

/A+B*C?D/
AAAD
ABBBBCD
BBBCD
ABCCD
AAABBBC
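
A quick way to see which of the sample lines the pattern accepts is to filter them. A minimal Python sketch:

```python
import re

lines = ["AAAD", "ABBBBCD", "BBBCD", "ABCCD", "AAABBBC"]

# One or more A, any number of B, an optional C, then a D.
matched = [s for s in lines if re.search(r"A+B*C?D", s)]
```

Only the first two lines survive: "BBBCD" lacks an A, "ABCCD" has two Cs, and "AAABBBC" has no D.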

Numeric quantifiers

/a{5} b{,6} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
/a+ b{3,} c?/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
/a{5} b{6,} c{4,8}/
aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
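
The three patterns can be compared against the three sample lines in one table. This sketch uses Python's re module, which accepts the {,6} form as "zero to six"; other engines may treat a missing lower bound differently.

```python
import re

lines = ["aaaaa bbbbb ccccc", "aaa bbb ccc", "aaaaa bbbbbbbbbbbbbb ccccc"]
patterns = [r"a{5} b{,6} c{4,8}", r"a+ b{3,} c?", r"a{5} b{6,} c{4,8}"]

# For each pattern, one True/False entry per sample line.
table = {p: [bool(re.search(p, s)) for s in lines] for p in patterns}
```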

Backreferences

/(abc|xyz) \1/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
/(abc|xyz) (abc|xyz)/
jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
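
A backreference repeats the text that the group actually matched, not the group's pattern. The following Python sketch contrasts the two patterns on the sample lines:

```python
import re

lines = ["jkl abc xyz", "jkl xyz abc", "jkl abc abc", "jkl xyz xyz"]

# \1 must equal whatever group 1 matched, so both words have to be the same.
same_word = [s for s in lines if re.search(r"(abc|xyz) \1", s)]

# Repeating the group instead lets each side choose independently.
any_pair = [s for s in lines if re.search(r"(abc|xyz) (abc|xyz)", s)]
```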

Don't match more than you want to

/th.*s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much

Tricks for restraining matches

/th[^s]*./
-- I want to match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much
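
Printing the matched text shows how far each pattern reaches. A minimal Python sketch comparing the greedy pattern with the restrained one:

```python
import re

samples = ["this", "thus", "thistle", "this line matches too much"]

# Greedy .* runs on to the last 's' in the line.
greedy = [re.search(r"th.*s", s).group() for s in samples]

# [^s]* cannot cross an 's', so the final '.' lands on the first 's'.
restrained = [re.search(r"th[^s]*.", s).group() for s in samples]
```

For the last sample the greedy pattern returns "this line matches", while the restrained pattern returns just "this".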

A literal-string modification example

s/cat/dog/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild dogs, bobdogs, lions, and other wild dogs.

A pattern-match modification example

s/cat|dog/snake/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild snakes, bobsnakes, lions, and other wild snakes.
s/[a-z]+i[a-z]*/nice/g
< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had nice dogs, bobcats, nice, and other nice cats.
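
The same substitutions can be reproduced with re.sub in Python, which replaces every match by default (the role of the g flag in the s/// notation). A minimal sketch:

```python
import re

zoo = "The zoo had wild dogs, bobcats, lions, and other wild cats."

# Literal string replacement.
literal = re.sub(r"cat", "dog", zoo)

# Either alternative is replaced.
alternation = re.sub(r"cat|dog", "snake", zoo)

# Any lower-case word fragment containing an 'i' after at least one letter.
by_pattern = re.sub(r"[a-z]+i[a-z]*", "nice", zoo)
```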

Modification using backreferences

sed -r 's/([A-Z])([0-9]{2,4}) /\2:\1 /g' INPUT
INPUT : A37 B4 C107 D54112 E1103 XXX
OUTPUT: 37:A B4 107:C D54112 1103:E XXX
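
For comparison, a Python sketch of the same substitution. It assumes the input is held in a string rather than read from a file named INPUT.

```python
import re

line = "A37 B4 C107 D54112 E1103 XXX"

# Group 1 is the letter, group 2 is a run of two to four digits;
# the trailing space keeps longer digit runs such as D54112 from matching.
swapped = re.sub(r"([A-Z])([0-9]{2,4}) ", r"\2:\1 ", line)
```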

Match a US telephone number

((\([2-9][0-9]{2}\))?\ ?|[2-9][0-9]{2}(?:\-?|\ ?))[2-9][0-9]{2}[- ]?[0-9]{4}

This regexp matches US telephone numbers in any of 15 formats:

(NPA) PRE-SUFF
(NPA) PRE SUFF
(NPA) PRESUFF
(NPA)PRE-SUFF
(NPA)PRE SUFF
(NPA)PRESUFF
NPA PRE-SUFF
NPA PRE SUFF
NPA PRESUFF
NPAPRE-SUFF
NPAPRE SUFF
NPAPRESUFF
PRE-SUFF
PRE SUFF
PRESUFF
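
The list can be checked against the pattern by filling each template with sample digits. This is a sketch only: the digits 212, 555, and 1234 and the helper name fill are arbitrary choices for the demonstration.

```python
import re

PHONE = re.compile(
    r"((\([2-9][0-9]{2}\))?\ ?|[2-9][0-9]{2}(?:\-?|\ ?))"
    r"[2-9][0-9]{2}[- ]?[0-9]{4}"
)

FORMATS = [
    "(NPA) PRE-SUFF", "(NPA) PRE SUFF", "(NPA) PRESUFF",
    "(NPA)PRE-SUFF", "(NPA)PRE SUFF", "(NPA)PRESUFF",
    "NPA PRE-SUFF", "NPA PRE SUFF", "NPA PRESUFF",
    "NPAPRE-SUFF", "NPAPRE SUFF", "NPAPRESUFF",
    "PRE-SUFF", "PRE SUFF", "PRESUFF",
]

def fill(template):
    """Substitute sample digits for the NPA, PRE and SUFF placeholders."""
    return (template.replace("NPA", "212")
                    .replace("PRE", "555")
                    .replace("SUFF", "1234"))

# Formats whose filled-in form is matched in full by the pattern.
accepted = [f for f in FORMATS if PHONE.fullmatch(fill(f))]
```

All 15 templates are accepted, while a number such as 155-1234 is rejected because the prefix may not begin with 0 or 1.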

Advanced Regular Expression Extensions

Non-greedy quantifiers

/th.*s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this line matches just right
this # thus # thistle
/th.*?s/
-- I want to match the words that start
-- with 'th' and end with 's'.
this # thus # thistle
this line matches just right
/th.*?s /
-- I want to match the words that start
-- with 'th' and end with 's'. (FINALLY!)
this # thus # thistle
this line matches just right
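
Listing the matches makes the effect of the non-greedy quantifier visible. A Python sketch on the word list from the examples above:

```python
import re

words = "this # thus # thistle"

# Greedy: one match that runs to the last 's' in the line.
greedy = re.findall(r"th.*s", words)

# Non-greedy: stops at the first 's', so "thistle" yields "this".
lazy = re.findall(r"th.*?s", words)

# Requiring a trailing space rejects the "this" inside "thistle".
lazy_space = re.findall(r"th.*?s ", words)
```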

Pattern-match modifiers

/M.*[ise] /
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
/M.*[ise] /i
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
/M.*[ise] /gis
MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #
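
In Python the modifiers are passed as flags: re.I for i and re.S for s. There is no g flag; re.findall returns all matches, which plays the same role. A sketch on the two sample lines, assuming they are joined by a newline:

```python
import re

text = ("MAINE # Massachusetts # Colorado #\n"
        "mississippi # Missouri # Minnesota #")

pattern = r"M.*[ise] "

# Case-sensitive, '.' stops at the newline.
plain = re.findall(pattern, text)

# Case-insensitive: 'm' also starts a match.
ignore_case = re.findall(pattern, text, re.I)

# Case-insensitive and '.' also matches the newline.
ignore_case_dotall = re.findall(pattern, text, re.I | re.S)
```

With re.S the single match spans both lines, from "MAINE" up to and including "Missouri ".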

Changing back-reference behaviour

s/([A-Z])(?:-[a-z]{3}-)([0-9]*)/\1\2/g
< A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93
> A37 # B:abcd:142 # C66 # D93

Naming back-references

import re
txt = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
# Named groups are referenced as \g<name> in the replacement string.
print(re.sub(r"(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)",
             r"\g<prefix>\g<id>", txt))
A37 # B:abcd:142 # C66 # D93

Lookahead assertions

s/([A-Z]-)(?=[a-z]{3})([a-z0-9]*)/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> xyz37A- # B-ab6142 # C-Wxy66 # qrs93D-
s/([A-Z]-)(?![a-z]{3})([a-z0-9]*)/\2\1/g
< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> A-xyz37 # ab6142B- # C-Wxy66 # D-qrs93
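
A lookahead tests the text that follows without consuming it, so the second group still starts at the same position. The Python sketch below is an illustration under one assumption: the second group is written as ([a-z0-9]*) with no trailing space, so it stops at the first character that is not a lower-case letter or digit.

```python
import re

codes = "A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93"

# Swap only when three lower-case letters follow the dash.
swapped_if = re.sub(r"([A-Z]-)(?=[a-z]{3})([a-z0-9]*)", r"\2\1", codes)

# Swap only when three lower-case letters do NOT follow the dash.
swapped_unless = re.sub(r"([A-Z]-)(?![a-z]{3})([a-z0-9]*)", r"\2\1", codes)
```

In the negative case "C-Wxy66" is left as it is: the assertion succeeds, but the second group matches the empty string because "W" is not in [a-z0-9].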

Making regular expressions more readable


/                   # identify URLs within a text file
              [^="] # do not match URLs in IMG tags like:
                    # <img src="http://mysite.com/mypic.png">
(?:http|ftp|gopher) # make sure we find a resource type
              :\/\/ # ...needs to be followed by colon-slash-slash
          [^ \n\r]+ # anything other than space, newline, or carriage return
        (?=[\s\.,]) # assert: followed by whitespace/period/comma
/x

The URL for my site is: http://mysite.com/mydoc.html. You
might also enjoy ftp://yoursite.com/index.html for a good
place to download files.
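
In Python the equivalent of this layout is the re.VERBOSE flag, which ignores whitespace outside character classes and allows # comments. The sketch below is an adaptation, not the pattern exactly as written above: the alternation is wrapped in a non-capturing group so that it does not split the whole expression, and a capturing group (an addition for this example) collects the URL.

```python
import re

URL = re.compile(r"""
    [^="]                 # preceding character must not be = or a quote
    (                     # capture the URL itself
      (?:http|ftp|gopher) # resource type
      ://                 # colon-slash-slash
      [^ \n\r]+           # anything up to a space, newline or carriage return
    )
    (?=[\s.,])            # must be followed by whitespace, period or comma
""", re.VERBOSE)

text = ("The URL for my site is: http://mysite.com/mydoc.html. You\n"
        "might also enjoy ftp://yoursite.com/index.html for a good\n"
        "place to download files.")

urls = URL.findall(text)
in_img_tag = URL.findall('<img src="http://mysite.com/mypic.png">')
```

Because [^ \n\r]+ is greedy, the first URL is returned with its trailing period ("http://mysite.com/mydoc.html."); a stricter pattern would be needed to drop it. The URL inside the IMG tag is not matched.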

