Difference between revisions of "Agrep"

From Christoph's Personal Wiki

Revision as of 01:47, 26 April 2007

agrep (approximate grep) is a command line tool for "fuzzy" (approximate) string searching, for use with the Linux operating system.

It selects the best-suited algorithm for the current query from several built-in string searching algorithms that are among the fastest known, including a bitap algorithm based on Levenshtein distance.
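
To make the idea of matching "within k errors" concrete, here is a minimal Python sketch of approximate substring matching under Levenshtein distance. It uses plain dynamic programming rather than the bit-parallel bitap technique, so it is only an illustration of the result being computed, not agrep's implementation; the function name `min_errors` is made up for this example.

```python
def min_errors(pattern: str, text: str) -> int:
    """Smallest Levenshtein distance between pattern and any substring of text."""
    # prev[i] = cheapest way to match the first i pattern characters against
    # some substring of text that ends at the previous text position.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        # Entry 0 is always 0: a match may start at any position in text.
        cur = [0]
        for i, p in enumerate(pattern, start=1):
            cur.append(min(
                prev[i - 1] + (p != ch),  # exact match or substitution
                prev[i] + 1,              # extra character in text
                cur[i - 1] + 1,           # pattern character missing from text
            ))
        prev = cur
        best = min(best, prev[-1])
    return best


if __name__ == "__main__":
    print(min_errors("Stine", "abcStinexyz"))   # exact occurrence
    print(min_errors("Stine", "abcstinexyz"))   # one substitution needed
```

A line would then count as a hit when `min_errors(pattern, line)` is at most the allowed number of errors.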

agrep is also the search engine in the indexer program GLIMPSE. It is free for private and non-commercial use only, and belongs to the University of Arizona.

Variations

The two most common flavours of agrep are:

  • Wu-Manber agrep; and
  • TRE agrep

TRE agrep is the more recent of the two and is the command-line tool provided with the TRE regular expression library. TRE agrep is more powerful than Wu-Manber agrep since it allows weights and total costs to be assigned separately to individual groups in the pattern. TRE agrep allows full regexps of any length, any number of errors, and non-uniform costs for insertion, deletion, and substitution. It can also handle Unicode. Unlike Wu-Manber agrep, TRE agrep is licensed under the GNU LGPL.
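
The effect of non-uniform costs can be sketched in Python as a weighted edit distance between two whole strings. This is an illustration under assumed semantics (an insertion is an extra character in the text, a deletion is a pattern character missing from the text); the function and parameter names are hypothetical and are not taken from the TRE library.

```python
def weighted_cost(pattern, text, insert_cost=1, delete_cost=1, substitute_cost=1):
    """Cheapest total cost of editing pattern into text (hypothetical helper)."""
    # prev[j] = cost of turning the pattern prefix seen so far into text[:j].
    # With an empty pattern prefix, every text character is an insertion.
    prev = [j * insert_cost for j in range(len(text) + 1)]
    for p in pattern:
        # Against an empty text prefix, every pattern character is deleted.
        cur = [prev[0] + delete_cost]
        for j, t in enumerate(text, start=1):
            cur.append(min(
                prev[j - 1] + (0 if p == t else substitute_cost),  # match / wrong character
                prev[j] + delete_cost,     # pattern character missing from text
                cur[j - 1] + insert_cost,  # extra character in text
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(weighted_cost("color", "colour"))                  # one insertion
    print(weighted_cost("color", "colour", insert_cost=3))   # same edit, higher price
```

Raising one cost relative to the others makes that kind of error less likely to appear in the cheapest alignment; for instance, with a high substitution cost, a deletion plus an insertion can become cheaper than a single substitution.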

Usage

Note: The following is generally only for TRE agrep.

agrep [options] [-f patternfile] pattern [files]
  • A first example:
agrep Stine *

searches all files in the current directory for any occurrences of the pattern Stine. As agrep searches are case-sensitive by default, it would find abcStinexyz but not abcstinexyz.

  • A second example:
agrep -ia résumé *
agrep -ia resume *

would both find "Résumé", "RÉSUMÉ", "resume", "Resümee" (and also e.g. "rèsümê").

The -ia option maps characters with accents or "Umlauts" to the corresponding unaccented letter. The German ß as in Straße (meaning street) is treated as a single s.
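
The mapping described above can be approximated in Python with Unicode decomposition. This is a sketch of the described behaviour only, not agrep's code; `fold_accents` is a hypothetical helper, and the special case for ß is an assumption taken from the sentence above.

```python
import unicodedata


def fold_accents(s: str) -> str:
    """Lower-case s and strip accents; ß becomes a single s (as described above)."""
    # lower() is used instead of casefold(), because casefold() maps ß to "ss".
    s = s.lower().replace("ß", "s")
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", s)
    # Dropping the combining marks leaves only the unaccented base letters.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    for word in ("Résumé", "RÉSUMÉ", "resume", "Resümee", "rèsümê", "Straße"):
        print(word, "->", fold_accents(word))
```

After folding, an ordinary substring search for "resume" finds all of the variants listed in the example, including "Resümee", which folds to "resumee".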

Note: The search pattern must be quoted if it contains metasymbols. It is good practice always to quote the search pattern. In a Unix shell, single quotes are the safest choice, because $, \ and ` keep their special meaning inside double quotes.

Options

Note: see agrep --help for full list.

  • Regexp selection and interpretation:
 -e, --regexp=PATTERN      use PATTERN as a regular expression
 -i, --ignore-case         ignore case distinctions
 -k, --literal             PATTERN is a literal string
 -w, --word-regexp         force PATTERN to match only whole words
  • Approximate matching settings:
 -D, --delete-cost=NUM     set cost of missing characters
 -I, --insert-cost=NUM     set cost of extra characters
 -S, --substitute-cost=NUM set cost of wrong characters
 -E, --max-errors=NUM      select records that have at most NUM errors
 -#                        select records that have at most # errors (# is a
                           digit between 0 and 9)
  • Miscellaneous:
 -d, --delimiter=PATTERN   set the record delimiter regular expression
 -v, --invert-match        select non-matching records
 -V, --version             print version information and exit
 -y, --nothing             does nothing (for compatibility with the non-free
                           agrep program)
     --help                display this help and exit
  • Output control:
 -B, --best-match          only output records with least errors
 -c, --count               only print a count of matching records per FILE
 -h, --no-filename         suppress the prefixing filename on output
 -H, --with-filename       print the filename for each match
 -l, --files-with-matches  only print FILE names containing matches
 -M, --delimiter-after     print record delimiter after record if -d is used
 -n, --record-number       print record number with output
     --line-number         same as -n
 -s, --show-cost           print match cost with output
     --colour, --color     use markers to distinguish the matching strings
     --show-position       prefix each output record with start and end
                           position of the first match within the record

With no FILE, or when FILE is -, agrep reads standard input. If fewer than two FILEs are given, -h is assumed. Exit status is 0 if a match is found, 1 for no match, and 2 if there were errors. If neither -E nor -# is specified, only exact matches are selected.

PATTERN is a POSIX extended regular expression (ERE) with the TRE extensions. See tre(7) for a complete description.

Metasymbols

\z 	turns off any special meaning of character z (\# matches #)
^ 	begin-of-line symbol
$ 	end-of-line symbol
. 	matches any single character (except newline)
# 	matches any number (zero or more) of arbitrary characters
(×)* 	matches zero or more instances of preceding token × (Kleene closure)
×(×)* 	matches one or more instances of preceding token × (Positive closure)
        (Use this as replacement for (×)+ which is not implemented yet.)

Sets

[b-dq-tz]      matches characters b c d q r s t z
[^b-diq-tz]    matches all characters except b c d i q r s t z
ab|cd          matches "ab" or "cd"
<abcd>         matches exactly, no errors allowed in string "abcd" (overrides the -1 option)

Operators (and, or)

The operators ; (and) and , (or) must not appear together in a pattern.

cat;dog        matches records having "cat" and "dog"
cat,dog        matches records having "cat" or "dog"

The Kleene closure of the language A is the language formed by the union of zero or more concatenations of A. The Positive closure of the language A is the language formed by the union of one or more concatenations of A.
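
The difference between the two closures can be checked with Python's `re` module, used here only as a stand-in: it is not a POSIX ERE engine, but grouping, alternation and `*` behave the same way for these patterns. The helper `in_language` is a name invented for this sketch.

```python
import re

# A = {"de", "fg"}
KLEENE = r"(?:de|fg)*"             # zero or more concatenations of A
POSITIVE = r"(?:de|fg)(?:de|fg)*"  # one or more, written in the A(A)* form


def in_language(regex: str, s: str) -> bool:
    """True if the whole string s belongs to the language described by regex."""
    return re.fullmatch(regex, s) is not None


if __name__ == "__main__":
    for s in ("", "de", "defgde", "def"):
        print(repr(s), in_language(KLEENE, s), in_language(POSITIVE, s))
```

The two languages differ only in the empty string, which belongs to the Kleene closure but not to the Positive closure.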

Extended examples

  • show lines in file foo having strings "color" or "colour" or "colonizer" or "coloniser" etc:
agrep "colo#r" foo
  • count lines in file foo having string "miscellaneous", within 2 errors, case insensitive:
agrep -2 -ci miscellaneous foo
  • show lines in file foo having string "From" at the beginning of a line and string ".edu" at the end of the line:
agrep "^From#\.edu$" foo
 or
agrep --regexp='^From.*\.edu$' foo
  • show lines in file foo having string beginning "abc", followed by one digit, then zero or more repetitions of "de" or "fg", and finally x, y or z:
agrep "abc[0-9](de|fg)*[x-z]" foo
  • show messages in file mbox having string "search" and string "retriev" (Messages are delimited by the string "From " at the beginning of a line):
agrep -d "^From " "search;retriev" mbox
  • show lines in file foo having string "bug report", or string "bug" at end of a line and the string "report" at the beginning of the next line:
agrep -1 -d '$$' '<bug> <report>' foo
  • find records in file foo that contain a supersequence of the pattern: "EPO" will match "European Patent Office":
agrep -p "EPO" foo
  • match by character class instead of exact character: the pattern "11zz11" matches "74LS04" because both follow the digit-digit-letter-letter-digit-digit pattern:
agrep -i# "11zz11" foo
  • case-insensitive search for needle in file foo with no output at all. The -V0 option even avoids the display of number of "Grand Total" matches:
agrep -isV0 needle foo
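
The record-based search in the mbox example above can be mimicked in a few lines of Python. This is a rough sketch under two assumptions that may not hold for agrep itself: each record is taken to start at a match of the delimiter, and the "and" operator is treated as a plain exact substring test. The names `split_records` and `select_all` are hypothetical.

```python
import re


def split_records(text, delimiter):
    """Split text into records, each beginning at a match of the delimiter regex."""
    starts = [m.start() for m in re.finditer(delimiter, text, flags=re.MULTILINE)]
    # Text before the first delimiter (if any) forms a record of its own.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends) if a != b]


def select_all(records, *needles):
    """Keep the records that contain every needle (the ';' operator, exact match)."""
    return [r for r in records if all(n in r for n in needles)]


if __name__ == "__main__":
    mbox = (
        "From alice\nfast search and retrieval\n"
        "From bob\nsearch only\n"
        "From carol\nretrieving via search\n"
    )
    for record in select_all(split_records(mbox, r"^From "), "search", "retriev"):
        print(record, end="")
```

Because whole messages are the unit of matching, "search" and "retriev" may appear on different lines of the same message and the message is still selected.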

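See also

  • grep

External links

  • Wu-Manber agrep for Unix (ftp://ftp.cs.arizona.edu/agrep/)
  • cgrep, a command line approximate string matching tool (http://www.bell-labs.com/project/wwexptools/cgrep/)
  • nrgrep, a command line approximate string matching tool (http://www.dcc.uchile.cl/~gnavarro/software/)
  • TRE regexp matching package (http://laurikari.net/tre)
  • TRE agrep, lots of useful information (http://www.tgries.de/)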