Difference between revisions of "Agrep"
(→See also) |
|||
(One intermediate revision by the same user not shown) | |||
Line 199: | Line 199: | ||
*[[rgrep]] — a recursive, highlighting grep program | *[[rgrep]] — a recursive, highlighting grep program | ||
*[[ngrep]] — network grep | *[[ngrep]] — network grep | ||
+ | *[http://petdance.com/ack/ ack] — an extended grep utility written in [[Perl]] | ||
+ | *[http://pypi.python.org/pypi/FuGrep/ FuGrep] — [[Python]] bindings to agrep, a fuzzy string matching tool (better than agrepy) | ||
+ | *[http://luggage.bcs.uwa.edu.au/~michaelw//pyagrep.html agrepy] — [[Python]] port of agrep string matching with errors (aka <code>pyagrep</code>) | ||
==External links== | ==External links== |
Latest revision as of 06:33, 20 April 2012
agrep (approximate grep) is a "fuzzy string searching" program or command line tool for use with the Linux operating system.
It selects the best-suited algorithm for the current query from a variety of the known fastest (built-in) string searching algorithms, including a bitap algorithm based on Levenshtein distances.
agrep is also the search engine in the indexer program GLIMPSE. It is free for private and non-commercial use only, and belongs to the University of Arizona.
Contents
Formal definition
Note: The following is taken directly from the agrep README file.
The three most significant features of agrep that are not supported by the grep family are:
- the ability to search for approximate patterns;
- for example, "
agrep -2 homogenos foo
" will find homogeneous as well as any other word that can be obtained from homogenos with at most 2 substitutions, insertions, or deletions. - "
agrep -B homogenos foo
" will generate a message of the form best match has 2 errors, there are 5 matches, output them? (y/n)
- for example, "
- agrep is record oriented rather than just line oriented; a record is by default a line, but it can be user defined;
- for example, "
agrep -d '^From ' 'pizza' mbox
" outputs all mail messages that contain the keyword "pizza
". - Another example: "
agrep -d '$$' pattern foo
" will output all paragraphs (separated by an empty line) that contain pattern.
- for example, "
- multiple patterns with AND (or OR) logic queries.
- For example, "
agrep -d '^From ' 'burger,pizza' mbox
" outputs all mail messages containing at least one of the two keywords (,
stands forOR
). - "
agrep -d '^From ' 'good;pizza' mbox
" outputs all mail messages containing both keywords.
- For example, "
Putting these options together one can ask queries like:
agrep -d '$$' -2 '<CACM>;TheAuthor;Curriculum;<198[5-9]>' bib
which outputs all paragraphs referencing articles in CACM between 1985 and 1989 by TheAuthor dealing with curriculum. Two errors are allowed, but they cannot be in either CACM or the year (the <>
brackets forbid errors in the pattern between them).
Other features include searching for regular expressions (with or without errors), unlimited wild cards, limiting the errors to only insertions or only substitutions or any combination, allowing each deletion, for example, to be counted as, say, 2 substitutions or 3 insertions, estricting parts of the query to be exact and parts to be approximate, and many more.
agrep is available by anonymous ftp from cs.arizona.edu (IP 192.12.69.5
) as agrep/agrep-2.04.tar.Z
(or in uncompressed form as agrep/agrep-2.04.tar
). The tar file contains
- the source code (in C);
- man pages (
agrep.1
); and - two additional files,
agrep.algorithms
andagrep.chronicle
, giving more information.
The agrep directory also includes two postscript files:
agrep.ps.1
is a technical report from June 1991 describing the design and implementation of agrep;agrep.ps.2
is a copy of the paper as appeared in the 1992 Winter USENIX conference.
Usage
Note: The following is generally only for TRE agrep.
agrep [options] [-f patternfile] pattern [files]
- A first example:
agrep Stine *
searches all files in the current directory for any occurrences of the pattern Stine
. As AGREP searches are case-sensitive by default, here it would find abcStinexyz
but it would not find abcstinexyz
.
- A second example:
agrep -ia résumé * agrep -ia resume *
would both find "Résumé", "RÉSUMÉ", "resume", "Resümee" (and also e.g. "rèsümê").
The -ia
option maps characters with accents or "Umlauts" to the corresponding unaccented letter. The German ß as in Straße (meaning street) is treated as a single s.
Note: The search pattern must be enclosed in "double quotes" if it contains metasymbols. A good practice is always to include the search pattern in double quotes.
Options
Note: see agrep --help for full list.
- Regexp selection and interpretation:
-e, --regexp=PATTERN use PATTERN as a regular expression -i, --ignore-case ignore case distinctions -k, --literal PATTERN is a literal string -w, --word-regexp force PATTERN to match only whole words
- Approximate matching settings:
-D, --delete-cost=NUM set cost of missing characters -I, --insert-cost=NUM set cost of extra characters -S, --substitute-cost=NUM set cost of wrong characters -E, --max-errors=NUM select records that have at most NUM errors -# select records that have at most # errors (# is a digit between 0 and 9)
- Miscellaneous:
-d, --delimiter=PATTERN set the record delimiter regular expression -v, --invert-match select non-matching records -V, --version print version information and exit -y, --nothing does nothing (for compatibility with the non-free agrep program) --help display this help and exit
- Output control:
-B, --best-match only output records with least errors -c, --count only print a count of matching records per FILE -h, --no-filename suppress the prefixing filename on output -H, --with-filename print the filename for each match -l, --files-with-matches only print FILE names containing matches -M, --delimiter-after print record delimiter after record if -d is used -n, --record-number print record number with output --line-number same as -n -s, --show-cost print match cost with output --colour, --color use markers to distinguish the matching strings --show-position prefix each output record with start and end position of the first match within the record
With no FILE, or when FILE is -, reads standard input. If less than two FILEs are given, -h is assumed. Exit status is 0 if a match is found, 1 for no match, and 2 if there were errors. If -E or -# is not specified, only exact matches are selected.
PATTERN is a POSIX extended regular expression (ERE) with the TRE extensions. See tre(7) for a complete description.
Metasymbols
\z turns off any special meaning of character z (\# matches #) ^ begin-of-line symbol $ end-of-line symbol . matches any single character (except newline) # matches any number > 0 of arbitrary characters (×)* matches zero or more instances of preceding token × (Kleene closure) ×(×)* matches one or more instances of preceding token × (Positive closure) (Use this as replacement for (×)+ which is not implemented yet.)
Sets
[b-dq-tz] matches characters b c d q r s t z [^b-diq-tz] matches all characters except b c d i q r s t z ab|cd matches "ab" or "cd" <abcd> matches exactly, no errors allowed in string "abcd" (overrides the -1 option)
Operators (and, or)
The operators ; (and) and , (or) must not appear together in a pattern.
cat;dog matches records having "cat" and "dog" cat,dog matches records having "cat" or "dog"
The Kleene closure of the language A is the language formed by the union of zero and more concatenations of A. The Positive closure of the language A is the language formed by the union of one and more concatenations of A.
Extended examples
- show lines in file foo having strings "color" or "colour" or "colonizer" or "coloniser" etc:
agrep "colo#r" foo
- count lines in file foo having string "miscellaneous", within 2 errors, case insensitive:
agrep -2 -ci miscellaneous foo
- show lines in file foo having string "From" at the beginning of a line and string ".edu" at the end of the line:
agrep "^From#\.edu$" foo or agrep --regexp='^From.*\.edu$' foo
- show lines in file foo having string beginning "abc", followed by one digit, then zero or more repetitions of "de" or "fg", and finally x, y or z:
agrep "abc[0-9](de|fg)*[x-z]" foo
- show messages in file mbox having string "search" and string "retriev" (Messages are delimited by the string "From " at the beginning of a line):
agrep -d "^From " "search;retriev" mbox
- show lines in file foo having string "bug report", or string "bug" at end of a line and the string "report" at the beginning of the next line:
agrep -1 -d "$$" "<bug> <report>" foo
- find records in file foo that contain a supersequence of the pattern: "EPO" will match "European Patent Office":
agrep -p "EPO" foo
- matches "74LS04" because of the digit-digit-letter(..) pattern:
agrep -i# "11zz11" foo
- case-insensitive search for needle in file foo with no output at all. The -V0 option even avoids the display of number of "Grand Total" matches:
agrep -isV0 needle foo
agrep in R
- Description
Searches for approximate matches to pattern (the first argument) within the string x (the second argument) using the Levenshtein edit distance.
- Usage
agrep(pattern, x, ignore.case = FALSE, value = FALSE, max.distance = 0.1)
- Arguments
- pattern
- a non-empty character string to be matched (not a regular expression!). Coerced by as.character to a string if possible.
- x
- character vector where matches are sought. Coerced by as.character to a character vector if possible.
- ignore.case
- if FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.
- value
- if FALSE, a vector containing the (integer) indices of the matches determined is returned and if TRUE, a vector containing the matching elements themselves is returned.
- max.distance
- Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length (will be replaced by the smallest integer not less than the corresponding fraction), or a list with possible components
- all
- maximal (overall) distance
- insertions
- maximum number/fraction of insertions
- deletions
- maximum number/fraction of deletions
- substitutions
- maximum number/fraction of substitutions
- If all is missing, it is set to 10%, the other components default to all. The component names can be abbreviated.
- Details
The Levenshtein edit distance is used as measure of approximateness: it is the total number of insertions, deletions and substitutions required to transform one string into another. The function is a simple interface to the apse library developed by Jarkko Hietaniemi (also used in the Perl String::Approx
module).
- Value
Either a vector giving the indices of the elements that yielded a match, of, if value is TRUE, the matched elements (after coercion, preserving names but no other attributes).
- Examples:
agrep("lasy", "1 lazy 2") agrep("lasy", "1 lazy 2", max = list(sub = 0)) agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2) agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, value = TRUE) agrep("laysy", c("1 lazy", "1", "1 LAZY"), max = 2, ignore.case = TRUE)
Variations
The two most common flavours of agrep are:
- Wu-Manber agrep; and
- TRE agrep
TRE agrep is the more recent of the two and is the command-line tool provided with the TRE regular expression library. TRE agrep is more powerful than Wu-Manber agrep since it allows weights and total costs to be assigned separately to individual groups in the pattern. TRE agrep allows full regexps of any length, any number of errors, and non-uniform costs for insertion, deletion, and substitution. It can also handle Unicode. Unlike Wu-Manber agrep, TRE agrep is licensed under the GNU LGPL.
See also
- grep (also fgrep and egrep)
- rgrep — a recursive, highlighting grep program
- ngrep — network grep
- ack — an extended grep utility written in Perl
- FuGrep — Python bindings to agrep, a fuzzy string matching tool (better than agrepy)
- agrepy — Python port of agrep string matching with errors (aka
pyagrep
)
External links
- Wu-Manber agrep for Unix
- cgrep a command line approximate string matching tool
- nrgrep a command line approximate string matching tool
- TRE regexp matching package
- TRE agrep — lots of useful information