Sgrep

From Christoph's Personal Wiki
Revision as of 01:55, 26 April 2007 by Christoph (Talk | contribs) (Sgrep (command) moved to Sgrep)

The correct title of this article is sgrep. The initial letter is capitalized due to technical restrictions.

sgrep (structured grep) is a command-line tool for searching and indexing text, SGML, XML, and HTML files, and for filtering text streams using structural criteria. It was written by Jani Jaakkola and Pekka Kilpeläinen. It is included in the SuSE Linux distribution.

The data model of sgrep is based on regions, which are nonempty substrings of text. Regions are typically occurrences of constant strings, SGML tags, or meaningful text elements that are recognized through delimiting strings or by the built-in SGML, XML, and HTML parser. Regions can be arbitrarily long, arbitrarily overlapping, and arbitrarily nested.
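
The region model can be illustrated with a small Python sketch. This is not sgrep's implementation: the names Region, find_all, encloses, containing, and inside are hypothetical, and the rule that a region does not enclose itself is an assumption about the exact semantics. It only shows how regions found as constant strings can be combined by containment.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """A nonempty substring, given by inclusive start and end offsets."""
    start: int
    end: int


def find_all(text: str, needle: str) -> List[Region]:
    """Return one region per occurrence of a constant string, overlaps included."""
    regions = []
    pos = text.find(needle)
    while pos != -1:
        regions.append(Region(pos, pos + len(needle) - 1))
        pos = text.find(needle, pos + 1)  # step by one so overlaps are kept
    return regions


def encloses(outer: Region, inner: Region) -> bool:
    """True if inner lies within outer (assumption: identical regions do not count)."""
    return outer.start <= inner.start and inner.end <= outer.end and outer != inner


def containing(a: List[Region], b: List[Region]) -> List[Region]:
    """Regions of a that enclose at least one region of b."""
    return [x for x in a if any(encloses(x, y) for y in b)]


def inside(a: List[Region], b: List[Region]) -> List[Region]:
    """Regions of a that are enclosed by at least one region of b."""
    return [x for x in a if any(encloses(y, x) for y in b)]


text = "<b>to be</b> or <b>not</b>"

# Element regions run from a start tag to its end tag.
# Pairing tags with zip is a simplification that only works without nesting.
starts = find_all(text, "<b>")
ends = find_all(text, "</b>")
bold = [Region(s.start, e.end) for s, e in zip(starts, ends)]

be = find_all(text, "be")
print(containing(bold, be))                  # bold elements that contain "be"
print(inside(find_all(text, "not"), bold))   # occurrences of "not" inside bold
```

Calling find_all("aaa", "aa") yields two regions that share a character, which is the kind of overlap the data model allows.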

Like grep, sgrep can be used on any kind of text file. However, it is most useful for files containing some kind of structured text. Sgrep is a convenient tool for querying almost any text file with a well-known structure, including programs, mail folders, news folders, HTML, SGML, and TeX. With relatively simple queries you can display mail messages by subject or sender, extract titles, links, or any other regions from HTML files, extract function prototypes from C, or make complex queries on SGML files based on the file's DTD.

Usage

Simple example

% sgrep -o "%f:%r\n" 'word("foo") or word("bar")' foobar
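
As a rough illustration of what the command above reports, the following Python sketch prints one line per match in the form file name, colon, matched text. It assumes that word() matches whole words only and that %f and %r stand for the file name and the region text; the helper grep_words and the sample text are hypothetical, and sgrep's own tokenization may differ in detail.

```python
import re
from typing import List


def grep_words(file_name: str, text: str, words: List[str]) -> List[str]:
    """Return 'file:match' lines for whole-word occurrences of any given word."""
    alternatives = "|".join(re.escape(w) for w in words)
    # \b approximates word boundaries (assumption about how word() tokenizes).
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    # finditer yields matches in order of start position, like a union of regions.
    return [f"{file_name}:{m.group(0)}" for m in pattern.finditer(text)]


sample = "foo met bar, but foobar did not"
lines = grep_words("foobar", sample, ["foo", "bar"])
for line in lines:
    print(line)
```

Under the whole-word assumption, the token "foobar" in the sample text is not reported, because neither "foo" nor "bar" stands alone in it.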

Advanced example

This example shows how to search for the famous phrase "To be or not to be: that is the question" in Jon Bosak's Revised XML Document Collections. (Note: original example.)

  • Step 1: Create a file called 'query' containing the following:
# Finds elements having given name
define(ELEMENT, (stag($1) .. etag($1)))

# Finds LINE elements
define(E_LINE, (ELEMENT("LINE")))

# Finds SPEECH elements
define(E_SPEECH, (ELEMENT("SPEECH")))

# Finds SPEECH elements where HAMLET is speaking
define(HAMLET_SPEAKING, (E_SPEECH containing (
                ELEMENT("SPEAKER") containing word("HAMLET"))))

# Finds LINE elements containing words to, be, not and question
define(TOBENOTQUESTION, (E_LINE containing word("to") containing word("be")
   containing word("not") containing word("question")))   

# Finds the LINE where HAMLET says the famous words
define(HAMLET_SAYS, (TOBENOTQUESTION in HAMLET_SPEAKING))
  • Step 2: Create a file called 'Bosak_filelist.txt' containing a list of all Shakespeare XML files.
  • Step 3: Create an index of the input XML texts (greatly improves performance; checked with time command):
% sgrep -I -c index -v -F Bosak_filelist.txt
  • Step 4: Issue the following command:
% sgrep -x index -o "%f:\n %r\n" -f query -e HAMLET_SAYS -F Bosak_filelist.txt
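
The file list from Step 2 can be generated rather than typed by hand. The sketch below assumes that the list file holds one file name per line and that the Shakespeare XML files sit together in a single directory; the function write_file_list and the directory name in the usage comment are hypothetical.

```python
from pathlib import Path
from typing import List


def write_file_list(xml_dir: str, list_path: str) -> List[str]:
    """Write the names of all *.xml files in xml_dir to list_path, one per line."""
    names = sorted(str(p) for p in Path(xml_dir).glob("*.xml"))
    Path(list_path).write_text("".join(name + "\n" for name in names))
    return names


# Example call (the directory name is an assumption):
# write_file_list("shakespeare", "Bosak_filelist.txt")
```

Sorting the names keeps the list stable between runs, so an index built from it refers to the same files in the same order.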
