Difference between revisions of "Awk"

From Christoph's Personal Wiki

Revision as of 22:48, 25 May 2006

AWK is a general-purpose computer language designed for processing text-based data, either in files or in data streams.

Awk is an example of a programming language that extensively uses the string datatype, associative arrays (that is, arrays indexed by key strings), and regular expressions. The power, terseness, and limitations of awk programs and sed scripts inspired Larry Wall to write Perl.

Structure of awk programs

Generally speaking, two pieces of data are given to awk: a command file and a primary input file. A command file (which can be an actual file, or can be included in the command line invocation of awk) contains a series of commands which tell awk how to process the input file. The primary input file is typically text that is formatted in some way; it can be an actual file, or it can be read by awk from the standard input. A typical awk program consists of a series of lines, each of the form

/pattern/ { action }

where pattern is a regular expression and action is one or more commands. Awk looks through the input file; when it finds a line that matches pattern, it executes the command(s) specified in action. Alternative line forms include:

BEGIN { action }
Executes action commands at the beginning of the script execution, i.e., before any of the lines are processed.
END { action }
Similar to the previous form, but executes action after the end of input.
/pattern/
Prints any lines matching pattern.
{ action }
Executes action for each line in the input.

Each of these forms can be included multiple times in the command file. Lines in the command file are executed in order, so if there are two "BEGIN" statements, the first is executed, then the second, and then the rest of the lines. BEGIN and END statements do not have to be located before and after (respectively) the other lines in the command file.
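As a minimal sketch of how these forms combine, the following shell snippet runs one awk program over three lines of made-up input. The sample data and the shell variable name result are illustrative assumptions. BEGIN is deliberately listed last to show that its position in the command file does not matter.

```shell
# Sample input (made up for illustration): three lines, two contain "apple".
result=$(printf 'apple pie\nbanana split\napple tart\n' | awk '
/apple/ { n++ }                  # runs for each line matching the pattern
/tart/                           # no action given: prints matching lines
END     { print "matches:", n }  # runs once, after the last line
BEGIN   { print "start" }        # still runs first, despite being listed last
')
echo "$result"
```

With this input the program should print "start", then the one line containing "tart", then the match count.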

Awk was created as a broad-based replacement for the algorithmic approaches to text parsing that would otherwise be written in C.

Awk commands

Awk commands are the statements that are substituted for action in the examples above. Awk commands can include function calls, variable assignments, calculations, or any combination thereof. Awk contains built-in support for many functions; many more are provided by the various flavors of awk. Also, some flavors support the inclusion of dynamically linked libraries, which can also provide more functions.

For brevity, the enclosing curly braces ( { } ) will be omitted from these examples.

The print command

The print command is used to output text. The simplest form of this command is

print

This displays the contents of the current line. In awk, lines are broken down into fields, and these can be displayed separately:

print $1
Displays the first field of the current line
print $1, $3
Displays the first and third fields of the current line, separated by a predefined string called the output field separator (OFS) whose default value is a single space character

Although these fields ($X) may bear resemblance to variables (the $ symbol indicates variables in Perl), they actually refer to the fields of the current line. A special case, $0, refers to the entire line. In fact, the commands "print" and "print $0" are identical in functionality.
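The effect of the output field separator can be sketched as follows. The input line and the shell variable names are made up for illustration; OFS itself is the standard awk variable described above.

```shell
# Made-up input line with three whitespace-separated fields.
line='alpha beta gamma'

# With the default OFS, the comma in print yields a single space.
default_out=$(echo "$line" | awk '{ print $1, $3 }')

# Setting OFS in a BEGIN block changes what the comma produces.
custom_out=$(echo "$line" | awk 'BEGIN { OFS = "-" } { print $1, $3 }')

echo "$default_out"   # alpha gamma
echo "$custom_out"    # alpha-gamma
```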

The print command can also display the results of calculations and/or function calls:

print 3+2
print foobar(3)
print foobar(variable)
print sin(3-2)

Output may be sent to a file:

print "expression" > "file name"
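A small sketch of file redirection, assuming a temporary directory created with mktemp and a hypothetical file name copy.txt. Within a single awk run, the file is truncated when it is first opened and later print statements to the same name append to it, so both input lines end up in the file.

```shell
# Write every input line to a file whose name is passed in as the awk variable "out".
tmpdir=$(mktemp -d)
printf 'one\ntwo\n' | awk -v out="$tmpdir/copy.txt" '{ print $0 > out }'

# Read the file back to confirm that both lines were written.
written=$(cat "$tmpdir/copy.txt")
echo "$written"
rm -r "$tmpdir"
```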

Variables, et cetera

Variable names can use any of the characters [A-Za-z0-9_] but cannot begin with a digit, and language keywords cannot be used as names. The operators + - * / are addition, subtraction, multiplication, and division, respectively. For string concatenation, simply place two variables (or string constants) next to each other, optionally with a space in between. String constants are delimited by double quotes. Statements need not end with semicolons. Finally, comments can be added to programs with #; everything from the # to the end of the line is ignored.
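These rules can be illustrated with a short sketch that runs entirely in a BEGIN block, so no input file is needed. The variable names are arbitrary choices for the example.

```shell
result=$(awk 'BEGIN {
    first = "foo"            # string constants are delimited by double quotes
    second = "bar"
    joined = first second    # concatenation: place the two values side by side
    total = 3 + 2 * 4        # multiplication binds tighter than addition
    print joined, total
}')
echo "$result"   # foobar 11
```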

User-defined functions

In a format similar to C, function definitions consist of the keyword function, the function name, argument names and the function body. Here is an example function:

function add_three(number,    temp) {
  temp = number + 3
  return temp
}

This function can be invoked as follows:

print add_three(36)     # prints 39

Functions can have variables that are in the local scope. The names of these are added to the end of the argument list, though values for these should be omitted when calling the function. It is convention to add some whitespace in the argument list before the local variables, in order to indicate where the parameters end and the local variables begin.
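The following sketch shows the local-variable convention in use. The function name sum_to and its parameter names are hypothetical; the point is that the extra parameters i and sum are private to the function, so a global variable with the same name is left untouched.

```shell
result=$(awk '
# "i" and "sum" are locals by convention: extra parameters, never passed in.
function sum_to(n,    i, sum) {
    for (i = 1; i <= n; i++)
        sum += i             # sum starts out empty, which counts as 0
    return sum
}
BEGIN {
    i = 99                   # global i
    print sum_to(4), i       # the call does not disturb the global i
}')
echo "$result"   # 10 99
```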

Sample Applications

Here is the ubiquitous "Hello, world!" program written in AWK:

BEGIN { print "Hello, world!" }

Print all lines longer than 80 characters. Note that the default action is to print the current line.

length > 80 

Word Count

{ w += NF; c += length}
END { print NR, w, c }

Sum 1st column of input

{ s += $1 }
END { print s }

Word Frequency (uses associative arrays)

{ for (i=1; i<=NF; i++)
     words[$i]++
}

END { for (i in words)
    print i, words[i]
}

Awk versions and implementations

In 1985 awk's authors started expanding the language, most significantly by adding user-defined functions. The language is described in the book The AWK Programming Language, published in 1988. To avoid confusion with the incompatible older version, this version was sometimes known as "new awk" or nawk. This implementation was released under a free software license in 1996, and is still maintained by Brian Kernighan.

GNU awk, or gawk, is another free software implementation. It was written before the original implementation became freely available, and is still widely used.

Downloads and further information about these versions are available from the sites listed below ("External links").

Christoph's Additions

% sort -rn Ecoli_K-12_MG1655_Main.top.travers | gawk '{if($1 <= x.xx) {print $1}}' | wc -l

% gawk '{print $2}' Ecoli_K-12_MG1655_Main.top.travers | sort > Ecoli_K-12_MG1655_Main.top.travers.col2
% gawk '{print $2}' Ecoli_K-12_MG1655_Main.top.cai | sort > Ecoli_K-12_MG1655_Main.top.cai.col2
% comm -12 Ecoli_K-12_MG1655_Main.top.travers.col2 Ecoli_K-12_MG1655_Main.top.cai.col2 | wc -l

Books

Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. The AWK Programming Language. Addison-Wesley, 1988. ISBN 0-201-07981-X. http://cm.bell-labs.com/cm/cs/awkbook/ The book's webpage includes downloads of the original implementation of Awk and links to others.

Arnold Robbins. GAWK: Effective AWK Programming: A User's Guide for GNU Awk. Edition 3. http://www.gnu.org/software/gawk/manual/html_node/index.html

Dale Dougherty and Arnold Robbins. sed & awk, Second Edition. O'Reilly Media, March 1997. ISBN 1-56592-225-5. http://www.oreilly.com/catalog/sed2/

External links