Difference between revisions of "Awk/scripts"

From Christoph's Personal Wiki
 
(Passing in Bash / environment variables)
 
(24 intermediate revisions by the same user not shown)
Line 1: Line 1:
==Examples==
+
see: Main [[Awk]] article.
''Note: Taken directly from Patrick Hartigan's awk page. The # is the comment character for awk. 'field' means 'column'.''
+
see also: [http://www.pement.org/awk/awk1line.txt Handy one-liners for Awk] (v.0.25)
<pre>
+
==Basic examples==
# Print first two fields in opposite order:
+
''Note: The examples in this section have been taken directly from the original Awk paper.<ref name="Aho">Aho AV, Kernighan BW, Weinberger PJ (1978). "Awk - A pattern scanning and processing language". Second Edition, Bell Laboratories, 8 pp.</ref>''
  awk '{ print $2, $1 }' file
+
  
# Print lines longer than 72 characters:
+
===Printing===
  awk 'length > 72' file
+
*Print all input lines whose length exceeds 72 characters:
 +
length > 72
 +
*Print all lines with an even number of fields:
 +
NF % 2 == 0
 +
*Replace the first field of each line by its logarithm:
 +
{ $1 = log($1); print }
 +
*Print the third and second columns of a table in that order:
 +
{print $3,$2}
 +
*Print all input lines with an A, B, or C in the second field:
 +
$2 ~ /A|B|C/
 +
*Print all lines in which the first field is different from the previous first field:
 +
$1 != prev { print; prev = $1 }
 +
*Print each record preceded by the record number and the number of fields:
 +
{print NR, NF, $0}
 +
*Write the first field, <code>$1</code>, on the file <code>foo1</code>, and the second field on the file <code>foo2</code>:
 +
{print $1>"foo1"; print $2>"foo2"}
 +
*Append the output to the file <code>foo</code>:
 +
print $1>>"foo"
 +
*Use the contents of field 2 as a file name:
 +
print $1>$2
 +
*Print <code>$1</code> as a floating point number 8 digits wide, with two after the decimal point, and <code>$2</code> as a 10-digit long decimal number, followed by a newline:
 +
printf "%8.2f %10ld\n", $1, $2
  
# Print length of string in 2nd column
+
===BEGIN and END===
  awk '{print length($2)}' file
+
''Note: The special pattern <code>BEGIN</code> matches the beginning of the input, before the first record is read. The pattern <code>END</code> matches the end of the input, after the last record has been processed. <code>BEGIN</code> and <code>END</code> thus provide a way to gain control before and after processing, for initialization and wrapup.''
  
# Add up first column, print sum and average:
+
*As an example, the field separator can be set to a colon by:
      { s += $1 }
+
BEGIN { FS = ":" }
  END { print "sum is", s, " average is", s/NR }
+
  ... rest of program ...
 +
*Or, the input lines may be counted by:
 +
END { print NR }
  
# Print fields in reverse order:
+
If <code>BEGIN</code> is present, it must be the first pattern; <code>END</code> must be the last if used.
  awk '{ for (i = NF; i > 0; --i) print $i }' file
+
  
# Print the last line
+
===Regular expressions===
      {line = $0}
+
*Print all lines which contain any occurrence of the name "smith":
  END {print line}
+
/smith/
  
# Print the total number of lines that contain the word Pat
+
*<code>()</code> (parentheses) &mdash; for grouping;
  /Pat/ {nlines = nlines + 1}
+
*<code>|</code> (pipe) &mdash; for alternatives;
  END {print nlines}
+
*<code>+</code> &mdash; for "one or more"; and
 +
*<code>?</code> &mdash; for "zero or one"
  
# Print all lines between start/stop pairs:
+
*Print all lines which contain any of the names "Aho", "Weinberger", or "Kernighan", whether capitalized or not:
  awk '/start/, /stop/' file
+
/[Aa]ho|[Ww]einberger|[Kk]ernighan/
 +
*Match any string of characters enclosed in slashes:
 +
/\/.*\//
 +
*Print all lines where the first field matches "john" or "John" (also matches "Johnson", etc.):
 +
$1 ~ /[jJ]ohn/
 +
*Match exactly "john" or "John":
 +
$1 ~ /^[jJ]ohn$/
  
# Print all lines whose first field is different from previous one:
+
===Relational expressions===
  awk '$1 != prev { print; prev = $1 }' file
+
The relational operators are: <code><</code>, <code><=</code>, <code>==</code>, <code>!=</code>, <code>>=</code>, and <code>></code>.
  
# Print column 3 if column 1 > column 2:
+
*Select lines where the second field is at least 100 greater than the first field:
  awk '$1 > $2 {print $3}' file
+
$2 >= $1 + 100
 +
*Print lines with an even number of fields:
 +
NF % 2 == 0
 +
*Select lines that begin with an "s", "t", "u", etc.:
 +
$1 >= "s"
 +
*Perform a string comparison (note: In the absence of any other information, fields are treated as strings):
 +
$1 > $2
  
# Print line if column 3 > column 2:
+
===Conditional expressions===
  awk '$3 > $2' file
+
A conditional expression is a special kind of expression with three operands. It allows you to use one expression's value to select one of two other expressions.
  
# Count number of lines where col 3 > col 1
+
The conditional expression looks the same as in the C language (uses the <code>?:</code> ternary operator):
  awk '$3 > $1 {print i + "1"; i++}' file
+
selector ? if-true-exp : if-false-exp
  
# Print sequence number and then column 1 of file:
+
There are three subexpressions. The first, selector, is always computed first. If it is "true" (not zero and not null) then <code>if-true-exp</code> is computed next and its value becomes the value of the whole expression. Otherwise, <code>if-false-exp</code> is computed next and its value becomes the value of the whole expression.
  awk '{print NR, $1}' file
+
  
# Print every line after erasing the 2nd field
+
For example, this expression produces the absolute value of <code>x</code>:
  awk '{$2 = ""; print}' file
+
x > 0 ? x : -x
  
# Print hi 28 times
+
Each time the conditional expression is computed, exactly one of <code>if-true-exp</code> and <code>if-false-exp</code> is computed; the other is ignored. This is important when the expressions contain side effects. For example, this conditional expression examines element <code>i</code> of either array <code>a</code> or array <code>b</code>, and increments <code>i</code>:
  yes | head -28 | awk '{ print "hi" }'
+
x == y ? a[i++] : b[i++]
  
# Print hi.0010 to hi.0099 (NOTE IRAF USERS!)
+
This is guaranteed to increment <code>i</code> exactly once, because each time one or the other of the two increment expressions is executed, and the other is not.  
  yes | head -90 | awk '{printf("hi00%2.0f \n", NR+9)}'
+
  
# Print out 4 random numbers between 0 and 1
+
===Combination of patterns===
  yes | head -4 | awk '{print rand()}'
+
A pattern can be any boolean combination of patterns, using the operators <code>||</code> (or), <code>&&</code> (and), and <code>!</code> (not).
  
# Print out 40 random integers modulo 5
+
*Select lines where the first field begins with "s", but is not "smith":
  yes | head -40 | awk '{print int(100*rand()) % 5}'
+
$1 >= "s" && $1 < "t" && $1 != "smith"
  
# Replace every field by its absolute value
+
===Pattern ranges===
  { for (i = 1; i <= NF; i=i+1) if ($i < 0) $i = -$i print}
+
*Print all lines between <code>start</code> and <code>stop</code>:
 +
/start/,/stop/
 +
*Perform an action for lines 100 through 200:
 +
NR == 100, NR == 200 { ... }
 +
*Print all lines from <code>bar</code> to end of file (or STDIN stream):
 +
$ awk '/bar/,0 {print $0}' <<< "$(echo -e "foo\nbar\nbaz")"
  
# If you have another character that delimits fields, use the -F option
+
===Built-in functions===
# For example, to print out the phone number for Jones in the following file,
+
see: [[Awk#String functions|String functions]] for full list.
# 000902|Beavis|Theodore|333-242-2222|149092
+
Arithmetic functions include: <code>sqrt</code>, <code>log</code>, <code>exp</code>, and <code>int</code>
# 000901|Jones|Bill|532-382-0342|234023
+
# ...
+
# type
+
  awk -F"|" '$2=="Jones"{print $4}' filename
+
  
# Some looping commands
+
*Print each record, preceded by its length:
# Remove a bunch of print jobs from the queue
+
{print length, $0}
  BEGIN{
+
*Print the length of its argument (here the length of the input line):
for (i=875;i>833;i--){
+
{print length($0), $0}
printf "lprm -Plw %d\n", i
+
*Print lines whose length is less than 10 or greater than 20:
} exit
+
length < 10 || length > 20
      }
+
  
Formatted printouts are of the form printf( "format\n", value1, value2, ... valueN)
+
The function <code>substr(s,m,n)</code> produces the substring of <code>s</code> that begins at position <code>m</code> (origin 1) and is at most <code>n</code> characters long. If <code>n</code> is omitted, the substring goes to the end of <code>s</code>
e.g. printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
+
%s = string
+
%-8s = 8 character string left justified
+
%.2f = number with 2 places after .
+
%6.2f = field 6 chars with 2 chars after .
+
\n is newline
+
\t is a tab
+
  
# Print frequency histogram of column of numbers
+
The function <code>index(s1,s2)</code> returns the position where the string <code>s2</code> occurs in <code>s1</code>, or zero if it does not.
$2 <= 0.1 {na=na+1}
+
($2 > 0.1) && ($2 <= 0.2) {nb = nb+1}
+
($2 > 0.2) && ($2 <= 0.3) {nc = nc+1}
+
($2 > 0.3) && ($2 <= 0.4) {nd = nd+1}
+
($2 > 0.4) && ($2 <= 0.5) {ne = ne+1}
+
($2 > 0.5) && ($2 <= 0.6) {nf = nf+1}
+
($2 > 0.6) && ($2 <= 0.7) {ng = ng+1}
+
($2 > 0.7) && ($2 <= 0.8) {nh = nh+1}
+
($2 > 0.8) && ($2 <= 0.9) {ni = ni+1}
+
($2 > 0.9) {nj = nj+1}
+
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}
+
  
# Find maximum and minimum values present in column 1
+
The function <code>sprintf(f,e1,e2,...)</code> produces the value of the expressions <code>e1</code>, <code>e2</code>, etc., in the <code>printf</code> format specified by <code>f</code>
NR == 1 {m=$1 ; p=$1}
+
$1 >= m {m = $1}
+
$1 <= p {p = $1}
+
END { print "Max = " m, "  Min = " p }
+
  
# Example of defining variables, multiple commands on one line
+
*Set <code>x</code> to the string produced by formatting the values of <code>$1</code> and <code>$2</code>:
NR == 1 {prev=$4; preva = $1; prevb = $2; n=0; sum=0}
+
x = sprintf("%8.2f %10ld", $1, $2)
$4 != prev {print preva, prevb, prev, sum/n; n=0; sum=0; prev = $4; preva = $1; prevb = $2}
+
$4 == prev {n++; sum=sum+$5/$6}
+
END {print preva, prevb, prev, sum/n}
+
  
# Example of defining and using a function, inserting values into an array
+
===Variables, expressions, and assignments===
# and doing integer arithmetic mod(n). This script finds the number of days
+
The arithmetic operators (done internally in floating point) are: <code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, and <code>%</code> (mod). Increment <code>++</code> and decrement <code>--</code> operators are also available, and so are the assignment operators: <code>+=</code>, <code>-=</code>, <code>*=</code>, <code>/=</code> and <code>%=</code>.
# elapsed since Jan 1, 1901. (from http://www.netlib.org/research/awkbookcode/ch3)
+
 
function daynum(y, m, d,   days, i, n)
+
*Assign 1 (a number) to x:
{   # 1 == Jan 1, 1901
+
x=1
 +
*Assign "smith" (a string) to x:
 +
x="smith"
 +
*Assign 7 (a number) to x:
 +
x = "3" + "4"
 +
*Print the sums of the first two fields:
 +
    {s1 += $1; s2 += $2}
 +
END {print s1,s2}
 +
 
 +
===Field variables===
 +
*Replace the first field with a sequence number:
 +
{$1 = NR; print}
 +
*Accumulate two fields into a third:
 +
{$1 = $2 + $3; print $0}
 +
*Assign a string to a field (here, replace the third field by "too big" when it is):
 +
{if($3 > 1000)
 +
    $3 = "too big"
 +
  print
 +
}
 +
*Field references may be numerical expressions:
 +
{print $i, $(i+1), $(i+n)}
 +
*Field treated as strings:
 +
if($1==$2) ...
 +
*Split the string <code>s</code> into <code>array[1],...,array[n]</code> (note: If the <code>sep</code> argument is provided, it is used as the field separator; otherwise <code>FS</code> is used as the separator):
 +
n = split(s,array,sep)
 +
 
 +
===String concatenation===
 +
*Return the length of the first three fields:
 +
length($1 $2 $3)
 +
*Print the two fields separated by " is ":
 +
print $1 " is " $2
 +
 
 +
===Arrays===
 +
*Assign the current input record to the <code>NR</code>-th element of the array <code>x</code>:
 +
x[NR] = $0
 +
*Increment counts for the named array elements, and print them at the end of the input:
 +
/apple/  {x["apple"]++}
 +
/orange/ {x["orange"]++}
 +
END      {print x["apple"], x["orange"]}
 +
 
 +
===Flow-of-control statements===
 +
The flow-of-control statements are: <code>if-else</code>, <code>while</code> and <code>for</code>. Also, the <code>break</code> statement causes an immediate exit from an enclosing <code>while</code> or <code>for</code>; the <code>continue</code> statement causes the next iteration to begin. The <code>next</code> statement causes ''Awk'' to skip immediately to the next record and begin scanning the patterns from the top. The statement <code>exit</code> causes the program to behave as if the end of the input had occurred.
 +
 
 +
*Print all input fields one per line:
 +
i = 1
 +
while(i <= NF) {
 +
    print $i;
 +
    ++i
 +
}
 +
*Print all input fields one per line (same as above):
 +
for(i=1; i<=NF; i++)
 +
    print $i
 +
*Perform ''statement'' with <code>i</code> set in turn to each subscript (index) of <code>array</code> (an associative array):
 +
for(i in array)
 +
    statement
 +
 
 +
==Self-contained ''Awk'' scripts==
 +
As with many other programming languages, self-contained ''Awk'' scripts can be constructed using the so-called "[[shebang]]" syntax.
 +
 
 +
For example, a Linux command called <code>hello.awk</code> that prints the string "Hello, world!" may be built by first creating a file named <code>hello.awk</code> containing the following lines:
 +
 
 +
#!/usr/bin/awk -f
 +
BEGIN {print "Hello, world!"; exit}
 +
 
 +
==Other examples==
 +
*Print a count of words (count words in the input, and print lines, words, and characters (like [[Wc (command)|wc]])):
 +
{ w += NF; c += length}
 +
END { print NR, w, c }
 +
*Sum first column of input:
 +
{ s += $1 }
 +
END { print s }
 +
*Calculate word frequencies (uses associative arrays):
 +
BEGIN { FS="[^a-zA-Z]+"}
 +
{ for (i=1; i<=NF; i++)
 +
    words[tolower($i)]++
 +
}
 +
END { for (i in words)
 +
    print i, words[i]
 +
}
 +
*Another way to calculate word frequencies (also uses associative arrays):
 +
awk '{for(i = 1; i <=NF; i++)
 +
        num[$i]++
 +
    }
 +
    END {for(word in num)
 +
        print word, num[word]
 +
    }
 +
    ' $*
 +
 
 +
===Associative arrays===
 +
Awk has built-in, language-level support for associative arrays.
 +
 
 +
For example:
 +
 
 +
phonebook["Sally Smart"] = "555-9999"
 +
phonebook["John Doe"] = "555-1212"
 +
phonebook["John Doe"] = "555-1337"
 +
 
 +
You can also loop through an associative array as follows:
 +
 
 +
for (name in phonebook) {
 +
    print name, " ", phonebook[name]
 +
}
 +
 
 +
You can also check if an element is in the associative array, and delete elements from an associative array.
 +
 
 +
Multi-dimensional associative arrays can be implemented in standard Awk using concatenation and e.g. SUBSEP:
 +
 
 +
{ # for every input line
 +
    multi[$1 SUBSEP $2]++;
 +
}
 +
#
 +
END {
 +
    for (x in multi) {
 +
        split(x, arr, SUBSEP);
 +
        print arr[1], arr[2], multi[x];
 +
    }
 +
  }
 +
 
 +
===Patrick's===
 +
''Note: Taken directly, with some modifications, from Patrick Hartigan's Awk page. The # is the comment character for Awk. 'field' means 'column'.''
 +
 
 +
*Print length of string in 2nd column
 +
awk '{print length($2)}' file
 +
 
 +
*Add up first column, print sum and average:
 +
      { s += $1 }
 +
END  { print "sum is", s, " average is", s/NR }
 +
 
 +
*Print fields in reverse order:
 +
awk '{ for (i = NF; i > 0; --i) print $i }' file
 +
 
 +
*Print the last line
 +
    {line = $0}
 +
END {print line}
 +
 
 +
*Print the total number of lines that contain the word Stine
 +
/Stine/ {nlines = nlines + 1}
 +
END    {print nlines}
 +
 
 +
*Print all lines between start/stop pairs:
 +
awk '/start/,/stop/' file
 +
 
 +
*Print all lines whose first field is different from previous one:
 +
awk '$1 != prev {print; prev = $1}' file
 +
 
 +
*Print column 3 if column 1 > column 2:
 +
awk '$1 > $2 {print $3}' file
 +
 
 +
*Print line if column 3 > column 2:
 +
awk '$3 > $2' file
 +
 
 +
*Count number of lines where col 3 > col 1
 +
awk '$3 > $1 {print i + "1"; i++}' file
 +
 
 +
*Print sequence number and then column 1 of file:
 +
awk '{print NR, $1}' file
 +
 
 +
*Print every line after erasing the 2nd field
 +
awk '{$2 = ""; print}' file
 +
 
 +
*Print "hi" 28 times
 +
yes | head -28 | awk '{ print "hi" }'
 +
 
 +
*Print "hi.0010" to "hi.0099":
 +
yes | head -90 | awk '{printf("hi.00%2.0f\n", NR+9)}'
 +
 
 +
*Print out 4 random numbers between 0 and 1
 +
yes | head -4 | awk '{print rand()}'
 +
#~OR~
 +
seq 1 4 | awk '{print rand()}'
 +
 
 +
*Print out 40 random integers modulo 5
 +
yes | head -40 | awk '{print int(100*rand()) % 5}'
 +
 
 +
*Replace every field by its absolute value
 +
{for (i = 1; i <= NF; i=i+1) if ($i < 0) $i = -$i; print}
 +
 
 +
*If you have another character that delimits fields, use the -F option. For example, to print out the phone number for "Brown" in the following file,
 +
000902|Stevens|Alice|333-242-2222|149092
 +
000901|Brown|Bob|532-382-0342|234023
 +
...
 +
the command would be
 +
awk -F"|" '$2=="Brown"{print $4}' filename
 +
 
 +
*Some looping commands (remove a bunch of print jobs from the queue):
 +
BEGIN{
 +
    for (i=875;i>833;i--){
 +
      printf "lprm -Plw %d\n", i
 +
    } exit
 +
}
 +
 
 +
*Formatted printouts are of the form <code>printf( "format\n", value1, value2, ... valueN)</code>. For an example,
 +
printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
 +
# %s = string
 +
# %-8s = 8 character string left justified
 +
# %.2f = number with 2 places after .
 +
# %6.2f = field 6 chars with 2 chars after .
 +
# \n is newline
 +
# \t is a tab
 +
 
 +
*Print frequency histogram of column of numbers:
 +
$2 <= 0.1 {na=na+1}
 +
($2 > 0.1) && ($2 <= 0.2) {nb = nb+1}
 +
($2 > 0.2) && ($2 <= 0.3) {nc = nc+1}
 +
($2 > 0.3) && ($2 <= 0.4) {nd = nd+1}
 +
($2 > 0.4) && ($2 <= 0.5) {ne = ne+1}
 +
($2 > 0.5) && ($2 <= 0.6) {nf = nf+1}
 +
($2 > 0.6) && ($2 <= 0.7) {ng = ng+1}
 +
($2 > 0.7) && ($2 <= 0.8) {nh = nh+1}
 +
($2 > 0.8) && ($2 <= 0.9) {ni = ni+1}
 +
($2 > 0.9) {nj = nj+1}
 +
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}
 +
 
 +
*Find maximum and minimum values present in column 1
 +
NR == 1 {m=$1; p=$1}
 +
$1 >= m {m = $1}
 +
$1 <= p {p = $1}
 +
END { print "Max = " m, "  Min = " p }
 +
 
 +
*Example of defining variables, multiple commands on one line
 +
NR == 1 {prev=$4; preva = $1; prevb = $2; n=0; sum=0}
 +
$4 != prev {print preva, prevb, prev, sum/n; n=0; sum=0; prev = $4; preva = $1; prevb = $2}
 +
$4 == prev {n++; sum=sum+$5/$6}
 +
END {print preva, prevb, prev, sum/n}
 +
 
 +
*Example of defining and using a function, inserting values into an array and doing integer arithmetic mod(n). This script finds the number of days elapsed since 1901-01-01. (from [http://www.netlib.org/research/awkbookcode/ch3 awkbookcode])
 +
function daynum(y, m, d, days, i, n) { # 1 == Jan 1, 1901
 
     split("31 28 31 30 31 30 31 31 30 31 30 31", days)
 
     split("31 28 31 30 31 30 31 31 30 31 30 31", days)
 
     # 365 days a year, plus one for each leap year
 
     # 365 days a year, plus one for each leap year
Line 125: Line 358:
 
         n += days[i]
 
         n += days[i]
 
     return n + d
 
     return n + d
}
+
}
    { print daynum($1, $2, $3) }
+
{print daynum($1, $2, $3)}
  
# Example of using substrings
+
*Example of using substrings (<code>substr($2,9,7)</code> picks out characters 9 through 15 of column 2):
# substr($2,9,7) picks out characters 9 thru 15 of column 2
+
{print "imarith", substr($2,1,7) " - " $3, "out."substr($2,5,3)}
{print "imarith", substr($2,1,7) " - " $3, "out."substr($2,5,3)}
+
{print "imarith", substr($2,9,7) " - " $3, "out."substr($2,13,3)}
{print "imarith", substr($2,9,7) " - " $3, "out."substr($2,13,3)}
+
{print "imarith", substr($2,17,7) " - " $3, "out."substr($2,21,3)}
{print "imarith", substr($2,17,7) " - " $3, "out."substr($2,21,3)}
+
{print "imarith", substr($2,25,7) " - " $3, "out."substr($2,29,3)}
{print "imarith", substr($2,25,7) " - " $3, "out."substr($2,29,3)}
+
 
 +
==Christoph's additions==
 +
$ sort -rn Ecoli.top.travers | gawk '{if($1 <= x.xx) {print $1}}' | wc -l
 +
$ gawk '{print $2}' Ecoli.top.travers | sort > Ecoli.top.travers.col2
 +
$ gawk '{print $2}' Ecoli.top.cai | sort > Ecoli.top.cai.col2
 +
$ comm -12 Ecoli.travers.col2 Ecoli.top.cai.col2 | wc -l
 +
 
 +
* Remove duplicate lines:
 +
$ echo -e 'foo\nbar\nbar\nbaz' | awk '!a[$0]++'
 +
 
 +
=== Date formats ===
 +
* Convert from "Jun 21 2011 5:00PM" to "2011-06-21":
 +
$ echo "Jun 21 2011 5:00PM" | awk '{"date \"+%F\" -d \""$1" "$2" "$3" \"" | getline var; print var}'
 +
 
 +
* Convert date formats from a CSV file:
 +
 
 +
$ cat infile.csv
 +
foo,1,25/03/2012,bob
 +
bar,2,14/11/2013,alice
 +
 
 +
awk 'BEGIN { FS = OFS = "," }
 +
      { split($3, date, /\//)
 +
        $3 = date[3] "-" date[2] "-" date[1]
 +
        print $0
 +
      }' infile.csv
 +
 
 +
===Unix sockets===
 +
 
 +
: see [[Unix sockets]] for more information.
 +
 
 +
* Get a list of local IP addresses and ports listening via UDP:
 +
$ netstat -planu | awk '/^udp / {print $4}'
 +
 
 +
Let's get the same info via the <code>/proc</code> pseudo-filesystem:
 +
 
 +
for h in $(awk 'NR>1{print $2}' /proc/net/udp); do
 +
  printf "%s:%d\n" $(printf "%d." $(echo ${h%:*}|sed 's/../0x& /g'|tr ' ' '\n'|tac)|sed 's/\.$/\n/') 0x${h#*:};
 +
done
 +
 
 +
Nice, but, let's do this in pure Awk:
 +
 
 +
awk 'NR>1{split($2, addr, ":"); for(i=0;i<4;i++){
 +
  printf("%d.",strtonum("0x" substr(addr[1],2*i+1,2)))}; print ":" strtonum("0x" addr[2]);}' /proc/net/udp
 +
 
 +
Almost. Just need to reverse the dotted decimals.
 +
 
 +
echo 0F01A8C0 | awk '{str = sprintf("0x%s", $0); ip = strtonum(str); \
 +
  printf ("%d.%d.%d.%d\t",rshift(and(ip,0x000000ff),00),
 +
                          rshift(and(ip,0x0000ff00),08),
 +
                          rshift(and(ip,0x00ff0000),16),
 +
                          rshift(and(ip,0xff000000),24))}'
 +
 
 +
Final, pure Awk:
 +
 
 +
awk 'NR>1{split($2, a, ":");patsplit(a[1],h,/.{2}/);for(i=4;i>0;i--){
 +
  h[i]=strtonum("0x" h[i]);};
 +
  printf("%d.%d.%d.%d:%d\n",h[4],h[3],h[2],h[1],strtonum("0x" a[2]));}' /proc/net/udp
 +
 
 +
===Scientific notation===
 +
 
 +
* Get the result of 90 billion / 22 million / 365:
 +
$ awk 'BEGIN { print 90e9/22e6/365 }' # => 11.208
 +
$ awk 'BEGIN { printf "%.4f", 90e9/22e6/365 }' # => 11.2080
 +
 
 +
* Other methods:
 +
<pre>
 +
$ perl -le 'print 90e9/22e6/365' # => 11.2079701120797
 +
$ perl -e 'printf "%.4f\n", 90e9/22e6/365' # => 11.2080
 +
$ echo "$(printf '%.4f' '90e9')/$(printf '%.4f' '22e6')/365" | bc -l # => 11.20797011207970112079
 
</pre>
 
</pre>
  
 +
* Regex:
 +
<pre>
 +
$ sed -E 's/([+-]?[0-9.]+)[eE]\+?(-?)([0-9]+)/(\1*10^\2\3)/g' <<<"$value"
 +
$ sed 's/\([+-]\{0,1\}[0-9]*\.\{0,1\}[0-9]\{1,\}\)[eE]+\{0,1\}\(-\{0,1\}\)\([0-9]\{1,\}\)/(\1*10^\2\3)/g' <<<"$value"
 +
$ perl -pe 's/([-\d.]+)e(?:\+|(-))?(\d+)/($1*10^$2$3)/gi' <<<"$value"
 +
</pre>
 +
 +
===Passing in Bash / environment variables===
 +
 +
* Check if version of Docker installed matched version required (as defined with Bash/env variable):
 +
$ docker_version=17.03.2-ce
 +
$ docker --version | \
 +
    awk -v dv="${docker_version}" '/version/{gsub(",","");{if($3==dv){print "true"}else{print "false"}}}'
 +
 +
==References==
 +
<references/>
 
==External links==
 
 
*[http://www.netlib.org/research/awkbookcode/ AWK Book Code] &mdash; contains all the programs from The AWK Programming Language, by Aho, Kernighan and Weinberger (Addison-Wesley, 1988). They have been packed by the bundle program found on page 81, and can be unpacked by the unbundle on page 82, also included here. A text editor will also do this pretty easily.
 

Latest revision as of 22:57, 26 October 2021

see: Main Awk article.
see also: Handy one-liners for Awk (v.0.25)

Basic examples

Note: The examples in this section have been taken directly from the original Awk paper.[1]

Printing

  • Print all input lines whose length exceeds 72 characters:
length > 72
  • Print all lines with an even number of fields:
NF % 2 == 0
  • Replace the first field of each line by its logarithm:
{ $1 = log($1); print }
  • Print the third and second columns of a table in that order:
{print $3,$2}
  • Print all input lines with an A, B, or C in the second field:
$2 ~ /A|B|C/
  • Print all lines in which the first field is different from the previous first field:
$1 != prev { print; prev = $1 }
  • Print each record preceded by the record number and the number of fields:
{print NR, NF, $0}
  • Write the first field, $1, on the file foo1, and the second field on the file foo2:
{print $1>"foo1"; print $2>"foo2"}
  • Append the output to the file foo:
print $1>>"foo"
  • Use the contents of field 2 as a file name:
print $1>$2
  • Print $1 as a floating point number 8 digits wide, with two after the decimal point, and $2 as a 10-digit long decimal number, followed by a newline:
printf "%8.2f %10ld\n", $1, $2
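
The pattern-action programs above are normally passed to awk as a single quoted argument. As a minimal sketch, assuming a POSIX awk on the PATH, the record-number example can be run from a shell like this (the function name `show_fields` and the sample input are hypothetical, chosen only for illustration):

```shell
# show_fields (hypothetical helper): print each record preceded by
# its record number (NR) and its number of fields (NF).
show_fields() {
    awk '{ print NR, NF, $0 }'
}

printf 'a b c\nd e\n' | show_fields
# prints:
# 1 3 a b c
# 2 2 d e
```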

BEGIN and END

Note: The special pattern BEGIN matches the beginning of the input, before the first record is read. The pattern END matches the end of the input, after the last record has been processed. BEGIN and END thus provide a way to gain control before and after processing, for initialization and wrapup.

  • As an example, the field separator can be set to a colon by:
BEGIN { FS = ":" }
... rest of program ...
  • Or, the input lines may be counted by:
END { print NR }

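In the original 1978 Awk, BEGIN had to be the first pattern and END the last. POSIX awk implementations do not impose this restriction, and a program may contain several BEGIN and END rules.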
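
The two uses of BEGIN and END described above can be combined in one program. The following is only a sketch, assuming a POSIX awk; the helper name `first_field_and_count` and the colon-separated sample lines are made up for the example:

```shell
# first_field_and_count (hypothetical helper): set the field separator
# to ":" before any input is read, print field 1 of every record, and
# print the total number of records after the last one.
first_field_and_count() {
    awk 'BEGIN { FS = ":" } { print $1 } END { print NR }'
}

printf 'root:x:0\nbin:x:1\n' | first_field_and_count
# prints:
# root
# bin
# 2
```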

Regular expressions

  • Print all lines which contain any occurrence of the name "smith":
/smith/
  • () (parentheses) — for grouping;
  • | (pipe) — for alternatives;
  • + — for "one or more"; and
  • ? — for "zero or one"
  • Print all lines which contain any of the names "Aho", "Weinberger", or "Kernighan", whether capitalized or not:
/[Aa]ho|[Ww]einberger|[Kk]ernighan/
  • Match any string of characters enclosed in slashes:
/\/.*\//
  • Print all lines where the first field matches "john" or "John" (also matches "Johnson", etc.):
$1 ~ /[jJ]ohn/
  • Match exactly "john" or "John":
$1 ~ /^[jJ]ohn$/
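
The difference between the anchored and unanchored forms is easiest to see on sample data. A small sketch, assuming a POSIX awk (the helper name `match_john` and the input lines are hypothetical):

```shell
# match_john (hypothetical helper): keep only records whose first
# field is exactly "john" or "John"; the ^ and $ anchors reject
# longer names such as "Johnson".
match_john() {
    awk '$1 ~ /^[jJ]ohn$/'
}

printf 'john 1\nJohnson 2\nJohn 3\n' | match_john
# prints:
# john 1
# John 3
```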

Relational expressions

The relational operators are: <, <=, ==, !=, >=, and >.

  • Select lines where the second field is at least 100 greater than the first field:
$2 >= $1 + 100
  • Print lines with an even number of fields:
NF % 2 == 0
  • Select lines that begin with an "s", "t", "u", etc.:
$1 >= "s"
  • Compare two fields (note: in the original Awk, fields are treated as strings in the absence of any other information; in POSIX awk the comparison is numeric when both fields look like numbers):
$1 > $2
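
Whether a relational expression compares numbers or strings matters for input such as 10 and 9. The sketch below assumes a POSIX awk, where fields that look numeric are compared numerically and appending an empty string forces a string comparison; the helper name `compare_fields` is hypothetical:

```shell
# compare_fields (hypothetical helper): for each record, report the
# result of $1 > $2 twice: once as awk decides by itself, and once
# with both operands forced to strings by concatenating "".
compare_fields() {
    awk '{
        auto = ($1 > $2) ? "yes" : "no"
        str  = (($1 "") > ($2 "")) ? "yes" : "no"
        print $1, $2, auto, str
    }'
}

printf '10 9\npear apple\n' | compare_fields
# prints:
# 10 9 yes no
# pear apple yes yes
```

The first line differs between the two columns because 10 is greater than 9 as a number, while the string "10" sorts before the string "9".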

Conditional expressions

A conditional expression is a special kind of expression with three operands. It allows you to use one expression's value to select one of two other expressions.

The conditional expression looks the same as in the C language (uses the ?: ternary operator):

selector ? if-true-exp : if-false-exp

There are three subexpressions. The first, selector, is always computed first. If it is "true" (not zero and not null) then if-true-exp is computed next and its value becomes the value of the whole expression. Otherwise, if-false-exp is computed next and its value becomes the value of the whole expression.

For example, this expression produces the absolute value of x:

x > 0 ? x : -x

Each time the conditional expression is computed, exactly one of if-true-exp and if-false-exp is computed; the other is ignored. This is important when the expressions contain side effects. For example, this conditional expression examines element i of either array a or array b, and increments i:

x == y ? a[i++] : b[i++]

This is guaranteed to increment i exactly once, because each time one or the other of the two increment expressions is executed, and the other is not.
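
As a runnable illustration of the absolute-value expression above, here is a minimal sketch assuming a POSIX awk (the helper name `abs_first` is hypothetical):

```shell
# abs_first (hypothetical helper): print the absolute value of the
# first field of every record using the ?: conditional expression.
abs_first() {
    awk '{ x = $1 + 0; print (x > 0 ? x : -x) }'
}

printf '%s\n' -3 4 | abs_first
# prints:
# 3
# 4
```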

Combination of patterns

A pattern can be any boolean combination of patterns, using the operators || (or), && (and), and ! (not).

  • Select lines where the first field begins with "s", but is not "smith":
$1 >= "s" && $1 < "t" && $1 != "smith"

Pattern ranges

  • Print all lines between start and stop:
/start/,/stop/
  • Perform an action for lines 100 through 200:
NR == 100, NR == 200 { ... }
  • Print all lines from bar to end of file (or STDIN stream):
$ awk '/bar/,0 {print $0}' <<< "$(echo -e "foo\nbar\nbaz")"
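
A range pattern includes both the line that matches the first pattern and the line that matches the second. A short sketch, assuming a POSIX awk (the helper name `from_start_to_stop` and the sample input are hypothetical):

```shell
# from_start_to_stop (hypothetical helper): print every record from
# one matching /start/ through the next one matching /stop/,
# inclusive of both boundary lines.
from_start_to_stop() {
    awk '/start/,/stop/'
}

printf 'a\nstart\nb\nstop\nc\n' | from_start_to_stop
# prints:
# start
# b
# stop
```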

Built-in functions

see: String functions for full list.

Arithmetic functions include: sqrt, log, exp, and int

  • Print each record, preceded by its length:
{print length, $0}
  • Print the length of its argument (here the length of the input line):
{print length($0), $0}
  • Print lines whose length is less than 10 or greater than 20:
length < 10 || length > 20

The function substr(s,m,n) produces the substring of s that begins at position m (origin 1) and is at most n characters long. If n is omitted, the substring goes to the end of s

The function index(s1,s2) returns the position where the string s2 occurs in s1, or zero if it does not.

The function sprintf(f,e1,e2,...) produces the value of the expressions e1, e2, etc., in the printf format specified by f

  • Set x to the string produced by formatting the values of $1 and $2:
x = sprintf("%8.2f %10ld", $1, $2)
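
The string functions described above can be tried without any input file by placing them in a BEGIN block. This is a sketch only, assuming a POSIX awk; the helper name `builtin_demo` and the sample values are made up, and `%10d` is used instead of `%10ld` to stay within the conversions every awk is known to accept:

```shell
# builtin_demo (hypothetical helper): exercise substr, index and
# sprintf on fixed sample values.
builtin_demo() {
    awk 'BEGIN {
        s = "hello world"
        print substr(s, 7)         # from position 7 to the end
        print substr(s, 1, 5)      # at most 5 characters from position 1
        print index(s, "world")    # position where "world" starts in s
        print index(s, "xyz")      # 0 because "xyz" does not occur
        x = sprintf("%8.2f %10d", 3.14159, 42)
        print "[" x "]"            # brackets make the padding visible
    }'
}

builtin_demo
# prints:
# world
# hello
# 7
# 0
# [    3.14         42]
```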

Variables, expressions, and assignments

The arithmetic operators (done internally in floating point) are: +, -, *, /, and % (mod). Increment ++ and decrement -- operators are also available, and so are the assignment operators: +=, -=, *=, /= and %=.

  • Assign 1 (a number) to x:
x=1
  • Assign "smith" (a string) to x:
x="smith"
  • Assign 7 (a number) to x:
x = "3" + "4"
  • Print the sums of the first two fields:
    {s1 += $1; s2 += $2}
END {print s1,s2}
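
The "3" + "4" example relies on awk converting strings to numbers when an arithmetic operator is applied. A minimal sketch contrasting this with concatenation, assuming a POSIX awk (the helper name `coerce_demo` is hypothetical):

```shell
# coerce_demo (hypothetical helper): show that + converts its string
# operands to numbers, while juxtaposition concatenates them.
coerce_demo() {
    awk 'BEGIN {
        x = "3" + "4"   # arithmetic: both strings become numbers
        y = "3" "4"     # concatenation: the result is a string
        print x, y
    }'
}

coerce_demo
# prints: 7 34
```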

Field variables

  • Replace the first field with a sequence number:
{$1 = NR; print}
  • Accumulate two fields into a third:
{$1 = $2 + $3; print $0}
  • Assign a string to a field (here, replace the third field by "too big" when it is):
{if($3 > 1000)
    $3 = "too big"
 print
}
  • Field references may be numerical expressions:
{print $i, $(i+1), $(i+n)}
  • Fields can also be treated as strings:
if($1==$2) ...
  • Split the string s into array[1],...,array[n] (note: If the sep argument is provided, it is used as the field separator; otherwise FS is used as the separator):
n = split(s,array,sep)
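
The split function returns the number of pieces and fills the array starting at index 1. A small sketch, assuming a POSIX awk (the helper name `split_date` and the sample date are hypothetical):

```shell
# split_date (hypothetical helper): split each input record on "-"
# and print the number of pieces followed by the first three pieces.
split_date() {
    awk '{ n = split($0, part, "-"); print n, part[1], part[2], part[3] }'
}

echo 2021-10-26 | split_date
# prints: 3 2021 10 26
```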

String concatenation

  • Return the combined length of the first three fields (concatenated without separators):
length($1 $2 $3)
  • Print the two fields separated by " is ":
print $1 " is " $2

Arrays

  • Assign the current input record to the NR-th element of the array x:
x[NR] = $0
  • Increment counts for the named array elements, and print them at the end of the input:
/apple/  {x["apple"]++}
/orange/ {x["orange"]++}
END      {print x["apple"], x["orange"]}

Flow-of-control statements

The flow-of-control statements are: if-else, while and for. Also, the break statement causes an immediate exit from an enclosing while or for; the continue statement causes the next iteration to begin. The next statement causes Awk to skip immediately to the next record and begin scanning the patterns from the top. The statement exit causes the program to behave as if the end of the input had occurred.

  • Print all input fields one per line:
i = 1
while(i <= NF) {
   print $i;
   ++i
}
  • Print all input fields one per line (same as above):
for(i=1; i<=NF; i++)
   print $i
  • Perform statement with i set in turn to each subscript (index) of array (an associative array):
for(i in array)
   statement
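
The four jump statements described at the top of this section (break, continue, next, and exit) can be sketched together as follows. The function name flow_demo and the marker words "-", "stop", and "quit" are assumptions made up for this illustration:

```shell
# Demonstrates next, exit, continue and break in one small program.
# flow_demo and the marker words are hypothetical.
flow_demo() {
  awk '
    /^#/         { next }   # skip comment records; resume matching on the next record
    $1 == "quit" { exit }   # stop reading input; the END action still runs
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "-")    continue   # skip placeholder fields
        if ($i == "stop") break      # ignore the rest of this record
        print $i
        n++
      }
    }
    END { print "fields printed:", n + 0 }'
}
printf '%s\n' '# header' 'a - b' 'c stop d' 'quit' 'e' | flow_demo
```

With this input the program prints a, b, and c, then "fields printed: 3". The record "e" is never read because exit fires first.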

Self-contained Awk scripts

As with many other programming languages, self-contained Awk scripts can be constructed using the so-called "shebang" syntax.

For example, a Linux command called hello.awk that prints the string "Hello, world!" may be built by first creating a file named hello.awk containing the following lines:

#!/usr/bin/awk -f
BEGIN {print "Hello, world!"; exit}
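
The file then needs to be made executable before it can be run as a command. A minimal sketch, assuming awk is on the PATH. The scratch directory held in workdir is an assumption made only to keep the example from touching the current directory:

```shell
# Create the script in a scratch directory (workdir is a hypothetical name).
workdir=$(mktemp -d)
cat > "$workdir/hello.awk" <<'EOF'
#!/usr/bin/awk -f
BEGIN {print "Hello, world!"; exit}
EOF
chmod +x "$workdir/hello.awk"   # mark the file as executable

# Running it through the interpreter works wherever awk is installed:
awk -f "$workdir/hello.awk"

# When awk really lives at /usr/bin/awk, the script can also be run directly:
#   "$workdir/hello.awk"
```

The -f in the shebang line matters: it tells awk to read the program from the script file rather than treat the file name as the program text.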

Other examples

  • Count the lines, words, and characters in the input (like wc):
{ w += NF; c += length}
END { print NR, w, c }
  • Sum first column of input:
{ s += $1 }
END { print s }
  • Calculate word frequencies (uses associative arrays):
BEGIN { FS="[^a-zA-Z]+"}
{ for (i=1; i<=NF; i++)
    words[tolower($i)]++
}
END { for (i in words)
    print i, words[i]
}
  • Another way to calculate word frequencies (also uses associative arrays):
awk '{for(i = 1; i <=NF; i++)
        num[$i]++
    }
    END {for(word in num)
        print word, num[word]
    }
    ' "$@"

Associative arrays

Awk has built-in, language-level support for associative arrays.

For example:

phonebook["Sally Smart"] = "555-9999"
phonebook["John Doe"] = "555-1212"
phonebook["John Doe"] = "555-1337"

You can also loop through an associative array as follows:

for (name in phonebook) {
    print name, " ", phonebook[name]
}

You can also check if an element is in the associative array, and delete elements from an associative array.
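
A short sketch of both operations, using the in operator for the membership test and the delete statement for removal. The function name phonebook_demo is an assumption. The entries reuse the phonebook example above:

```shell
# Membership test with "in" and removal with "delete".
# phonebook_demo is a hypothetical name.
phonebook_demo() {
  awk 'BEGIN {
    phonebook["Sally Smart"] = "555-9999"
    phonebook["John Doe"]    = "555-1212"

    if ("John Doe" in phonebook)        # true: the key exists
      print "John Doe: found"

    delete phonebook["John Doe"]        # remove a single element

    if (!("John Doe" in phonebook))     # the key is gone now
      print "John Doe: deleted"
    if ("Sally Smart" in phonebook)
      print "Sally Smart:", phonebook["Sally Smart"]
  }'
}
phonebook_demo
```

Testing with in is preferable to comparing phonebook[name] against "", because merely referencing an element creates it.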

Multi-dimensional associative arrays can be emulated in standard Awk by concatenating the indices with the built-in separator variable SUBSEP:

{ # for every input line
    multi[$1 SUBSEP $2]++;
}
#
END {
    for (x in multi) {
        split(x, arr, SUBSEP);
        print arr[1], arr[2], multi[x];
    }
 }
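
Awk also accepts a comma-separated subscript list, which joins the indices with SUBSEP behind the scenes. A minimal sketch of the same counting program in that notation. The function name pair_counts and the array names key and part are assumptions, and the output is piped through sort only because the order of a for-in loop is unspecified:

```shell
# Count (field1, field2) pairs using the comma subscript notation.
# pair_counts, key and part are hypothetical names.
pair_counts() {
  awk '{ multi[$1, $2]++ }            # same as multi[$1 SUBSEP $2]++
       END {
         for (key in multi) {
           split(key, part, SUBSEP)   # recover the two indices
           print part[1], part[2], multi[key]
         }
       }' | sort
}
printf '%s\n' 'a x' 'b y' 'a x' | pair_counts
```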

Patrick's

Note: Taken directly, with some modifications, from Patrick Hartigan's Awk page. The # is the comment character for Awk. 'field' means 'column'.

  • Print length of string in 2nd column
awk '{print length($2)}' file
  • Add up first column, print sum and average:
     { s += $1 }
END  { print "sum is", s, " average is", s/NR }
  • Print fields in reverse order:
awk '{ for (i = NF; i > 0; --i) print $i }' file
  • Print the last line
    {line = $0}
END {print line}
  • Print the total number of lines that contain the word Stine
/Stine/ {nlines = nlines + 1}
END     {print nlines}
  • Print all lines between start/stop pairs:
awk '/start/,/stop/' file
  • Print all lines whose first field is different from previous one:
awk '$1 != prev {print; prev = $1}' file
  • Print column 3 if column 1 > column 2:
awk '$1 > $2 {print $3}' file
  • Print line if column 3 > column 2:
awk '$3 > $2' file
  • Print a running count of the lines where column 3 > column 1:
awk '$3 > $1 {print i + "1"; i++}' file
  • Print sequence number and then column 1 of file:
awk '{print NR, $1}' file
  • Print every line after erasing the 2nd field
awk '{$2 = ""; print}' file
  • Print "hi" 28 times
yes | head -28 | awk '{ print "hi" }'
  • Print "hi.0010" to "hi.0099":
yes | head -90 | awk '{printf("hi.00%2.0f\n", NR+9)}'
  • Print out 4 random numbers between 0 and 1
yes | head -4 | awk '{print rand()}'
#~OR~
seq 1 4 | awk '{print rand()}'
  • Print out 40 random integers modulo 5
yes | head -40 | awk '{print int(100*rand()) % 5}'
  • Replace every field by its absolute value
{for (i = 1; i <= NF; i=i+1) if ($i < 0) $i = -$i; print}
  • If you have another character that delimits fields, use the -F option. For example, to print out the phone number for "Brown" in the following file,
000902|Stevens|Alice|333-242-2222|149092
000901|Brown|Bob|532-382-0342|234023
...

the command would be

awk -F"|" '$2=="Brown"{print $4}' filename
  • Some looping commands (remove a bunch of print jobs from the queue):
BEGIN{
   for (i=875;i>833;i--){
      printf "lprm -Plw %d\n", i
   } exit
}
  • Formatted printouts are of the form printf("format\n", value1, value2, ... valueN). For example:
printf("howdy %-8s What it is bro. %.2f\n", $1, $2*$3)
# %s = string
# %-8s = 8 character string left justified
# %.2f = number with 2 places after .
# %6.2f = field 6 chars with 2 chars after .
# \n is newline
# \t is a tab
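
To make the padding visible, the following sketch wraps each conversion in square brackets. The function name printf_demo and the sample input are assumptions for illustration only:

```shell
# Show how %-8s, %6.2f and %.2f pad and round their arguments.
# printf_demo and the sample input are hypothetical.
printf_demo() {
  awk '{ printf("[%-8s] [%6.2f] [%.2f]\n", $1, $2, $2) }'
}
echo "bob 3.14159" | printf_demo
# prints: [bob     ] [  3.14] [3.14]
```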
  • Print frequency histogram of column of numbers:
$2 <= 0.1 {na=na+1}
($2 > 0.1) && ($2 <= 0.2) {nb = nb+1}
($2 > 0.2) && ($2 <= 0.3) {nc = nc+1}
($2 > 0.3) && ($2 <= 0.4) {nd = nd+1}
($2 > 0.4) && ($2 <= 0.5) {ne = ne+1}
($2 > 0.5) && ($2 <= 0.6) {nf = nf+1}
($2 > 0.6) && ($2 <= 0.7) {ng = ng+1}
($2 > 0.7) && ($2 <= 0.8) {nh = nh+1}
($2 > 0.8) && ($2 <= 0.9) {ni = ni+1}
($2 > 0.9) {nj = nj+1}
END {print na, nb, nc, nd, ne, nf, ng, nh, ni, nj, NR}
  • Find maximum and minimum values present in column 1
NR == 1 {m=$1; p=$1}
$1 >= m {m = $1}
$1 <= p {p = $1}
END { print "Max = " m, "   Min = " p }
  • Example of defining variables, multiple commands on one line
NR == 1 {prev=$4; preva = $1; prevb = $2; n=0; sum=0}
$4 != prev {print preva, prevb, prev, sum/n; n=0; sum=0; prev = $4; preva = $1; prevb = $2}
$4 == prev {n++; sum=sum+$5/$6}
END {print preva, prevb, prev, sum/n}
  • Example of defining and using a function, inserting values into an array and doing integer arithmetic mod(n). This script finds the number of days elapsed since 1901-01-01. (from awkbookcode)
function daynum(y, m, d, days, i, n) { # 1 == Jan 1, 1901
   split("31 28 31 30 31 30 31 31 30 31 30 31", days)
   # 365 days a year, plus one for each leap year
   n = (y-1901) * 365 + int((y-1901)/4)
   if (y % 4 == 0) # leap year from 1901 to 2099
       days[2]++
   for (i = 1; i < m; i++)
       n += days[i]
   return n + d
}
{print daynum($1, $2, $3)}
  • Example of using substrings (substr($2,9,7) picks out characters 9 through 15 of column 2):
{print "imarith", substr($2,1,7) " - " $3, "out."substr($2,5,3)}
{print "imarith", substr($2,9,7) " - " $3, "out."substr($2,13,3)}
{print "imarith", substr($2,17,7) " - " $3, "out."substr($2,21,3)}
{print "imarith", substr($2,25,7) " - " $3, "out."substr($2,29,3)}

Christoph's additions

$ sort -rn Ecoli.top.travers | gawk '{if($1 <= x.xx) {print $1}}' | wc -l
$ gawk '{print $2}' Ecoli.top.travers | sort > Ecoli.top.travers.col2
$ gawk '{print $2}' Ecoli.top.cai | sort > Ecoli.top.cai.col2
$ comm -12 Ecoli.top.travers.col2 Ecoli.top.cai.col2 | wc -l
  • Remove duplicate lines:
$ echo -e 'foo\nbar\nbar\nbaz' | awk '!a[$0]++'

Date formats

  • Convert from "Jun 21 2011 5:00PM" to "2011-06-21":
$ echo "Jun 21 2011 5:00PM" | awk '{"date \"+%F\" -d \""$1" "$2" "$3" \"" | getline var; print var}'
  • Convert date formats from a CSV file:
$ cat infile.csv
foo,1,25/03/2012,bob
bar,2,14/11/2013,alice
$ awk 'BEGIN { FS = OFS = "," }
       { split($3, date, /\//)
         $3 = date[3] "-" date[2] "-" date[1]
         print $0
       }' infile.csv

Unix sockets

see: Unix sockets for more information.
  • Get a list of local IP addresses and ports listening via UDP:
$ netstat -planu | awk '/^udp / {print $4}'

Let's get the same info via the /proc pseudo-filesystem:

for h in $(awk 'NR>1{print $2}' /proc/net/udp); do
  printf "%s:%d\n" $(printf "%d." $(echo ${h%:*}|sed 's/../0x& /g'|tr ' ' '\n'|tac)|sed 's/\.$/\n/') 0x${h#*:};
done

Nice, but let's do this in pure Awk (strtonum is a gawk extension, so this requires GNU Awk):

awk 'NR>1{split($2, addr, ":"); for(i=0;i<4;i++){
  printf("%d.",strtonum("0x" substr(addr[1],2*i+1,2)))}; print ":" strtonum("0x" addr[2]);}' /proc/net/udp

Almost. Just need to reverse the dotted decimals.

echo 0F01A8C0 | awk '{str = sprintf("0x%s", $0); ip = strtonum(str); \
  printf ("%d.%d.%d.%d\t",rshift(and(ip,0x000000ff),00),
                          rshift(and(ip,0x0000ff00),08),
                          rshift(and(ip,0x00ff0000),16),
                          rshift(and(ip,0xff000000),24))}'

Final version, in pure Awk (patsplit and strtonum both require gawk):

awk 'NR>1{split($2, a, ":");patsplit(a[1],h,/.{2}/);for(i=4;i>0;i--){
  h[i]=strtonum("0x" h[i]);};
  printf("%d.%d.%d.%d:%d\n",h[4],h[3],h[2],h[1],strtonum("0x" a[2]));}' /proc/net/udp
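
Since strtonum, patsplit, and, and rshift are gawk extensions, the versions above will not run under other Awk implementations. The following is a rough sketch of a portable alternative that does the hex conversion by hand. The names hex_to_ip and hex2dec are assumptions, and the function reads ADDRESS:PORT values such as the second column of /proc/net/udp on standard input:

```shell
# Convert little-endian hex ADDRESS:PORT pairs (as found in /proc/net/udp)
# to dotted-decimal notation without gawk extensions.
# hex_to_ip and hex2dec are hypothetical names.
hex_to_ip() {
  awk '
    # Convert a hexadecimal string to a number, one digit at a time.
    function hex2dec(h,    i, n) {
      h = toupper(h)
      n = 0
      for (i = 1; i <= length(h); i++)
        n = n * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
      return n
    }
    {
      split($1, a, ":")
      # The address bytes are stored in reverse order, so read them back to front.
      printf "%d.%d.%d.%d:%d\n",
        hex2dec(substr(a[1], 7, 2)), hex2dec(substr(a[1], 5, 2)),
        hex2dec(substr(a[1], 3, 2)), hex2dec(substr(a[1], 1, 2)),
        hex2dec(a[2])
    }'
}
echo "0F01A8C0:0035" | hex_to_ip
# prints: 192.168.1.15:53
```

It could be fed from the real file with something like: awk 'NR>1{print $2}' /proc/net/udp | hex_to_ip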

Scientific notation

  • Get the result of 90 billion / 22 million / 365:
$ awk 'BEGIN { print 90e9/22e6/365 }' # => 11.208
$ awk 'BEGIN { printf "%.4f", 90e9/22e6/365 }' # => 11.2080
  • Other methods:
$ perl -le 'print 90e9/22e6/365' # => 11.2079701120797
$ perl -e 'printf "%.4f\n", 90e9/22e6/365' # => 11.2080
$ echo "$(printf '%.4f' '90e9')/$(printf '%.4f' '22e6')/365" | bc -l # => 11.20797011207970112079
  • Regex:
$ sed -E 's/([+-]?[0-9.]+)[eE]\+?(-?)([0-9]+)/(\1*10^\2\3)/g' <<<"$value"
$ sed 's/\([+-]\{0,1\}[0-9]*\.\{0,1\}[0-9]\{1,\}\)[eE]+\{0,1\}\(-\{0,1\}\)\([0-9]\{1,\}\)/(\1*10^\2\3)/g' <<<"$value"
$ perl -pe 's/([-\d.]+)e(?:\+|(-))?(\d+)/($1*10^$2$3)/gi' <<<"$value"
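
When the values arrive as input fields rather than literals, Awk should accept the scientific notation directly, so no regex rewriting is needed before doing arithmetic. A small sketch, assuming the two operands are in the first two fields. The function name per_day is made up for this example:

```shell
# Divide field 1 by field 2 and then by 365; the fields may be written
# in scientific notation. per_day is a hypothetical name.
per_day() {
  awk '{ printf "%.4f\n", $1 / $2 / 365 }'
}
echo "90e9 22e6" | per_day
# prints: 11.2080
```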

Passing in Bash / environment variables

  • Check whether the installed version of Docker matches the required version (as defined in a Bash/environment variable):
$ docker_version=17.03.2-ce
$ docker --version | \
   awk -v dv="${docker_version}" '/version/{gsub(",","");{if($3==dv){print "true"}else{print "false"}}}'
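
The two usual ways of handing a shell value to Awk are the -v option (as above) and the built-in ENVIRON array. The sketch below shows both. The function names check_version and check_version_env, the variable name required, and the sample "docker --version" text are assumptions for illustration. Because values that look like numbers can end up being compared numerically (so 17.10 could match 17.1), the sketch appends "" to force a string comparison:

```shell
# Variant 1: pass the required version with -v.
# $1 = required version, $2 = text as printed by `docker --version`.
check_version() {
  printf '%s\n' "$2" |
    awk -v dv="$1" '/version/ {
      gsub(",", "")                     # drop the comma after the version
      if (($3 "") == (dv "")) print "true"; else print "false"
    }'
}

# Variant 2: read the value from the environment through ENVIRON.
# $1 = required version; the version text is read from standard input.
check_version_env() {
  required="$1" awk '/version/ {
    gsub(",", "")
    if (($3 "") == (ENVIRON["required"] "")) print "true"; else print "false"
  }'
}

check_version 17.03.2-ce "Docker version 17.03.2-ce, build abc1234"
echo "Docker version 17.03.2-ce, build abc1234" | check_version_env 17.03.2-ce
```

The -v form is handy for one or two values. ENVIRON avoids the escape-sequence processing that Awk applies to -v assignments, which matters when the value may contain backslashes.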

References

  1. Aho AV, Kernighan BW, Weinberger PJ (1978). "Awk - A pattern scanning and processing language". Second Edition, Bell Laboratories, 8 pp.

External links

  • AWK Book Code — contains all the programs from The AWK Programming Language, by Aho, Kernighan and Weinberger (Addison-Wesley, 1988). They have been packed by the bundle program found on page 81, and can be unpacked by the unbundle on page 82, also included here. A text editor will also do this pretty easily.
  • Patrick Hartigan's awk page