Here document

From Christoph's Personal Wiki

In computing, a here document (here-document, here-text, heredoc, hereis, here-string, or here-script) is a file literal or input stream literal: it is a section of a source code file that is treated as if it were a separate file. The term is also used for a form of multiline string literals that use similar syntax, preserving line breaks and other whitespace (including indentation) in the text.

File literals

Narrowly speaking, here documents are file literals or stream literals. These originate in the Unix shell, though similar facilities are available in some other languages.

Unix shells

Here documents are available in many Unix shells. In the following example, text is passed to the tr command (transliterating lower to upper-case) using a here document. This could be in a shell file or entered interactively at a prompt.

$ LANG=C tr a-z A-Z << END_TEXT
> one two three
> four five six
> END_TEXT
ONE TWO THREE
FOUR FIVE SIX

Here END_TEXT is the delimiting identifier: it marks the start and end of the here document. The redirect and the delimiting identifier do not need to be separated by a space: <<END_TEXT and << END_TEXT work equally well.

By default, the behaviour is largely identical to the contents of double quotes: variable names are replaced by their values, commands within backticks are evaluated, etc.

$ cat << EOF
> \$ Working dir "$PWD" `pwd`
> EOF
$ Working dir "/home/user" /home/user
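
In a script, the same expansion is convenient for filling in a small text template. The following is a minimal sketch only; the function name render_greeting and its argument are made up for illustration.

```shell
# Hypothetical helper: fills a two-line template via here-document expansion.
render_greeting() {
    name=$1
    cat <<EOF
Hello, $name!
Two plus three is $((2 + 3)), and \$name stays literal.
EOF
}

render_greeting world
```

The variable and the arithmetic expression are expanded, while the backslash keeps the second $name literal, so the second output line reads "Two plus three is 5, and $name stays literal."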

This can be disabled by quoting any part of the label; the here document is then ended by the unquoted value of the label. "Quoting" includes escaping: if \EOF is used, the label is quoted, so variable interpolation does not occur, and the here document ends with EOF; if \\EOF is used, the label is quoted and the here document ends with \EOF. This perhaps surprising behaviour is easily implemented in a shell: the tokenizer simply records that a token was quoted (during the evaluation phase of lexical analysis), without needing to preserve the original, quoted value.

One application is to use \' as the starting delimiter and thus ' as the ending delimiter, which is similar to a multiline string literal but strips the starting and ending line breaks. With a quoted label, the behaviour is essentially identical to that of contents enclosed in single quotes. For example, with the label in single quotes:

$ cat << 'EOF'
> \$ Working dir "$PWD" `pwd`
> EOF
\$ Working dir "$PWD" `pwd`

Double quotes may also be used, but this is subject to confusion, because expansion does occur in a double-quoted string, but does not occur in a here document with double-quoted delimiter. Single- and double-quoted delimiters are distinguished in some other languages, notably Perl (see below), where behaviour parallels the corresponding string quoting.
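
The effect of the different ways of writing the label can be compared side by side. This is an illustrative sketch; the function name demo_delimiters and the variable who are invented for the example.

```shell
# Hypothetical demo: the same body under three differently written delimiters.
demo_delimiters() {
    who=world   # made-up variable, used only to make expansion visible

    # Unquoted delimiter: $who is expanded.
    cat <<EOF
unquoted: $who
EOF

    # Backslash-escaped delimiter: no expansion; the body still ends at EOF.
    cat <<\EOF
escaped: $who
EOF

    # Double-quoted delimiter: no expansion either, despite the double quotes.
    cat <<"EOF"
double-quoted: $who
EOF
}

demo_delimiters
```

Only the first line of output contains "world"; the other two print $who unchanged.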

Appending a minus sign to the << (i.e. <<-) has the effect that leading tabs are ignored (this is not supported in csh or tcsh). This allows here documents in shell scripts to be indented (primarily for alignment with existing indentation) without changing their value. Note that while tabs can typically be entered directly in editors, at the command line they are typically entered with Ctrl-V + Tab instead, because of tab completion. In the example below, the tabs are actual tab characters, so the example can be copied and pasted.

A script containing:

LANG=C tr a-z A-Z <<- END_TEXT
Here doc with <<-
 A single space character (i.e. 0x20 )  is at the beginning of this line
	This line begins with a single TAB character i.e 0x09  as does the next line
	END_TEXT

echo The intended end was before this line 
echo and these were not processed by tr
echo +++++++++++++++

LANG=C tr a-z A-Z << END_TEXT
Here doc with <<
 A single space character (i.e. 0x20 )  is at the beginning of this line
	This line begins with a single TAB character i.e 0x09 as does the next line
	END_TEXT

echo The intended end was before this line, 
echo but because the line with the delimiting Identifier began with a TAB it was NOT recognized and
echo the tr command continued processing.

produces:

HERE DOC WITH <<-
 A SINGLE SPACE CHARACTER (I.E. 0X20 )  IS AT THE BEGINNING OF THIS LINE
THIS LINE BEGINS WITH A SINGLE TAB CHARACTER I.E 0X09  AS DOES THE NEXT LINE
The intended end was before this line
and these were not processed by tr
+++++++++++++++
HERE DOC WITH <<
 A SINGLE SPACE CHARACTER (I.E. 0X20 )  IS AT THE BEGINNING OF THIS LINE
	THIS LINE BEGINS WITH A SINGLE TAB CHARACTER I.E 0X09 AS DOES THE NEXT LINE
	END_TEXT

ECHO THE INTENDED END WAS BEFORE THIS LINE, 
ECHO BUT BECAUSE THE LINE WITH THE DELIMITING IDENTIFIER BEGAN WITH A TAB IT WAS NOT RECOGNIZED AND
ECHO THE TR COMMAND CONTINUED PROCESSING.
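
Because literal tab characters are easily lost when text is copied, the behaviour of <<- can also be checked with the tabs written explicitly as \t. This is only a sketch; the function name run_indented_heredoc is hypothetical, and the small script is generated with printf and piped to sh.

```shell
# Hypothetical helper: builds a short script with printf so that the leading
# TAB characters (\t) are explicit, then runs that script with sh.
run_indented_heredoc() {
    printf 'cat <<-EOF\n\tbody line\n\t  tab stripped, spaces kept\n\tEOF\n' | sh
}

run_indented_heredoc
```

The leading tab is removed from every line, including the line holding the delimiter, while the two spaces after the tab on the second body line are preserved.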

Another use is to output to a file:

cat << EOF > ~/testFile001
   3 spaces precede this text.
	A single tab character is at the beginning of this line.
Nothing precedes this text
EOF
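
In scripts this pattern is often wrapped in a function that generates a small file. A minimal sketch, assuming mktemp is available; the function name write_settings, the keys host and port, and the value 8080 are all invented for illustration.

```shell
# Hypothetical helper: writes a two-line settings file.
#   $1 = path of the file to create, $2 = port number to record
write_settings() {
    cat > "$1" <<EOF
host=localhost
port=$2
EOF
}

settings_file=$(mktemp)          # temporary file, removed again below
write_settings "$settings_file" 8080
cat "$settings_file"
rm -f "$settings_file"
```

Since the delimiter is unquoted, $2 is expanded when the file is written, so the file contains the literal port number.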

Here strings

A here string (available in bash, ksh, and zsh) is syntactically similar: it consists of <<< followed by a word (a sequence treated as a unit by the shell, in this context generally a string literal), and redirects that word to the command's standard input. The word is written in ordinary shell syntax; the only special syntax is the redirection operator. A here string is thus an ordinary string used for input redirection, not a special kind of string.

A single word need not be quoted:

$ LANG=C tr a-z A-Z <<< one
ONE

A string containing spaces must be quoted:

$ LANG=C tr a-z A-Z <<< 'one two three'
ONE TWO THREE

This could also be written as:

$ foo='one two three'
$ LANG=C tr a-z A-Z <<< "$foo"
ONE TWO THREE

Multiline strings are also accepted:

$ LANG=C tr a-z A-Z <<< 'one
> two three'
ONE
TWO THREE

Note that leading and trailing newlines, if present, are included:

$ LANG=C tr a-z A-Z <<< '
> one
> two three
> '

ONE
TWO THREE

$

The key difference is that in a here document the delimiters are on separate lines, and the newlines immediately after the opening delimiter and before the closing delimiter are not part of the content. Here strings do not use delimiters, so any leading or trailing newlines inside the quotes are kept.
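
The two forms can be put side by side on the same input. This is a sketch assuming a shell with here strings (bash, ksh, or zsh); the function names upper_heredoc and upper_herestring are hypothetical.

```shell
# Both hypothetical functions feed the same two lines of text to tr.

# Here document: the text sits between the delimiter lines.
upper_heredoc() {
    LANG=C tr a-z A-Z <<EOF
one
two three
EOF
}

# Here string: the text is an ordinary quoted word after <<<.
upper_herestring() {
    LANG=C tr a-z A-Z <<< 'one
two three'
}

upper_heredoc
upper_herestring
```

Both calls print the same two lines, ONE and TWO THREE.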

Here strings are particularly useful for commands that often take short input, such as the calculator bc:

$ bc <<< 2^10
1024

Or, in a loop:
$ for i in $(seq 1 10); do bc <<< 2^$i; done
2
4
8
16
32
64
128
256
512
1024

The same effect can also be achieved (with the order reversed) by piping the output of the echo command, as in:

$ echo 'one two three' | LANG=C tr a-z A-Z
ONE TWO THREE

However, here strings are particularly useful when the last command needs to run in the current process, as is the case with the read builtin:

$ echo 'one two three' | read -r a b c
$ echo "$a $b $c"

yields nothing, while

$ read -r a b c <<< 'one two three'
$ echo "$a $b $c"
one two three

This happens because, in the first example, the pipe causes read to run in a subprocess (the default behaviour in bash), so it cannot affect the environment of the parent process.
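
A typical use of read with a here string is splitting one record into fields. The following sketch assumes bash; the function name parse_record and the sample record are made up for illustration.

```shell
# Hypothetical helper: splits a colon-separated record (in the style of a
# passwd entry) into fields and prints a short summary.
parse_record() {
    # IFS=: applies only to this read. Because read runs in the current
    # shell, the variables it sets are still available on the next line.
    IFS=: read -r user pw uid gid gecos home shell <<< "$1"
    echo "$user has uid $uid and home $home"
}

parse_record 'alice:x:1001:1001:Alice Example:/home/alice:/bin/bash'
```

This prints "alice has uid 1001 and home /home/alice". Had the record been piped into read instead, the variables would be empty by the time echo runs in bash.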