TAB file format

From Christoph's Personal Wiki

A tab file is a file format proposed by Rasmus Wernersson for annotating biological sequences.[1] It was inspired by programs and concepts developed by Søren Brunak, Kristoffer Rapacki, and Lars Juhl Jensen.

This tab file format is based on the original format proposed by Philip Lijnzaad (see: Bio::SeqIO::tab). It is a nearly raw sequence file input/output stream.

  • Reads/writes



gb2tab man page

Convert from GenBank to tab file.

gb2tab v 1.2.1 (command line program behind the FeatureExtract webserver) — extract sequence and annotation (intron/exon etc) from GenBank format files.

Synopsis / syntax

gb2tab [-f 'CDS,mRNA,...'] [options...] [files...]


gb2tab is a tool for extracting sequence and annotation (such as intron / exon structure) information from GenBank format files.

This tool handles overlapping genes gracefully.

If no files are specified, input is read from STDIN. Several GenBank files can be concatenated and sent to STDIN.

The extracted sequences are streamed to STDOUT with one entry per line in the following format (tab separated):

	name	seq	ann	com
	name:	The sequence id. See the --genename, --locustag and 
		--entryname options below.
	seq:	The DNA sequence itself. UPPERCASE is used for the
		main sequence, lowercase is used for flanks (if any).
	ann:	Single letter sequence annotation. Position for position
		the annotation describes the DNA sequence: the first
		letter in the annotation describes the annotation for
		the first position in the DNA sequence, and so forth.
		The annotation code is defined as follows:
		(	First position
		E	Exon
		T	tRNA exonic region
		R	rRNA / generic RNA exonic region
		P	Promoter
		X	Unknown feature type
		)	Last position
		?	Ambiguous first or last position
		[	First UTR region position
		3	3'UTR
		5	5'UTR
		]	Last UTR region position		
			See also the --block-chars option, for a further
			explanation of feature blocks and exonic regions.
		D	First intron position (donor site)
		I	Intron position
		A	Last intron position (acceptor site)
		<	Start of frameshift
		F	Frameshift
		>	End of frameshift
		.	NULL annotation (no annotation).
		+	Other feature defined on the SAME STRAND
			as the current entry.
		-	Other feature defined on the OPPOSITE STRAND
			relative to the current entry.
		#	Multiple or overlapping features.

		A..Z:	Feature on the SAME STRAND as the current entry.
		a..z:	Feature on the OPPOSITE STRAND as the current entry.
			See the -e option for a description of which features 
			are annotated in the flanking regions.
			The options --flank_ann_full (default) and 
			--flank_ann_presence determine if full annotation 
			(+upper/lower case) or annotation of presence/absence 
			(+/- and #) is used.

	com:	Comments (free text). All text, extra information etc.
		defined in the GenBank files are concatenated into a single
		comment field.
		The following extra information is added by this program:
		*) GenBank accession ID.
		*) Source (organism)
		*) Feature type (e.g. "CDS" or "rRNA")
		*) Strand ("+" or "-").
		*) Spliced DNA sequence. Simply the DNA sequence defined
		   by the JOIN statement. 
		   This is provided for two reasons. 1) To overcome negative
		   frameshifts. 2) As an easy way of extracting the sequence
		   of the spliced product. See also the --splic_always and
		   --flank_splic options below.
		*) Spliced DNA annotation.


The following options are available.

	-f X, --feature_type=X
		Define which feature type(s) to extract.
		Default is 'CDS' which is the most general way
		to annotate protein coding genes.
		Multiple features can be selected by specifying a comma
		separated list - for example "CDS,rRNA,tRNA".
		Special keywords:
		ALL: 	Using the keyword "ALL", will extend the list to all
			feature types listed in each GenBank file. 
			Please note: This can occasionally lead to problems 
			in files that use multiple feature types to cover the
			same actual feature (e.g. using both "gene" and "CDS").
		MOST:	Covers the following feature types:
		The keyword can also be included in the user specified list.
		For example "MOST,novel_feature" will construct a list containing 
		the list mentioned above + the new feature type "novel_feature".

	-e X, --flank_features=X
		Define which features to annotate in flanking regions.
		The scheme for specifying features is the same as in the
		-f option (see above).
		The default value is "MOST". 
		If no flanking regions are requested (see options -b and -a
		below) this option is ignored.
	-i, --intergenic
		Extract intergenic regions. When this option is used, all
		regions in between the features defined with the -f option
		are extracted rather than the features themselves.
		Please note that features specified using the -e option
		may be present in the intergenic regions.
		Intergenic regions will always be extracted from the "+" strand.
	-s, --splice
		For intron containing sequences output the spliced version as 
		the main result (normally this information goes into the 
		comments). If this option is used, the full length product will
		be added to the comments instead.
		Using this option will force the inclusion of flanks (if any) 
		in the spliced product. See also option --flank_splic.
	-x, --spliced_only
		Only output intron containing sequences. Can be used in 
		combination with the -s option. 

	-b X, --flank_before=X
		Extract X basepairs upstream of each sequence.
	-a X, --flank_after=X
		Extract X basepairs downstream of each sequence.
	-h, --help
		Print this help page and exit.
	-n, --dry-run
		Run through all extraction steps but do not output any
		data. Useful for debugging bad GenBank files in combination
		with the verbose option.
	-v, --verbose
		Output messages about progress, details about the GenBank
		file etc. to STDERR. Useful for finding errors.

	-q, --quiet
		Suppress all warnings, error messages and verbose info.
		The exit value will still be non-zero if an error is
		encountered.
	--flank_ann_presence
		Annotate presence/absence and relative strandness of 
		features in the flanking regions.
		Features - of any kind - are annotated with "+" if they are
		on the SAME STRAND as the extracted feature, and "-" if they
		are on the OPPOSITE STRAND. "#" marks regions covered by
		multiple features.
		This option is very useful for use with OligoWiz-2.0.
	--flank_ann_full
		Default: Include full-featured annotation in the flanking regions.
		Features on the SAME STRAND as the extracted feature are
		uppercase - features on the OPPOSITE STRAND are lowercase.
		In case of regions covered by multiple features, the
		feature defined FIRST by the -e option has preference.
	--flank_splic
		Also include flanking regions in the spliced product.
		Default is to ignore flanks.
	--splic_always
		Include spliced product for ALL entries.
		Default is to only print spliced product information for 
		intron/frameshift containing entries.

		"Introns" shorter than X bp (default 15bp) are considered 
		frameshifts. This includes negative frameshifts.
	--block-chars
		Specify which characters to use for annotation of the 
		extracted feature types. For spliced features (e.g. CDS)
		each exonic block is annotated using the specified characters.
		Three characters must be supplied (for each feature type): 
		First position, internal positions, last position.
		For example the string "(E)" will cause a 10bp feature block 
		(i.e. a CDS exon block) to be annotated like this: (EEEEEEEE)
		Introns are filled in as DII..IIA
		By default the program determines the annotation chars
		based on the type of feature being extracted:
		(E)	CDS, mRNA
		(T)	tRNA
		(R)	rRNA, snoRNA, snRNA, misc_RNA, scRNA
		(P)	promotor
		[5]	5'UTR
		[3]	3'UTR
		(X)	Everything else.
		This table can be expanded (and overwritten) by supplying a
		list of relations between feature type and block chars.
		For example:
	--genename
		Try to extract the gene name from the /gene="xxxx"
		tag (this is usually the classical gene name, e.g. HTA1).
		If this is not possible fall back to 1) locustag
		or 2) entryname (see below).
	--locustag
		Try to extract the locus tag (usually the systematic
		gene name) from the /locus_tag="xxxx" tag. Fall back
		to using the entryname if not possible (see below).
		This is the default behavior.
	--entryname
		Use the main GenBank entry name (the "LOCUS" name) as
		the base of the sequence names.
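
The block-character scheme described in the options above (first position, internal positions, last position) can be illustrated with a short Python sketch. This is not gb2tab's own code: `annotate_block` and `annotate_feature` are hypothetical names, the default table is copied from the option text as printed, and the sketch assumes every block is at least 2 bp long and that gaps shorter than the frameshift threshold are filled with the "<F>" characters from the annotation legend. Negative frameshifts are not modelled.

```python
# Default block characters, copied from the table in the option text above.
DEFAULT_BLOCK_CHARS = {
    "CDS": "(E)", "mRNA": "(E)",
    "tRNA": "(T)",
    "rRNA": "(R)", "snoRNA": "(R)", "snRNA": "(R)",
    "misc_RNA": "(R)", "scRNA": "(R)",
    "promotor": "(P)",
    "5'UTR": "[5]",
    "3'UTR": "[3]",
}
FALLBACK_CHARS = "(X)"     # "Everything else"
INTRON_CHARS = "DIA"       # donor site, intron positions, acceptor site
FRAMESHIFT_CHARS = "<F>"   # start, internal, end of frameshift


def annotate_block(length, chars):
    """Annotate one block: first char, internal chars, last char."""
    if len(chars) != 3:
        raise ValueError("exactly three block characters are required")
    if length < 2:
        # Simplification of this sketch: 1 bp blocks are not handled.
        raise ValueError("blocks shorter than 2 bp are not handled")
    first, inner, last = chars
    return first + inner * (length - 2) + last


def annotate_feature(exon_lengths, gap_lengths, feature_type="CDS",
                     frameshift_below=15):
    """Build the annotation string for alternating exon blocks and gaps.

    Gaps shorter than `frameshift_below` bp are annotated as frameshifts,
    longer ones as introns.
    """
    if len(gap_lengths) != len(exon_lengths) - 1:
        raise ValueError("need exactly one gap between consecutive exons")
    chars = DEFAULT_BLOCK_CHARS.get(feature_type, FALLBACK_CHARS)
    parts = []
    for i, exon_len in enumerate(exon_lengths):
        parts.append(annotate_block(exon_len, chars))
        if i < len(gap_lengths):
            gap = gap_lengths[i]
            gap_chars = FRAMESHIFT_CHARS if gap < frameshift_below else INTRON_CHARS
            parts.append(annotate_block(gap, gap_chars))
    return "".join(parts)


print(annotate_block(10, "(E)"))       # the 10 bp example from the text
print(annotate_feature([4, 5], [16]))  # two exons around a 16 bp intron
```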

Known issues

This program DOES NOT support entries that span multiple GenBank files. It is very unlikely this will ever be supported. (Note that the webserver version automatically expands reference GenBank entries to the listed subentries.)

Example output

Example usage

Because each entry sits on a single line, the tab file format is very convenient for string matching with regular expressions. For example, to find the position of the nucleotide string "TTTAAGAGGGG" in the file (found on the FeatureExtract 1.2 Server website), we could issue the following command:

cat |\
gawk '{split($0,s,"\t"); print s[2]}' |\
gawk '{print index($0,"TTTAAGAGGGG")}' -

It should return "333", the position of the first "T" of the match within the sequence.
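
The same lookup can be sketched in Python. The function name `motif_positions` and the two demo lines are made up for illustration. Like awk's `index()`, the sketch reports a 1-based position and 0 when the string is not found, and the match is case-sensitive, so lowercase flanks do not match an uppercase query.

```python
def motif_positions(lines, motif):
    """Yield (name, position) for each tab-format line.

    The position is 1-based; 0 means the motif was not found,
    mirroring awk's index().
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip lines without a sequence column
        name, seq = fields[0], fields[1]
        # str.find returns -1 when absent, so adding 1 gives 0.
        yield name, seq.find(motif) + 1


# Made-up demo lines in name/seq/ann/com layout.
demo_lines = [
    "x\tacgTTTAAGAGGGGa\t...............\tdemo comment\n",
    "y\tAAAA\t....\t\n",
]
for name, pos in motif_positions(demo_lines, "TTTAAGAGGGG"):
    print(name, pos, sep="\t")
```

To process a real file, pass an open file object (or `sys.stdin`) as `lines` instead of the demo list.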


  1. Wernersson R (2005). FeatureExtract—extraction of sequence annotation made easy. Nucleic Acids Res, 33:W567-W569. DOI:10.1093/nar/gki388

See also

External links

  • FeatureExtract 1.2 Server — The webpage contains detailed instructions and examples. The most recent version of this program is downloadable from this web address.
  • gb2tab (download)