TAB file format
Revision as of 01:19, 17 October 2006
A tab file is a file format proposed by Rasmus Wernersson for annotating biological sequences.[1] It was inspired by programs and concepts developed by Søren Brunak, Kristoffer Rapacki, and Lars Juhl Jensen.
This tab file format is based on the original format proposed by Philip Lijnzaad (see: Bio::SeqIO::tab). It is a nearly raw sequence file input/output stream.
- Reads/writes
id"\t"sequence"\n"
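The one-record-per-line layout above is simple enough to read and write by hand. The following is a minimal Python sketch, not the Bio::SeqIO::tab implementation; the function names `read_tab` and `write_tab` and the sample ids are made up for illustration.

```python
import io

def read_tab(handle):
    """Yield (id, sequence) pairs from a stream of id<TAB>sequence lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        seq_id, sequence = line.split("\t", 1)
        yield seq_id, sequence

def write_tab(records, handle):
    """Write (id, sequence) pairs as id<TAB>sequence lines."""
    for seq_id, sequence in records:
        handle.write(f"{seq_id}\t{sequence}\n")

# Round trip through an in-memory stream; the ids are hypothetical.
records = [("seq1", "ATGTCTACATAT"), ("seq2", "GAAGGTATGTAA")]
buffer = io.StringIO()
write_tab(records, buffer)
buffer.seek(0)
roundtrip = list(read_tab(buffer))
```

Because each record occupies exactly one line, files in this format can also be filtered with line-oriented tools such as grep or sort.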
Example
Sequence  : ATGTCTACATATGAAGGTATGTAA
Annotation: (EEEEEEEEEEEEEE)DIIIIIII
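Since the annotation string lines up with the sequence position by position, it can be used to cut the sequence into exon and intron runs. The sketch below is one possible way to do this in Python; the helper `segments` and the `KIND` table are illustrative names, and only the CDS characters `(`, `E`, `)`, `D`, `I`, `A` from the gb2tab code table are mapped.

```python
from itertools import groupby

# Annotation characters for the CDS case, as listed in the gb2tab code table.
KIND = {"(": "exon", "E": "exon", ")": "exon",
        "D": "intron", "I": "intron", "A": "intron"}

def segments(seq, ann):
    """Split seq into (kind, start, subsequence) runs using the annotation.

    Adjacent positions of the same kind are merged into one run;
    unmapped characters are grouped under "other".
    """
    if len(seq) != len(ann):
        raise ValueError("sequence and annotation must have equal length")
    result = []
    pos = 0
    for kind, run in groupby(ann, key=lambda ch: KIND.get(ch, "other")):
        length = len(list(run))
        result.append((kind, pos, seq[pos:pos + length]))
        pos += length
    return result

seq = "ATGTCTACATATGAAGGTATGTAA"
ann = "(EEEEEEEEEEEEEE)DIIIIIII"
parts = segments(seq, ann)
# parts == [("exon", 0, "ATGTCTACATATGAAG"), ("intron", 16, "GTATGTAA")]
```

For the example above this yields one 16 bp exon block followed by an 8 bp intron fragment that starts at the donor site.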
gb2tab man page
Convert from GenBank to tab file.
gb2tab v 1.2 (command line program behind the FeatureExtract webserver)
NAME
gb2tab - extract sequence and annotation (intron/exon etc)
from GenBank format files.
SYNOPSIS
gb2tab [-f 'CDS,mRNA,...'] [options...] [files...]
DESCRIPTION
gb2tab is a tool for extracting sequence and annotation
(such as intron / exon structure) information from GenBank
format files.
This tool handles overlapping genes gracefully.
If no files are specified input is assumed to be on STDIN.
Several GenBank files can be concatenated to STDIN.
The extracted sequences are streamed to STDOUT with one
entry per line in the following format (tab separated):
name seq ann com
name: The sequence id. See the --genename, --locustag and
--entryname options below.
seq: The DNA sequence itself. UPPERCASE is used for the
main sequence, lowercase is used for flanks (if any).
ann: Single letter sequence annotation. Position for position
the annotation describes the DNA sequence: The first
letter in the annotation describes the annotation for
the first position in the DNA sequence, and so forth.
The annotation code is defined as follows:
FEATURE BLOCKS (AKA. "EXON BLOCKS")
( First position
E Exon
T tRNA exonic region
R rRNA / generic RNA exonic region
P Promoter
X Unknown feature type
) Last position
? Ambiguous first or last position
[ First UTR region position
3 3'UTR
5 5'UTR
] Last UTR region position
See also the --block-chars option, for a further
explanation of feature blocks and exonic regions.
INTRONS and FRAMESHIFTS
D First intron position (donor site)
I Intron position
A Last intron position (acceptor site)
< Start of frameshift
F Frameshift
> End of frameshift
REGIONS WITHOUT FEATURES
. NULL annotation (no annotation).
ONLY IN FLANKING REGIONS:
+ Other feature defined on the SAME STRAND
as the current entry.
- Other feature defined on the OPPOSITE STRAND
relative to the current entry.
# Multiple or overlapping features.
A..Z: Feature on the SAME STRAND as the current entry.
a..z: Feature on the OPPOSITE STRAND relative to the current entry.
See the -e option for a description of which features
are annotated in the flanking regions.
The options --flank_ann_full (default) and
--flank_ann_presence determine if full annotation
(+upper/lower case) or annotation of presence/absence
(+/- and #) is used.
com: Comments (free text). All text, extra information etc
defined in the GenBank files are concatenated into a single
comment.
The following extra information is added by this program:
*) GenBank accession ID.
*) Source (organism)
*) Feature type (e.g. "CDS" or "rRNA")
*) Strand ("+" or "-").
*) Spliced DNA sequence. Simply the DNA sequence defined
by the JOIN statement.
This is provided for two reasons. 1) To overcome negative
frameshifts. 2) As an easy way of extracting the sequence
of the spliced product. See also the --splic_always and
--flank_splic options below.
*) Spliced DNA annotation.
OPTIONS
The following options are available.
-f X, --feature_type=X
Define which feature type(s) to extract.
Default is 'CDS' which is the most general way
to annotate protein coding genes.
Multiple features can be selected by specifying a comma
separated list - for example "CDS,rRNA,tRNA".
Special keywords:
ALL: Using the keyword "ALL", will extend the list to all
feature types listed in each GenBank file.
Please notice: This can occasionally lead to problems
in files that use multiple feature types to cover the
same actual feature (e.g. uses both "gene" and "CDS").
MOST: Covers the following feature types:
CDS,3'UTR,5'UTR,
promoter,-35_signal,-10_signal,RBS,
rRNA,tRNA,snoRNA,scRNA,misc_RNA,
misc_feature
The keywords can also be included in the user specified list.
For example "MOST,novel_feature" will construct a list containing
the list mentioned above + the new feature type "novel_feature".
-e X, --flank_features=X
Define which features to annotate in flanking regions.
The scheme for specifying features is the same as in the
-f option (see above).
The default value is "MOST".
If no flanking regions are requested (see options -b and -a
below) this option is ignored.
-i, --intergenic
Extract intergenic regions. When this option is used, all
regions in between the features defined with the -f option
are extracted rather than the features themselves.
Please notice that features specified using the -e option
may be present in the intergenic regions.
Intergenic regions will always be extracted from the "+" strand.
-s, --splice
For intron containing sequences output the spliced version as
the main result (normally this information goes into the
comments). If this option is used the full length product will
be added to the comments instead.
Using this option will force the inclusion of flanks (if any)
in the spliced product. See also option --flank_splic.
-x, --spliced_only
Only output intron containing sequences. Can be used in
combination with the -s option.
-b X, --flank_before=X
Extract X basepairs upstream of each sequence.
-a X, --flank_after=X
Extract X basepairs downstream of each sequence.
-h, --help
Print this help page and exit.
-n, --dry-run
Run through all extraction steps but do not output any
data. Useful for debugging bad GenBank files in combination
with the verbose option.
-v, --verbose
Output messages about progress, details about the GenBank
file etc. to STDERR. Useful for finding errors.
-q, --quiet
Suppress all warnings, error messages and verbose info.
The exit value will still be non-zero if an error is
encountered.
--flank_ann_presence
Annotate presence/absence and relative strandness of
features in the flanking regions.
Features - of any kind - are annotated with "+" if they are
on the SAME STRAND as the extracted feature, and "-" if they
are on the OPPOSITE STRAND. "#" marks regions covered by
multiple features.
This option is very useful for use with OligoWiz-2.0
(www.cbs.dtu.dk/services/OligoWiz2).
--flank_ann_full
Default: Include full-featured annotation in the flanking regions.
Features on the SAME STRAND as the extracted feature are uppercase;
features on the OPPOSITE STRAND are lowercase.
In case of regions covered by multiple features, the
feature defined FIRST by the -e option has preference.
--flank_splic
Also include flanking regions in the spliced product.
Default is to ignore flanks.
--splic_always
Include spliced product for ALL entries.
Default is to only print spliced product information for
intron/frameshift containing entries.
--frameshift=X
"Introns" shorter than X bp (default 15bp) are considered
frameshifts. This includes negative frameshifts.
--block-chars=XYZ|"Feat1=XYZ,Feat2=ZYX,..."
Specify which characters to use for annotation of the
extracted feature types. For spliced features (e.g. CDS)
each exonic block is annotated using the specified characters.
Three characters must be supplied (for each feature type):
First position, internal positions, last position.
For example the string "(E)" will cause a 10bp feature block
(i.e. a CDS exon block) to be annotated like this: (EEEEEEEE)
Introns are filled in as DII..IIA
By default the program determines the annotation chars
based on the type of feature being extracted:
(E) CDS, mRNA
(T) tRNA
(R) rRNA, snoRNA, snRNA, misc_RNA, scRNA
(P) promoter
[5] 5'UTR
[3] 3'UTR
(X) Everything else.
This table can be expanded (and overwritten) by supplying a
list of relations between feature type and block chars.
For example:
--block-chars="mRNA=[M],gene=,repeat=QQQ"
--genename
Try to extract the gene name from the /gene="xxxx"
tag (this is usually the classical gene name, e.g. HTA1)
If this is not possible fall back to 1) locustag
or 2) entryname (see below).
--locustag
Try to extract the locus tag (usually the systematic
gene name) from the /locus_tag="xxxx" tag. Fall back
to using the entryname if not possible (see below).
This is the default behavior.
--entryname
Use the main GenBank entry name (the "LOCUS" name) as
the base of the sequence names.
KNOWN ISSUES
This program DOES NOT support entries which span multiple
GenBank files. It is very unlikely this will ever be supported.
(Please notice that the webserver version supports expanding
reference GenBank entries to the listed subentries automatically).
REFERENCE
Rasmus Wernersson, 2005.
"FeatureExtract - extraction of sequence annotation made easy".
Nucleic Acids Research, 2005, Vol. 33, Web Server issue W567-W569
WEB
http://www.cbs.dtu.dk/services/FeatureExtract
The webpage contains detailed instructions and examples.
The most recent version of this program is downloadable
from this web address.
AUTHOR
Rasmus Wernersson, raz@cbs.dtu.dk
Oct-Dec 2004
Jan-Mar 2005
Aug 2005
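The output layout described in the man page above (tab-separated name, seq, ann, com, with lowercase flanks around an UPPERCASE main sequence) can be consumed with a short script. The following Python sketch is an illustration only, not part of gb2tab: the names `Entry`, `parse_line` and `split_flanks` and the sample line are hypothetical, and it assumes that flanks appear only as contiguous lowercase runs at the two ends of the sequence and that a missing comment field should be read as empty.

```python
import re
from typing import NamedTuple

class Entry(NamedTuple):
    name: str
    seq: str
    ann: str
    com: str

def parse_line(line):
    """Split one gb2tab output line into name, seq, ann and com."""
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) == 3:
        fields.append("")  # assumption: treat a missing comment as empty
    name, seq, ann, com = fields
    if len(seq) != len(ann):
        raise ValueError(f"{name}: seq and ann differ in length")
    return Entry(name, seq, ann, com)

def split_flanks(entry):
    """Return (upstream, main, downstream), taking lowercase as flank."""
    match = re.fullmatch(r"([a-z]*)([A-Z]*)([a-z]*)", entry.seq)
    if match is None:
        raise ValueError(f"{entry.name}: unexpected case pattern")
    return match.groups()

# Made-up line with 2 bp flanks on each side of a 6 bp feature.
line = "demo1\tccATGTCTgg\t..(EEEE)..\tmade-up comment\n"
entry = parse_line(line)
upstream, main, downstream = split_flanks(entry)
```

Checking that seq and ann have the same length is a cheap sanity test, since the annotation is defined position for position against the sequence.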
References
- [1] Wernersson R (2005). FeatureExtract - extraction of sequence annotation made easy. Nucleic Acids Res, 33:W567-W569. DOI:10.1093/nar/gki388
External links
- FeatureExtract 1.2 Server
- gb2tab (download)