TAB file format
Latest revision as of 01:05, 13 July 2012
A tab file is a file format proposed by Rasmus Wernersson for annotating biological sequences.[1] It was inspired by programs and concepts developed by Søren Brunak, Kristoffer Rapacki, and Lars Juhl Jensen.
This tab file format is based on the original format proposed by Philip Lijnzaad (see: Bio::SeqIO::tab). It is a nearly raw sequence file input/output stream.
- Reads/writes
id"\t"sequence"\n"
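The two-column layout above can be read and written with a few lines of Python. This is only a minimal sketch: the function names `read_tab` and `write_tab` and the id `seq1` are hypothetical, chosen for illustration, and are not part of Bio::SeqIO::tab.

```python
import io


def read_tab(handle):
    """Yield (id, sequence) pairs from lines of the form id<TAB>sequence."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so the id is everything before it.
        seq_id, sequence = line.split("\t", 1)
        yield seq_id, sequence


def write_tab(records, handle):
    """Write (id, sequence) pairs, one record per line, tab separated."""
    for seq_id, sequence in records:
        handle.write(f"{seq_id}\t{sequence}\n")


# Round trip through an in-memory buffer ("seq1" is a made-up id).
buffer = io.StringIO()
write_tab([("seq1", "ATGTCTACATATGAAGGTATGTAA")], buffer)
records = list(read_tab(io.StringIO(buffer.getvalue())))
```

Splitting on the first tab only keeps the sketch tolerant of any extra columns that follow the sequence.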
Example
 Sequence  : ATGTCTACATATGAAGGTATGTAA
 Annotation: (EEEEEEEEEEEEEE)DIIIIIII
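Because the annotation lines up with the sequence position for position, the two strings can be walked together. The following Python sketch splits the example above into its exon and intron parts. The helper names `classify` and `segments` are hypothetical, and the grouping is deliberately simplified: it only distinguishes the characters `(E)` and `DIA` and would merge two exon blocks that touch each other.

```python
from itertools import groupby

EXON_CHARS = set("(E)")    # first, internal and last position of an exon block
INTRON_CHARS = set("DIA")  # donor site, intron position, acceptor site


def classify(symbol):
    """Map one annotation character to a coarse region type."""
    if symbol in EXON_CHARS:
        return "exon"
    if symbol in INTRON_CHARS:
        return "intron"
    return "other"


def segments(sequence, annotation):
    """Return (region type, subsequence) pairs in sequence order."""
    if len(sequence) != len(annotation):
        raise ValueError("sequence and annotation must have equal length")
    result = []
    position = 0
    for kind, run in groupby(annotation, key=classify):
        length = len(list(run))
        result.append((kind, sequence[position:position + length]))
        position += length
    return result


parts = segments("ATGTCTACATATGAAGGTATGTAA", "(EEEEEEEEEEEEEE)DIIIIIII")
```

Here `parts` holds one exon segment of 16 bases followed by one intron segment of 8 bases.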
gb2tab man page
Convert from GenBank to tab file.
gb2tab v 1.2.1 (command line program behind the FeatureExtract webserver) — extract sequence and annotation (intron/exon etc) from GenBank format files.
Synopsis / syntax
gb2tab [-f 'CDS,mRNA,...'] [options...] [files...]
Description
gb2tab is a tool for extracting sequence and annotation (such as intron / exon structure) information from GenBank format files.
This tool handles overlapping genes gracefully.
If no files are specified, input is assumed to be on STDIN. Several GenBank files can be concatenated to STDIN.
The extracted sequences are streamed to STDOUT with one entry per line in the following format (tab separated):
 name    seq    ann    com

 name: The sequence id. See the --genename, --locustag and --entryname
       options below.

 seq:  The DNA sequence itself. UPPERCASE is used for the main sequence,
       lowercase is used for flanks (if any).

 ann:  Single letter sequence annotation. Position for position, the
       annotation describes the DNA sequence: the first letter in the
       annotation describes the first position in the DNA sequence,
       and so forth.

       The annotation code is defined as follows:

       FEATURE BLOCKS (AKA. "EXON BLOCKS")
         (   First position
         E   Exon
         T   tRNA exonic region
         R   rRNA / generic RNA exonic region
         P   Promoter
         X   Unknown feature type
         )   Last position
         ?   Ambiguous first or last position

         [   First UTR region position
         3   3'UTR
         5   5'UTR
         ]   Last UTR region position

       See also the --block-chars option for a further explanation of
       feature blocks and exonic regions.

       INTRONS and FRAMESHIFTS
         D   First intron position (donor site)
         I   Intron position
         A   Last intron position (acceptor site)
         <   Start of frameshift
         F   Frameshift
         >   End of frameshift

       REGIONS WITHOUT FEATURES
         .   NULL annotation (no annotation).

       ONLY IN FLANKING REGIONS:
         +   Other feature defined on the SAME STRAND as the current entry.
         -   Other feature defined on the OPPOSITE STRAND relative to the
             current entry.
         #   Multiple or overlapping features.
         A..Z: Feature on the SAME STRAND as the current entry.
         a..z: Feature on the OPPOSITE STRAND as the current entry.

       See the -e option for a description of which features are
       annotated in the flanking regions.
       The options --flank_ann_full (default) and --flank_ann_presence
       determine whether full annotation (upper/lower case) or annotation
       of presence/absence (+/- and #) is used.

 com:  Comments (free text). All text, extra information etc. defined in
       the GenBank files are concatenated into a single comment.
       The following extra information is added by this program:

       *) GenBank accession ID.
       *) Source (organism).
       *) Feature type (e.g. "CDS" or "rRNA").
       *) Strand ("+" or "-").
       *) Spliced DNA sequence.
          Simply the DNA sequence defined by the JOIN statement.
          This is provided for two reasons: 1) to overcome negative
          frameshifts, and 2) as an easy way of extracting the sequence
          of the spliced product.
          See also the --splic_always and --flank_splic options below.
       *) Spliced DNA annotation.
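The four columns described above map naturally onto a small record type. The Python sketch below parses one output line and separates the main sequence from the flanks using the upper/lower case rule. The names `TabEntry`, `parse_entry` and `main_sequence`, as well as the sample line, are hypothetical and only serve as an illustration; they are not part of gb2tab.

```python
from typing import NamedTuple


class TabEntry(NamedTuple):
    name: str  # sequence id
    seq: str   # DNA sequence (uppercase = main sequence, lowercase = flanks)
    ann: str   # single letter annotation, same length as seq
    com: str   # free text comment


def parse_entry(line):
    """Parse one tab separated line into a TabEntry."""
    # Split into at most four fields; the comment is kept as one field.
    fields = line.rstrip("\n").split("\t", 3)
    if len(fields) < 3:
        raise ValueError("expected at least name, seq and ann columns")
    while len(fields) < 4:
        fields.append("")  # tolerate a missing comment column
    entry = TabEntry(*fields)
    if len(entry.seq) != len(entry.ann):
        raise ValueError("seq and ann must have the same length")
    return entry


def main_sequence(entry):
    """Return only the uppercase (non-flank) part of the sequence."""
    return "".join(base for base in entry.seq if base.isupper())


# Made-up entry: two flanking bases on each side of a six base feature.
line = "gene1\tacATGTCTgt\t..(EEEE)..\texample comment\n"
entry = parse_entry(line)
```

Checking that `seq` and `ann` have equal length is a cheap way to catch truncated or malformed lines early.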
Options
The following options are available.
 -f X, --feature_type=X
     Define which feature type(s) to extract.
     Default is 'CDS', which is the most general way to annotate
     protein coding genes.
     Multiple features can be selected by specifying a comma
     separated list, for example "CDS,rRNA,tRNA".

     Special keywords:
     ALL:  Using the keyword "ALL" will extend the list to all
           feature types listed in each GenBank file.
           Please note: this can occasionally lead to problems in
           files that use multiple feature types to cover the same
           actual feature (e.g. both "gene" and "CDS").
     MOST: Covers the following feature types:
           CDS,3'UTR,5'UTR,
           promoter,-35_signal,-10_signal,RBS,
           rRNA,tRNA,snoRNA,scRNA,misc_RNA,
           misc_feature

     The keyword can also be included in a user specified list.
     For example "MOST,novel_feature" will construct a list containing
     the list mentioned above plus the new feature type "novel_feature".

 -e X, --flank_features=X
     Define which features to annotate in flanking regions.
     The scheme for specifying features is the same as in the -f
     option (see above). The default value is "MOST".
     If no flanking regions are requested (see options -b and -a
     below) this option is ignored.

 -i, --intergenic
     Extract intergenic regions. When this option is used, all regions
     in between the features defined with the -f option are extracted
     rather than the features themselves. Please note that features
     specified using the -e option may be present in the intergenic
     regions. Intergenic regions will always be extracted from the
     "+" strand.

 -s, --splice
     For intron containing sequences, output the spliced version as
     the main result (normally this information goes into the
     comments). If this option is used, the full length product will
     be added to the comments instead.
     Using this option will force the inclusion of flanks (if any) in
     the spliced product. See also option --flank_splic.

 -x, --spliced_only
     Only output intron containing sequences. Can be used in
     combination with the -s option.

 -b X, --flank_before=X
     Extract X basepairs upstream of each sequence.

 -a X, --flank_after=X
     Extract X basepairs downstream of each sequence.

 -h, --help
     Print this help page and exit.

 -n, --dry-run
     Run through all extraction steps but do not output any data.
     Useful for debugging bad GenBank files in combination with the
     verbose options.

 -v, --verbose
     Output messages about progress, details about the GenBank file
     etc. to STDERR. Useful for finding errors.

 -q, --quiet
     Suppress all warnings, error messages and verbose info.
     The exit value will still be non-zero if an error is encountered.

 --flank_ann_presence
     Annotate presence/absence and relative strandness of features in
     the flanking regions. Features of any kind are annotated with "+"
     if they are on the SAME STRAND as the extracted feature, and "-"
     if they are on the OPPOSITE STRAND. "#" marks regions covered by
     multiple features.
     This option is very useful for use with OligoWiz-2.0
     (www.cbs.dtu.dk/services/OligoWiz2).

 --flank_ann_full
     Default: Include full-featured annotation in the flanking
     regions. Features on the SAME STRAND as the extracted feature are
     uppercase; features on the OPPOSITE STRAND are lowercase.
     In case of regions covered by multiple features, the feature
     defined FIRST by the -e option has preference.

 --flank_splic
     Also include flanking regions in the spliced product.
     Default is to ignore flanks.

 --splic_always
     Include the spliced product for ALL entries.
     Default is to only print spliced product information for
     intron/frameshift containing entries.

 --frameshift=X
     "Introns" shorter than X bp (default 15bp) are considered
     frameshifts. This includes negative frameshifts.

 --block-chars=XYZ|"Feat1=XYZ,Feat2=ZYX,..."
     Specify which characters to use for annotation of the extracted
     feature types. For spliced features (e.g. CDS) each exonic block
     is annotated using the specified characters.
     Three characters must be supplied (for each feature type):
     first position, internal positions, last position.

     For example the string "(E)" will cause a 10bp feature block
     (i.e. a CDS exon block) to be annotated like this:
         (EEEEEEEE)
     Introns are filled in as DII..IIA

     By default the program determines the annotation chars based on
     the type of feature being extracted:
         (E)  CDS, mRNA
         (T)  tRNA
         (R)  rRNA, snoRNA, snRNA, misc_RNA, scRNA
         (P)  promoter
         [5]  5'UTR
         [3]  3'UTR
         (X)  Everything else.

     This table can be expanded (and overwritten) by supplying a list
     of relations between feature type and block chars. For example:
         --block-chars="mRNA=[M],gene=,repeat=QQQ"

 --genename
     Try to extract the gene name from the /gene="xxxx" tag (this is
     usually the classical gene name, e.g. HTA1).
     If this is not possible fall back to 1) locustag or 2) entryname
     (see below).

 --locustag
     Try to extract the locus tag (usually the systematic gene name)
     from the /locus_tag="xxxx" tag. Fall back to using the entryname
     if not possible (see below). This is the default behavior.

 --entryname
     Use the main GenBank entry name (the "LOCUS" name) as the base of
     the sequence names.
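The three-character rule described for --block-chars (first position, internal positions, last position) can be illustrated with a short Python sketch. The functions `annotate_block` and `annotate_intron` are hypothetical and do not reproduce gb2tab's actual code. How gb2tab treats blocks shorter than two positions is not described here, so the sketch simply rejects them.

```python
def annotate_block(block_chars, length):
    """Build the annotation for one feature block of the given length.

    block_chars holds three characters: first, internal and last position.
    """
    if len(block_chars) != 3:
        raise ValueError("exactly three block characters are required")
    if length < 2:
        # Assumption: very short blocks are out of scope for this sketch.
        raise ValueError("this sketch only handles blocks of length >= 2")
    first, inner, last = block_chars
    return first + inner * (length - 2) + last


def annotate_intron(length):
    """Introns use D (donor site), I (intron position), A (acceptor site)."""
    return annotate_block("DIA", length)


exon = annotate_block("(E)", 10)  # a 10bp CDS exon block
intron = annotate_intron(6)
```

For the 10bp block from the option text, `exon` is `(EEEEEEEE)`, and a 6bp intron becomes `DIIIIA`.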
Known issues
This program DOES NOT support entries that span multiple GenBank files. It is very unlikely that this will ever be supported. (Please note that the webserver version supports expanding reference GenBank entries to the listed subentries automatically.)
Example output
- Sample Output (http://www.cbs.dtu.dk/services/FeatureExtract/datasets/SampleOutput.html)
Example usage
The tab file format is extremely convenient for string matching with regex. For example, if we wanted to search for the position of the string of nucleotides "TTTAAGAGGGG" in the file yeast_genome.with_introns.tab (found on the FeatureExtract 1.2 Server website), we could issue the following command:
 cat yeast_genome.with_introns.tab |\
 gawk '{split($0,s,"\t"); print s[2]}' |\
 gawk '{print index($0,"TTTAAGAGGGG")}' -
It should return "333", the position of the first "T" of the match in the sequence.
References
- [1] Wernersson R (2005). FeatureExtract—extraction of sequence annotation made easy. Nucleic Acids Res, 33:W567-W569. DOI:10.1093/nar/gki388
See also
- HOW file format
- The "Stockholm" format — a system for marking up features in a multiple alignment.
- Column file format
External links
- FeatureExtract 1.2 Server — The webpage contains detailed instructions and examples. The most recent version of this program is downloadable from this web address.
- gb2tab (download)