Difference between revisions of "LFasta"
Revision as of 07:24, 5 August 2007
The Labeled Fasta (LFasta / lfa) format was invented by Anders Krogh. It is a variant of the FASTA format.
Syntax
Note: The description below was taken directly from his homepage.
In many applications of biological sequence analysis, a label is associated with each letter in a sequence. For instance, for the secondary structure of proteins you may put an 'H' for an alpha helix, an 'E' for an (extended) beta sheet and, say, an 'x' for anything else. The 'Labeled FASTA format' allows for a string of such labels (or more than one string). For the secondary structure example it would look like this:
>1IRK._ TRANSFERASE
SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
# xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
# HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
# HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
# HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
NLLKDDLHPSFPEVSFFHSEENK
# HHHxxxxxxxHHHHxxxxxxxxx
Here the '#' precedes the sequence of labels. It does not matter whether each sequence line is followed directly by its label line or whether several sequence lines come before their labels, so this, for instance, would do as well:
>1IRK._ TRANSFERASE
SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
# xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
# HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
# HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
# HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
NLLKDDLHPSFPEVSFFHSEENK
# HHHxxxxxxxHHHHxxxxxxxxx
The label sequence and the protein (or DNA) sequence must have the same length.
Format Specification
- Each entry starts with a '>' as the first character on a line immediately followed by the name of the entry (like FASTA format). The rest of the first line is ignored. An entry is terminated by EOF or a new entry ('>').
- The following lines contain a number of sequences of the same length.
- Lines starting with '#' are 'primary' labels.
- Lines starting with '?' followed by a letter are other labels identified by that letter.
- Lines starting with '%' are comment lines (ignored).
- Other lines contain the primary sequence (usually protein, DNA or RNA).
- After deletion of '#' or '?x' all blanks are deleted from all sequences.
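The rules above can be sketched as a small parser. This is only an illustration in Python, not the code behind any of the scripts or modules on this page. The function name `parse_lfasta`, the dict layout, the toy entry, the skipping of blank lines, and taking the entry name to be the first whitespace-delimited token after '>' are all assumptions made for the example.

```python
def parse_lfasta(text):
    """Parse labeled-FASTA text into a list of entry dicts (illustrative sketch)."""
    entries = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are skipped
        if line.startswith(">"):
            # The name follows '>' immediately; the rest of the line is ignored.
            fields = line[1:].split()
            current = {"name": fields[0] if fields else "", "sequence": "", "labels": {}}
            entries.append(current)
            continue
        if current is None or line.startswith("%"):
            continue  # comment line, or text before the first entry (assumption)
        if line.startswith("#"):
            key, body = "#", line[1:]        # primary labels
        elif line.startswith("?") and len(line) > 1:
            key, body = line[:2], line[2:]   # other labels, keyed by '?' plus one character
        else:
            key, body = None, line           # primary sequence
        body = "".join(body.split())         # delete all blanks
        if key is None:
            current["sequence"] += body
        else:
            current["labels"][key] = current["labels"].get(key, "") + body
    # Every label sequence must be as long as the primary sequence.
    for entry in entries:
        for key, labels in entry["labels"].items():
            if len(labels) != len(entry["sequence"]):
                raise ValueError(
                    f"{entry['name']}: label track {key!r} has length {len(labels)}, "
                    f"sequence has length {len(entry['sequence'])}"
                )
    return entries


# Toy entry, made up for this example.
example = """>toy1 made-up entry
ACDEFG
# xxHHHx
?1 ..HHH.
% a comment
HIK
# xxx
?1 ...
"""
entries = parse_lfasta(example)
```

Because the lines are concatenated per track, the sketch accepts both layouts shown earlier: alternating sequence and label lines, or several sequence lines followed by their labels.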
Here's an example with three label sequences: the primary labels ('#'), one ('?1') that shows the DSSP secondary structure annotation, and another ('?2') that shows the helices only:
>1IRK._ TRANSFERASE
SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
# xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
?1 ......GGGB..GGGEEEEEEEEE.SSSEEEEEEEEEEETTEEEEEEEEE...TT..HHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHHHHHHH
MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
# HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
?1 HTT...TTB..EEEEE.SSSS.EEEEE..TT.BHHHHHHHTSTT.TT..S..S..HHHHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
# HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
?1 HHHHHTT...S..SGGGEEE.TT..EEE...S.SSSTTGGG.EEGGGSSEE.GGG..HHHHHH....HHH
?2 HHHHHxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHxxxxHHH
DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
# HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
?1 HHHHHHHHHHHHHHTS..TTTTS.HHHHHHHHHTT......SS..HHHHHHHHHHT.SSGGGS..HHHHH
?2 HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
NLLKDDLHPSFPEVSFFHSEENK
# HHHxxxxxxxHHHHxxxxxxxxx
?1 HHHGGGS.TTHHHH.STTSTT..
?2 HHHxxxxxxxHHHHxxxxxxxxx
LFasta related Scripts
See the Script List for the complete list.
- reformat.pl
- This script converts between sequence formats. It handles GenBank, EMBL, FASTA and all the other formats supported by BioPerl. In addition, it formats to labeled fasta (lfa), which is a handy extension of the FASTA format developed by Anders Krogh for use in HMM training. The labeling is generated from the sequence features in a manner directed by the --labelkey option. The information surplus or deficit when formatting between rich formats like EMBL and FASTA can be handled by using the gff option. This specifies a GFF file that is read from or written to, depending on which way the formatting goes.
- grepseq.pl
- Extracts sub-sequences from sequences on STDIN based on a (Perl) regular expression given on the command line. Input sequences are read in labeled fasta (lfa) format. By default the labels are searched using the regexp. Note that the IDs on the output are made unique by adding an incrementing suffix for each match in an entry. This can be avoided by using the keepid option.
- addprediction.pl
- This script adds a prediction track to LFasta (lfa) entries as specified by a GFF file. This is useful for comparing predictions.
- untangle.pl
- This script untangles LFasta (lfa) as it comes out if you treat it as ordinary FASTA in a Seq or SeqIO object.
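The idea behind searching the labels with a regular expression, as described for grepseq.pl, can be illustrated with a short Python sketch. It is not a port of grepseq.pl: the function name `grep_labels`, the `keep_id` parameter, the "_1", "_2" suffix style and the toy data are assumptions for the example, and the actual script may format IDs differently.

```python
import re


def grep_labels(entry_id, sequence, labels, pattern, keep_id=False):
    """Return (id, sub-sequence, sub-labels) for every regex match in the labels."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    results = []
    count = 0
    for match in re.finditer(pattern, labels):
        if match.start() == match.end():
            continue  # skip empty matches, e.g. from a pattern like 'H*'
        count += 1
        # Hypothetical suffix style; it only makes the IDs unique per match.
        out_id = entry_id if keep_id else f"{entry_id}_{count}"
        # Label positions map one-to-one onto sequence positions.
        results.append((out_id, sequence[match.start():match.end()], match.group(0)))
    return results


# Toy data, made up for this example: pull out the residues labeled as helix.
hits = grep_labels("toy1", "ACDEFGHIKLMN", "xxHHHxxxHHxx", r"H+")
for hit in hits:
    print(hit)
```

The sketch relies on the rule that the label sequence and the primary sequence have the same length, so a match span in the labels can be used directly as a slice of the sequence.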
LFasta Modules
- LFasta
- An LFasta (L for labeled) object is a sequence with sequence features placed on it. The LFasta format is a hybrid between the simple Fasta format and rich formats such as GenBank, EMBL and Swissprot. Along with the sequence it holds any information that maps directly to the plus strand of the sequence. The features are held on one or more label lines for each sequence line. A letter represents a type of feature, e.g. E for exons, H for helix and so on. This gives LFasta the "grepability" of the Fasta format and a sequence feature richness comparable to the rich Seq formats.
- LFastaIO
- LFastaIO is to LFasta what SeqIO is to Seq. It works in much the same way, but only supports the filehandle emulation for input and output, so LFastaIO->new corresponds to Bio::SeqIO->newFh. As of now the module only supports LFasta as input format. For output formats other than LFasta, it uses the facilities of SeqIO.
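To make the relation between label lines and sequence features concrete, here is a minimal Python sketch that converts a label string into feature intervals and back. It does not reproduce the API of the LFasta or LFastaIO modules. The function names, the 1-based inclusive coordinates, and the choice of 'x' as the background label are assumptions for the example.

```python
from itertools import groupby


def labels_to_features(labels, background="x"):
    """Turn a label string into (symbol, start, end) tuples, 1-based inclusive."""
    features = []
    pos = 0
    for symbol, run in groupby(labels):
        length = len(list(run))
        if symbol != background:
            features.append((symbol, pos + 1, pos + length))
        pos += length
    return features


def features_to_labels(features, length, background="x"):
    """Draw (symbol, start, end) features onto a label track of the given length."""
    track = [background] * length
    for symbol, start, end in features:
        if start < 1 or end > length or start > end:
            raise ValueError(f"feature {symbol!r} {start}-{end} lies outside 1-{length}")
        track[start - 1:end] = symbol * (end - start + 1)
    return "".join(track)


# Toy label line, made up for this example.
features = labels_to_features("xxHHHxxEEx")
print(features)
print(features_to_labels(features, 10))
```

Going from intervals to a label track is also, in spirit, what adding a prediction track from a GFF file amounts to, although the sketch reads no GFF and only handles non-overlapping features on a single track.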
See also
External links
- Professor Anders Krogh
- Labeled Fasta (LFasta) format — by Kasper Munch