LFasta
Revision as of 07:17, 5 August 2007
The Labeled Fasta (LFasta) format was invented by Anders Krogh. It is a variant of the FASTA format.
Syntax
Note: The description below was taken directly from Anders Krogh's homepage.
In many applications of biological sequence analysis, some label is associated with each letter in a sequence. For instance, for the secondary structure of proteins you may put an 'H' for an alpha helix, an 'E' for (extended) beta sheet and, say, 'x' for anything else. The 'Labeled FASTA format' allows for a string of such labels (or more than one string). For the secondary structure example it would look like this:
>1IRK._ TRANSFERASE
SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
# xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
# HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
# HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
# HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
NLLKDDLHPSFPEVSFFHSEENK
# HHHxxxxxxxHHHHxxxxxxxxx
Here the '#' precedes the sequence of labels. It doesn't matter whether the entire sequence comes before the labels or whether the lines are mixed, so this, for instance, would do as well:
>1IRK._ TRANSFERASE
SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
# xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
# HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
# HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
# HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
NLLKDDLHPSFPEVSFFHSEENK
# HHHxxxxxxxHHHHxxxxxxxxx
The label sequence and the protein (or DNA) sequence must have the same length.
Format Specification
- Each entry starts with a '>' as the first character on a line immediately followed by the name of the entry (like FASTA format). The rest of the first line is ignored. An entry is terminated by EOF or a new entry ('>').
- The following lines contain a number of sequences of the same length.
- Lines starting with '#' are 'primary' labels.
- Lines starting with '?' followed by a letter are other labels identified by that letter.
- Lines starting with '%' are comment lines (ignored).
- Other lines contain the primary sequence (usually protein, DNA or RNA).
- After deletion of '#' or '?x' all blanks are deleted from all sequences.
Here's an example with three label sequences: the primary labels (#), one (?1) showing the DSSP secondary structure annotation, and one (?2) showing the helices only:
>1IRK._ TRANSFERASE
SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
# xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
?1 ......GGGB..GGGEEEEEEEEE.SSSEEEEEEEEEEETTEEEEEEEEE...TT..HHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHHHHHHH
MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
# HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
?1 HTT...TTB..EEEEE.SSSS.EEEEE..TT.BHHHHHHHTSTT.TT..S..S..HHHHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
# HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
?1 HHHHHTT...S..SGGGEEE.TT..EEE...S.SSSTTGGG.EEGGGSSEE.GGG..HHHHHH....HHH
?2 HHHHHxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHxxxxHHH
DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
# HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
?1 HHHHHHHHHHHHHHTS..TTTTS.HHHHHHHHHTT......SS..HHHHHHHHHHT.SSGGGS..HHHHH
?2 HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
NLLKDDLHPSFPEVSFFHSEENK
# HHHxxxxxxxHHHHxxxxxxxxx
?1 HHHGGGS.TTHHHH.STTSTT..
?2 HHHxxxxxxxHHHHxxxxxxxxx
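The format rules above can be sketched as a small reader. This is a minimal Python illustration written for this page, not code from Anders Krogh or from the Perl scripts listed here. The function name parse_lfasta, the returned dictionary layout, and the choices to skip blank lines, to ignore text before the first '>' and to accept any single character after '?' as a track identifier (the example uses the digits 1 and 2) are all assumptions.

```python
def _squash(parts):
    """Join line fragments and delete all blanks, as the format rules require."""
    return "".join("".join(parts).split())


def parse_lfasta(text):
    """Parse Labeled Fasta text into a list of entries (hypothetical layout).

    Each entry is a dict with 'name', 'sequence' and 'labels', where 'labels'
    maps '#' (primary labels) or the character after '?' to a label string.
    """
    raw_entries = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are skipped
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            # The name follows '>' directly; the rest of the line is ignored.
            current = {"name": fields[0] if fields else "", "seq": [], "labels": {}}
            raw_entries.append(current)
        elif current is None:
            continue  # assumption: text before the first entry is ignored
        elif line.startswith("%"):
            continue  # comment line
        elif line.startswith("#"):
            current["labels"].setdefault("#", []).append(line[1:])
        elif line.startswith("?") and len(line) > 1:
            current["labels"].setdefault(line[1], []).append(line[2:])
        else:
            current["seq"].append(line)

    entries = []
    for raw in raw_entries:
        sequence = _squash(raw["seq"])
        labels = {key: _squash(parts) for key, parts in raw["labels"].items()}
        for key, value in labels.items():
            # Label sequences must have the same length as the primary sequence.
            if len(value) != len(sequence):
                raise ValueError(
                    f"entry {raw['name']!r}: label track {key!r} has length "
                    f"{len(value)}, sequence has length {len(sequence)}"
                )
        entries.append({"name": raw["name"], "sequence": sequence, "labels": labels})
    return entries


# A tiny made-up entry with mixed sequence and label lines.
example = """>demo made-up entry
% this comment line is ignored
ACDE FGH
# xxHH HHx
?1 ..HH HH.
IK
# xx
?1 ..
"""
entries = parse_lfasta(example)
```

Because fragments are collected per track and joined at the end, the reader accepts both the interleaved and the grouped layouts shown earlier and produces the same result for either.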
LFasta related Scripts
See the Script List (http://www.binf.ku.dk/~kasper/wiki/ScriptList.html) for the complete list.
- reformat.pl
- This script reformats between sequence formats. It handles GenBank, EMBL, Fasta and all the other formats supported by BioPerl. In addition, it writes labeled Fasta (lfa), a handy extension of the Fasta format developed by Anders Krogh for use in HMM training. The labeling is generated from the sequence features in a manner directed by the --labelkey option. The information surplus or deficit that arises when converting between a rich format like EMBL and a simple one like Fasta can be handled with the gff option, which specifies a GFF file that is read from or written to, depending on the direction of the conversion.
- grepseq.pl
- Extracts sub-sequences from sequences on stdin based on a (Perl) regular expression given on the command line. Input sequences must be in labeled Fasta format. By default the regular expression is matched against the labels. Note that the IDs in the output are made unique by adding an incrementing suffix for each match in an entry. This can be avoided by using the keepid option.
- addprediction.pl
- This script adds a prediction track to labeled Fasta entries as specified by a GFF file. This is useful for comparing predictions.
- untangle.pl
- This script untangles Labeled Fasta that has been read as ordinary Fasta through a Seq or SeqIO object.
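The behaviour described for grepseq.pl and addprediction.pl can be illustrated with two short Python functions. These are hypothetical sketches of the ideas only, not ports of the Perl scripts. The function names, the "_1", "_2" form of the ID suffix, the 'x' background label and the mapping from GFF feature type to label letter are all assumptions made for the example. The only external fact relied on is that GFF lines are tab-separated with the feature type in column 3 and 1-based inclusive start and end coordinates in columns 4 and 5.

```python
import re


def grep_labels(name, sequence, labels, pattern, keep_id=False):
    """Return (id, sub-sequence, sub-labels) for each regex match in the labels."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    hits = []
    for match in re.finditer(pattern, labels):
        start, end = match.span()
        if start == end:
            continue  # skip empty matches
        # Assumed suffix style: name_1, name_2, ... unless keep_id is set.
        hit_id = name if keep_id else f"{name}_{len(hits) + 1}"
        hits.append((hit_id, sequence[start:end], labels[start:end]))
    return hits


def prediction_track(seq_length, gff_lines, type_to_label, background="x"):
    """Build a label string of length seq_length from GFF feature lines."""
    track = [background] * seq_length
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comment/directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a feature line
        letter = type_to_label.get(cols[2])
        if letter is None:
            continue  # feature type without an assigned label letter
        start, end = int(cols[3]), int(cols[4])
        if start < 1 or end > seq_length or start > end:
            raise ValueError(f"feature {start}-{end} is outside 1-{seq_length}")
        # Convert 1-based inclusive coordinates to a Python slice.
        track[start - 1:end] = [letter] * (end - start + 1)
    return "".join(track)


# Made-up data: extract the helix segments, then label two predicted exons.
helices = grep_labels("demo", "ACDEFGHIK", "xxHHHxxHH", r"H+")
gff = [
    "seq1\tpred\texon\t3\t5\t.\t+\t.\tID=e1",
    "seq1\tpred\texon\t8\t9\t.\t+\t.\tID=e2",
]
track = prediction_track(10, gff, {"exon": "E"})
```

A track built this way could be written out as an additional '?' label line next to the existing labels, which makes two sets of predictions easy to compare by eye.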
LFasta Modules
- LFasta
- An LFasta (L for labeled) object is a sequence with sequence features placed on it. The LFasta format is a hybrid between the simple Fasta format and rich formats such as GenBank, EMBL and Swissprot. Along with the sequence, it holds any information that maps directly to the plus strand of the sequence. The features are held on one or more label lines for each sequence line. A letter represents a type of feature, e.g. E for exons, H for helix and so on. This gives LFasta the "grepability" of the Fasta format and a sequence feature richness comparable to the rich Seq formats.
- LFastaIO
- LFastaIO is to LFasta what SeqIO is to Seq. It works in much the same way, but only supports the filehandle emulation for input and output, so LFastaIO->new corresponds to Bio::SeqIO->newFh. Currently the module only supports LFasta as an input format. For output formats other than LFasta, it uses the facilities of SeqIO.
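To illustrate how one letter per position encodes features, the following Python sketch turns a label line back into a list of features. It is not part of the LFasta or LFastaIO modules, which are Perl. The function name label_runs, the 'x' background label and the 1-based inclusive coordinates in the output are assumptions chosen for the example.

```python
from itertools import groupby


def label_runs(labels, background="x"):
    """Return (letter, start, end) for each run of a non-background label.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    features = []
    position = 1
    for letter, group in groupby(labels):
        length = len(list(group))
        if letter != background:
            features.append((letter, position, position + length - 1))
        position += length
    return features


# Made-up label line: one exon-like run and one helix-like run.
features = label_runs("xxEEExxHHx")
```

Each run of identical letters becomes one feature, so the label line carries the same positional information as a feature table while remaining searchable as plain text.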
External links
- Professor Anders Krogh
- Labeled Fasta (LFasta) format — by Kasper Munch