LFasta

From Christoph's Personal Wiki
Revision as of 01:39, 13 July 2012 by Christoph

The Labeled Fasta (LFasta / lfa) format was invented by Anders Krogh. It is a variant of the FASTA format.

Syntax

Note: The description below was taken directly from his homepage.

In many applications of biological sequence analysis, some label is associated with each letter in a sequence. For instance, for the secondary structure of proteins you may put an 'H' for an alpha helix, an 'E' for an (extended) beta sheet and, say, an 'x' for anything else. The 'Labeled FASTA format' allows for a string of such labels (or more than one string). For the secondary structure example it would look like this:

>1IRK._ TRANSFERASE
   SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
#  xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
   MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
#  HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
   AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
#  HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
   DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
#  HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
   NLLKDDLHPSFPEVSFFHSEENK
#  HHHxxxxxxxHHHHxxxxxxxxx

Here the '#' precedes the sequence of labels. It doesn't matter whether the entire sequence comes before the labels or if the lines are mixed, so this for instance would do as well:

>1IRK._ TRANSFERASE
   SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
   MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
   AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
#  xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
#  HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
#  HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
   DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
#  HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
   NLLKDDLHPSFPEVSFFHSEENK
#  HHHxxxxxxxHHHHxxxxxxxxx

The label sequence and the protein (or DNA) sequence must have the same length.
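
The layout shown above can be produced with a few lines of Python. The following is only a minimal sketch: the function name format_lfasta and its parameters (name, sequence, labels, width) are hypothetical and not part of any existing LFasta tool. It writes one sequence line followed by one '#' label line per chunk and refuses input where the two strings differ in length.

```python
def format_lfasta(name, sequence, labels, width=70):
    """Return one interleaved labeled-FASTA entry as a string.

    Hypothetical helper for illustration; `width` is the number of
    letters per line (70 matches the examples on this page).
    """
    if len(sequence) != len(labels):
        raise ValueError(
            f"sequence has {len(sequence)} letters but labels has {len(labels)}"
        )
    lines = [f">{name}"]
    for start in range(0, len(sequence), width):
        # Sequence lines get a blank prefix, primary label lines a '#' prefix.
        lines.append("   " + sequence[start:start + width])
        lines.append("#  " + labels[start:start + width])
    return "\n".join(lines) + "\n"


entry = format_lfasta("demo", "SSVFVPDEWE", "xxxxxEEEEE", width=5)
print(entry, end="")
# >demo
#    SSVFV
# #  xxxxx
#    PDEWE
# #  EEEEE
```

The three-character prefix is purely cosmetic, since blanks are ignored when the entry is read back.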

Format specification

  • Each entry starts with a '>' as the first character on a line immediately followed by the name of the entry (like FASTA format). The rest of the first line is ignored. An entry is terminated by EOF or a new entry ('>').
  • The following lines contain a number of sequences of the same length.
  • Lines starting with '#' are 'primary' labels.
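  • Lines starting with '?' followed by a single identifying character (a letter in the original description; the example below uses the digits 1 and 2) are other labels identified by that character.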
  • Lines starting with '%' are comment lines (ignored).
  • Other lines contain the primary sequence (usually protein, DNA or RNA).
  • After the leading '#' or '?x' marker has been removed, all blanks are deleted from all sequences.
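
The rules above can be sketched as a small Python parser. This is an illustration under stated assumptions, not a reference implementation: the name parse_lfasta and the dictionary layout are hypothetical, the entry name is assumed to end at the first whitespace, blank lines are skipped, and any character after '?' is accepted as a track identifier. The length check reflects the requirement that all sequences in an entry have the same length.

```python
def parse_lfasta(text):
    """Parse labeled FASTA text into a list of entry dicts.

    Each dict has 'name', 'sequence' and 'labels'; 'labels' maps '#'
    (primary labels) or '?x' (other labels) to the joined label string.
    """
    entries = []
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            # Assumption: the name is the first token; the rest is ignored.
            fields = line[1:].split()
            current = {"name": fields[0] if fields else "",
                       "sequence": "", "labels": {}}
            entries.append(current)
            continue
        if current is None or line.startswith("%") or not line.strip():
            continue  # text before the first entry, comment, or blank line
        if line.startswith("#"):
            key, data = "#", line[1:]
        elif line.startswith("?") and len(line) > 1:
            key, data = line[:2], line[2:]
        else:
            key, data = None, line
        data = "".join(data.split())  # delete all blanks
        if key is None:
            current["sequence"] += data
        else:
            current["labels"][key] = current["labels"].get(key, "") + data
    for entry in entries:
        for key, labels in entry["labels"].items():
            if len(labels) != len(entry["sequence"]):
                raise ValueError(
                    f"{entry['name']}: track {key} has {len(labels)} labels "
                    f"for {len(entry['sequence'])} letters"
                )
    return entries


interleaved = """>demo example entry
   SSVFV
#  xxxxx
?1 ..GGG
   PDEWE
#  EEEEE
?1 B..GG
"""
blocked = """>demo
% a comment line
   SSVFV
   PDEWE
#  xxxxx
#  EEEEE
?1 ..GGG
?1 B..GG
"""
a = parse_lfasta(interleaved)[0]
b = parse_lfasta(blocked)[0]
print(a == b)  # True: the order of sequence and label lines does not matter
```

Collecting each track by its prefix is what makes the interleaved and the blocked layout equivalent.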

Here's an example with 3 label sequences: the primary labels (#), a second track (?1) that shows the DSSP secondary structure annotation, and a third (?2) that shows the helices only:

>1IRK._ TRANSFERASE

   SSVFVPDEWEVSREKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASV
#  xxxxxxxxxxxxxxxEEEEEEEEExxxxEEEEEEEEEEExxEEEEEEEEExxxxxxxHHHHHHHHHHHHH
?1 ......GGGB..GGGEEEEEEEEE.SSSEEEEEEEEEEETTEEEEEEEEE...TT..HHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHHHHHHH

   MKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMAAEIADGM
#  HxxxxxxxxxxEEEEExxxxxxEEEEExxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH
?1 HTT...TTB..EEEEE.SSSS.EEEEE..TT.BHHHHHHHTSTT.TT..S..S..HHHHHHHHHHHHHHH
?2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHxxxxxxxxxxxxxxxHHHHHHHHHHHHHHH

   AYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMAPESLKDGVFTTSS
#  HHHHHxxxxxxxxxxxxEEExxxxxEEExxxxxxxxxxxxxxEExxxxxEExxxxxxHHHHHHxxxxHHH
?1 HHHHHTT...S..SGGGEEE.TT..EEE...S.SSSTTGGG.EEGGGSSEE.GGG..HHHHHH....HHH
?2 HHHHHxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHxxxxHHH

   DMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDNCPERVTDLMRMCWQFNPKMRPTFLEIV
#  HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH
?1 HHHHHHHHHHHHHHTS..TTTTS.HHHHHHHHHTT......SS..HHHHHHHHHHT.SSGGGS..HHHHH
?2 HHHHHHHHHHHHHHxxxxxxxxxxHHHHHHHHHxxxxxxxxxxxxHHHHHHHHHHxxxxxxxxxxHHHHH

   NLLKDDLHPSFPEVSFFHSEENK
#  HHHxxxxxxxHHHHxxxxxxxxx
?1 HHHGGGS.TTHHHH.STTSTT..
?2 HHHxxxxxxxHHHHxxxxxxxxx

LFasta related scripts

See: Script List for the complete list.
reformat.pl 
This script reformats between sequence formats. It handles GenBank, EMBL, FASTA and all the other formats supported by BioPerl. In addition, it can write labeled fasta (lfa), a handy extension of the FASTA format developed by Anders Krogh for use in HMM training. The labeling is generated from the sequence features in a manner directed by the --labelkey option. The information surplus or deficit when converting between a rich format like EMBL and FASTA can be handled with the gff option, which specifies a GFF file that is read from or written to depending on the direction of the conversion.
grepseq.pl 
Extracts sub-sequences from sequences on STDIN based on a (Perl) regular expression given on the command line. Input sequences are expected in labeled fasta (lfa) format. By default the regular expression is matched against the labels. Note that the IDs in the output are made unique by adding an incrementing suffix for each match in an entry. This can be avoided by using the keepid option.
addprediction.pl 
This script adds a prediction track to LFasta (lfa) entries as specified by a GFF file. This is useful for comparing predictions.
untangle.pl 
This script untangles LFasta (lfa) data as it comes out when it has been treated as ordinary FASTA in a Seq or SeqIO object.
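
The idea of searching the labels with a regular expression, as described for grepseq.pl, can be illustrated in Python. This is not the implementation of grepseq.pl (which is a Perl script whose internals are not shown here); the function extract_by_label and its arguments are hypothetical. Because labels and sequence have the same length, a match position in the label string maps directly onto the sequence.

```python
import re


def extract_by_label(sequence, labels, pattern):
    """Yield (start, end, subsequence) for each regex match in the labels."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    for match in re.finditer(pattern, labels):
        start, end = match.span()
        # The same coordinates are valid in the sequence string.
        yield start, end, sequence[start:end]


sequence = "SSVFVPDEWEVSREK"
labels = "xxxHHHHxxEEExxx"
helices = list(extract_by_label(sequence, labels, r"H+"))
strands = list(extract_by_label(sequence, labels, r"E+"))
print(helices)  # [(3, 7, 'FVPD')]
print(strands)  # [(9, 12, 'EVS')]
```

This direct coordinate mapping is what the module description below calls the "grepability" of the format.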

LFasta modules

LFasta 
An LFasta (L for labeled) object is a sequence with sequence features placed on it. The LFasta format is a hybrid between the simple Fasta format and rich formats such as GenBank, EMBL and Swissprot. Along with the sequence it holds any information that maps directly to the plus strand of the sequence. The features are held on one or more label lines for each sequence line. Each letter represents a feature type, e.g. E for exons, H for helix and so on. This gives LFasta the "grepability" of the Fasta format and a sequence feature richness comparable to the rich Seq formats.
LFastaIO 
LFastaIO is to LFasta what SeqIO is to Seq. It works in much the same way, but only supports the filehandle emulation for input and output, so LFastaIO->new corresponds to Bio::SeqIO->newFh. As of now the module only supports LFasta as input format. For output formats other than LFasta, it uses the facilities of SeqIO.
