Difference between revisions of "HOW file format"

Latest revision as of 03:07, 24 June 2007

The HOW file format (or just HOW file) is the sequence input format used by the how program.

Data format

Note: The following is taken directly from Søren Brunak's how man page.

For input data the how program uses a specially developed sequence format common to nucleotide and amino acid sequences. Traditionally, the files are given names with the extensions .seq or .how. The format is described below.

A HOW file consists of entries containing one sequence each. Each entry has the following three parts:

1. Header line with the following syntax:
   - Sequence length, in 6 positions, right adjusted;
   - Exactly one SPACE;
   - Sequence name, in at most 20 positions, left adjusted;
   - Optional comment, separated from the sequence name by whitespace.
2. Sequence data, in one-letter amino acid or nucleotide code, in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols.
3. Assignment data, in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols. Each symbol in the assignment data corresponds to exactly one sequence residue, so the number of assignment symbols must equal the sequence length.

Both the sequence and assignment data lines may contain other data on the positions from 81 on (e.g., line numbering and comments). Such data is ignored by the how program.
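
The layout described above can be read with a short parser. The following is a minimal Python sketch, not the how program's own reader. The names `parse_how` and `_read_block` are hypothetical, and the header is split on whitespace instead of fixed columns, an assumption made so that length fields of differing widths are accepted.

```python
LINE_WIDTH = 80  # data symbols per full line, as stated in the format description


def _read_block(lines, start, length):
    """Collect `length` symbols from consecutive lines, beginning at index `start`.

    Only the first 80 columns of each line are used, so line numbers or
    comments from position 81 on are dropped.
    Returns the joined block and the index of the next unread line.
    """
    chunks = []
    remaining = length
    i = start
    while remaining > 0:
        if i >= len(lines):
            raise ValueError("file ended before the block was complete")
        take = min(LINE_WIDTH, remaining)
        chunks.append(lines[i][:take])
        remaining -= take
        i += 1
    return "".join(chunks), i


def parse_how(text):
    """Parse HOW-formatted text into a list of dicts, one per entry."""
    # Assumption: blank lines carry no information and can be skipped.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    entries = []
    i = 0
    while i < len(lines):
        # Header: length, name, optional comment (split on whitespace).
        fields = lines[i].split(None, 2)
        if len(fields) < 2:
            raise ValueError(f"malformed header line: {lines[i]!r}")
        length = int(fields[0])
        name = fields[1]
        comment = fields[2].strip() if len(fields) > 2 else ""
        sequence, i = _read_block(lines, i + 1, length)
        assignment, i = _read_block(lines, i, length)
        # The assignment must have exactly one symbol per residue.
        if len(sequence) != length or len(assignment) != length:
            raise ValueError(f"entry {name!r}: block length does not match header")
        entries.append({
            "name": name,
            "comment": comment,
            "length": length,
            "sequence": sequence,
            "assignment": assignment,
        })
    return entries
```

Calling `parse_how(open(path).read())` on a file returns one dictionary per entry. A `ValueError` is raised when a sequence or assignment block does not contain as many symbols as the header announces.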

Secondary structure format (DSSP)

The secondary structure format uses the DSSP assignment [1]:

G - 3-10 helix
I - pi-helix
H - alpha-helix
E - extended beta-sheet
B - beta-bridge
T - hydrogen-bonded turn
S - bend
L - other/loop
. - unassigned

Note: Prediction servers typically use just three categories, H, E, and L, where L covers everything that is neither H nor E. Sometimes G and I are merged into H, and sometimes B is merged into E. L then denotes the rest, as it does for the PredictProtein server.
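
The reduction to three categories described in the note can be sketched in Python as follows. The function name `reduce_to_three_states` and its two flags are hypothetical. The merges are optional because, as the note says, servers differ on whether G and I count as helix and whether B counts as strand.

```python
def reduce_to_three_states(assignment, merge_helices=True, merge_strands=True):
    """Map a DSSP-style assignment string to the three categories H, E and L.

    merge_helices: treat G (3-10 helix) and I (pi-helix) as H.
    merge_strands: treat B (beta-bridge) as E.
    Every other symbol, including '.', becomes L.
    """
    helix = {"H", "G", "I"} if merge_helices else {"H"}
    strand = {"E", "B"} if merge_strands else {"E"}
    reduced = []
    for symbol in assignment:
        if symbol in helix:
            reduced.append("H")
        elif symbol in strand:
            reduced.append("E")
        else:
            reduced.append("L")
    return "".join(reduced)
```

With both merges enabled, `reduce_to_three_states("HGIEBST.")` gives `"HHHEELLL"`. With both disabled it gives `"HLLELLLL"`. The output always has the same length as the input, so it remains a valid assignment for the same sequence.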

Example HOW file

      178 1cdy.-
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKSPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSD      80
TYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCT     160
VLQNQKKVEFKIDIVVLA
.EEEEEETTS.EEE..B..SSSS..EEEEETTS.EEEEEETTEEEE.S.TTGGGEE..GGGGGGTB..EEE.S..GGG.E      80
EEEEEETTEEEEEEEEEEEEEE.S.SEEETT..EEEEEE..TT....B..B.TTS.B..BSSEEEESS..STT.EEEEEE     160
EEETTEEEEEEEEEEEE.
      405 1phb.-
NLAPLPPHVPEHLVFDFDMYNPSNLSAGVQEAWAVLQESNVPDLVWTRCNGGHWIATRGQLIREAYEDYRHFSSECPFIP      80
REAGEAYDFIPTSMDPPEQRQFRALANQVVGMPVVDKLENRIQELACSLIESLRPQGQCNFTEDYAEPFPIRIFMLLAGL     160
PEEDIPHLKYLTDQMTRPDGSMTFAEAKEALYDYLIPIIEQRRQKPGTDAISIVANGQVNGRPITSDEAKRMCGLLLVGG     240
LDTVVNFLSFSMEFLAKSPEHRQELIERPERIPAACEELLRRFSLVADGRILTSDYEFHGVQLKKGDQILLPQMLSGLDE     320
RENACPMHVDFSRQKVSHTTFGHGSHLCLGQHLARREIIVTLKEWLTRIPDFSIAPGAQIQHKSGIVSGVQALPLVWDPA     400
TTKAV
......TTS.GGGB....TTS.TTGGG.HHHHHGGGGSTTS.SEEEE.GGG.EEEE.SHHHHHHHHH.TTTEETTS.SSS      80
HHHHHH...TTTT..TTTHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHHHHHGGGSEEEHHHHTTTHHHHHHHHHHHT.     160
.GGGHHHHHHHHHHHHS..SSS.HHHHHHHHHHHHHHHHHHHHHS..SSHHHHHHT.EETTEE..HHHHHHHHHHHHHHH     240
HHHHHHHHHHHHHHHHH.HHHHHHHHH.GGGHHHHHHHHHHHT..B..EEEESS.EEETTEEE.TT.EEE..GGGTTT.T     320
TTSSSTTS..TT.S.....TT..GGG..TTHHHHHHHHHHHHHHHHHH....EE.TT....EE.SSB.EES..EEE..GG     400
G....
      344 1hle.A
MEQLSTANTHFAVDLFRALNESDPTGNIFISPLSISSALAMIFLGTRGNTAAQVSKALYFDTVEDIHSRFQSLNADINKP      80
GAPYILKLANRLYGEKTYNFLADFLASTQKMYGAELASVDFQQAPEDARKEINEWVKGQTEGKIPELLVKGMVDNMTKLV     160
LVNAIYFKGNWQQKFMKEATRDAPFRLNKKDTKTVKMMYQKKKFPYNYIEDLKCRVLELPYQGKELSMIILLPDDIEDES     240
TGLEKIEKQLTLDKLREWTKPENLYLAEVNVHLPRFKLEESYDLTSHLARLGVQDLFNRGKADLSGMSGARDLFVSKIIH     320
KSFVDLNEEGTEAAAATAGTILLA
.HHHHHHHHHHHHHHHHHHHHH.SSS.EEE.HHHHHHHHHHHHHT..HHHHHHHHHHHTGGGSTTHHHHHHHHHHHHT.S      80
S.SSEEEEEEEEEEETT....HHHHHHHHHHH..EEEEE.TTT.HHHHHHHHHHHHHHHTTTSSS.SS.TTSS.TTEEEE     160
EEEEEEEEEEBSS...GGG.EEEEEESSSS.EEEEEEEEEEEEEEEEEEGGGTEEEEEEEBTTSSEEEEEEEESS..SSS     240
SS.HHHHHT..HHHHHHHH.GGG.EEEEEEEEEE.EEEEEEEE.HHHHHHHT..GGG.TTT...HHHHSSS.EEEEEEEE     320
EEEEEE.SSEEEEEEEEEEEEEE.
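
In the example above, every full data line carries the running residue count, right-aligned in the eight columns that follow position 80. A small helper can reproduce that layout when writing sequence or assignment data. This is a sketch only: `wrap_block` is a hypothetical name, and the eight-column counter width is inferred from the example, not taken from the man page.

```python
def wrap_block(data, width=80, number=True):
    """Split a sequence or assignment string into HOW-style data lines.

    Full lines hold exactly `width` symbols. When `number` is true, each
    full line is followed by the running position count, right-aligned in
    eight columns (an assumption based on the example file). The last,
    shorter line is written without a counter.
    """
    lines = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        if number and len(chunk) == width:
            chunk += f"{start + width:>8d}"
        lines.append(chunk)
    return lines
```

Because the how program ignores everything from position 81 on, the counter is purely a reading aid and can be switched off with `number=False`.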

Note: In the "old style" HOW format for output, the sequence length will occupy the first five positions of the header line, right adjusted. The new HOW format reserves the first six positions for the sequence length.
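
The difference between the two header styles comes down to the width of the length field. The following Python sketch writes a header in either style. `format_header` and its `old_style` flag are hypothetical names, and the checks on field widths simply restate the limits given in the format description.

```python
def format_header(length, name, comment="", old_style=False):
    """Build a HOW header line.

    The sequence length is right-adjusted in six positions (new style) or
    five positions (old style), followed by exactly one space, the sequence
    name, and an optional comment separated by whitespace.
    """
    width = 5 if old_style else 6
    if length < 0 or len(str(length)) > width:
        raise ValueError(f"length {length} does not fit in {width} positions")
    if not name or len(name) > 20 or any(ch.isspace() for ch in name):
        raise ValueError("name must be 1 to 20 characters without whitespace")
    header = f"{length:>{width}d} {name}"
    if comment:
        header += " " + comment
    return header
```

For a 178-residue entry named `1cdy.-`, the new style yields three leading spaces before `178`, and the old style yields two.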

See also

- Tab file format
- Column file format

References

[1] Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers, 22(12):2577-2637. PMID: 6667333.

External links

STRING — Search Tool for the Retrieval of Interacting Proteins


