HOW file format

From Christoph's Personal Wiki

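The HOW file format (or just HOW file) is the sequence input format of the how program. Each entry stores one nucleotide or amino acid sequence together with one assignment symbol per residue.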

Data format

Note: The following is taken directly from Søren Brunak's how man page.

For input data the how program uses a specially developed sequence format common to nucleotide and amino acid sequences. Traditionally, the files are given names with the extensions .seq or .how. The format is described below.

A HOW file consists of entries containing one sequence each. Each entry has the following three parts:

  1. Header line with the following syntax:
    • Sequence length, in 6 positions, right adjusted;
    • Exactly one SPACE;
    • Sequence name, in at most 20 positions, left adjusted;
    • Optional comment, separated from the sequence name by whitespace.
  2. Sequence data, in one-letter amino acid or nucleotide code, in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols.
  3. Assignment data, in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols. Each assignment symbol corresponds to exactly one sequence residue, so the number of assignment symbols must equal the sequence length.

Both the sequence and the assignment data lines may contain other data from position 81 onward (e.g., line numbering and comments). Such data is ignored by the how program.
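The entry layout described above can be read with a short parser. The following is a minimal sketch in Python, not the how program's own reader. The names HowEntry, parse_how and _read_block are hypothetical. As an assumption, the header is split on whitespace instead of fixed columns, so that length fields of differing width are accepted; the header length then decides how many symbols are read for the sequence and the assignment, and only the first 80 columns of each data line are used.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

LINE_WIDTH = 80  # symbols per data line; columns 81 and beyond are ignored


@dataclass
class HowEntry:  # hypothetical container, not part of the how program
    name: str
    comment: str
    sequence: str
    assignment: str


def _read_block(lines: Iterator[str], length: int) -> str:
    """Collect `length` symbols, using at most 80 columns of each line."""
    parts: List[str] = []
    remaining = length
    while remaining > 0:
        line = next(lines, None)
        if line is None:
            raise ValueError("unexpected end of file inside an entry")
        need = min(LINE_WIDTH, remaining)
        chunk = line.rstrip("\r\n")[:need]
        if len(chunk) < need:
            raise ValueError("data line shorter than expected")
        parts.append(chunk)
        remaining -= need
    return "".join(parts)


def parse_how(lines: Iterable[str]) -> Iterator[HowEntry]:
    """Yield one HowEntry per header line found in `lines`."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # assumption: blank lines between entries are skipped
        # Assumption: split on whitespace instead of fixed columns.
        fields = header.split(None, 2)
        length = int(fields[0])
        name = fields[1] if len(fields) > 1 else ""
        comment = fields[2].strip() if len(fields) > 2 else ""
        sequence = _read_block(it, length)
        assignment = _read_block(it, length)
        yield HowEntry(name, comment, sequence, assignment)
```

Typical use would be `for entry in parse_how(open("proteins.how")): ...`, where the file name is only an illustration. Because the number of assignment symbols read is taken from the header, a file whose assignment block is too short raises an error.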

Secondary structure format (DSSP)

The secondary structure format uses the DSSP assignment[1]:

G - 3-10 helix
I - pi-helix
H - alpha-helix
E - extended beta-sheet
B - beta-bridge
T - hydrogen-bonded turn
S - bend
L - other/loop
. - unassigned

Note: Prediction servers typically use just three categories: H, E, and L, where L covers everything else. Sometimes H, G, and I are merged into H, and sometimes E and B are merged into E. The PredictProtein server, for example, uses L for the rest.
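The merging described in the note can be sketched as a small mapping function. This is only an illustration: the function name to_three_state and its two flags are hypothetical, and the exact reduction differs between prediction servers. Every symbol that is not mapped to H or E becomes L.

```python
def to_three_state(assignment: str,
                   merge_helix: bool = True,
                   merge_strand: bool = True) -> str:
    """Reduce a DSSP-style assignment string to the classes H, E and L.

    Assumed reduction: H (optionally also G and I) -> H,
    E (optionally also B) -> E, every other symbol -> L.
    """
    helix = "HGI" if merge_helix else "H"
    strand = "EB" if merge_strand else "E"
    table = {symbol: "H" for symbol in helix}
    table.update({symbol: "E" for symbol in strand})
    return "".join(table.get(symbol, "L") for symbol in assignment)
```

For example, `to_three_state(".EEEEEETTS.EEE..B")` gives `LEEEEEELLLLEEELLE`. With `merge_strand=False` the final B is mapped to L instead.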

Example HOW file

      178 1cdy.-
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKSPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSD      80
TYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCT     160
VLQNQKKVEFKIDIVVLA
.EEEEEETTS.EEE..B..SSSS..EEEEETTS.EEEEEETTEEEE.S.TTGGGEE..GGGGGGTB..EEE.S..GGG.E      80
EEEEEETTEEEEEEEEEEEEEE.S.SEEETT..EEEEEE..TT....B..B.TTS.B..BSSEEEESS..STT.EEEEEE     160
EEETTEEEEEEEEEEEE.
      405 1phb.-
NLAPLPPHVPEHLVFDFDMYNPSNLSAGVQEAWAVLQESNVPDLVWTRCNGGHWIATRGQLIREAYEDYRHFSSECPFIP      80
REAGEAYDFIPTSMDPPEQRQFRALANQVVGMPVVDKLENRIQELACSLIESLRPQGQCNFTEDYAEPFPIRIFMLLAGL     160
PEEDIPHLKYLTDQMTRPDGSMTFAEAKEALYDYLIPIIEQRRQKPGTDAISIVANGQVNGRPITSDEAKRMCGLLLVGG     240
LDTVVNFLSFSMEFLAKSPEHRQELIERPERIPAACEELLRRFSLVADGRILTSDYEFHGVQLKKGDQILLPQMLSGLDE     320
RENACPMHVDFSRQKVSHTTFGHGSHLCLGQHLARREIIVTLKEWLTRIPDFSIAPGAQIQHKSGIVSGVQALPLVWDPA     400
TTKAV
......TTS.GGGB....TTS.TTGGG.HHHHHGGGGSTTS.SEEEE.GGG.EEEE.SHHHHHHHHH.TTTEETTS.SSS      80
HHHHHH...TTTT..TTTHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHHHHHGGGSEEEHHHHTTTHHHHHHHHHHHT.     160
.GGGHHHHHHHHHHHHS..SSS.HHHHHHHHHHHHHHHHHHHHHS..SSHHHHHHT.EETTEE..HHHHHHHHHHHHHHH     240
HHHHHHHHHHHHHHHHH.HHHHHHHHH.GGGHHHHHHHHHHHT..B..EEEESS.EEETTEEE.TT.EEE..GGGTTT.T     320
TTSSSTTS..TT.S.....TT..GGG..TTHHHHHHHHHHHHHHHHHH....EE.TT....EE.SSB.EES..EEE..GG     400
G....
      344 1hle.A
MEQLSTANTHFAVDLFRALNESDPTGNIFISPLSISSALAMIFLGTRGNTAAQVSKALYFDTVEDIHSRFQSLNADINKP      80
GAPYILKLANRLYGEKTYNFLADFLASTQKMYGAELASVDFQQAPEDARKEINEWVKGQTEGKIPELLVKGMVDNMTKLV     160
LVNAIYFKGNWQQKFMKEATRDAPFRLNKKDTKTVKMMYQKKKFPYNYIEDLKCRVLELPYQGKELSMIILLPDDIEDES     240
TGLEKIEKQLTLDKLREWTKPENLYLAEVNVHLPRFKLEESYDLTSHLARLGVQDLFNRGKADLSGMSGARDLFVSKIIH     320
KSFVDLNEEGTEAAAATAGTILLA
.HHHHHHHHHHHHHHHHHHHHH.SSS.EEE.HHHHHHHHHHHHHT..HHHHHHHHHHHTGGGSTTHHHHHHHHHHHHT.S      80
S.SSEEEEEEEEEEETT....HHHHHHHHHHH..EEEEE.TTT.HHHHHHHHHHHHHHHTTTSSS.SS.TTSS.TTEEEE     160
EEEEEEEEEEBSS...GGG.EEEEEESSSS.EEEEEEEEEEEEEEEEEEGGGTEEEEEEEBTTSSEEEEEEEESS..SSS     240
SS.HHHHHT..HHHHHHHH.GGG.EEEEEEEEEE.EEEEEEEE.HHHHHHHT..GGG.TTT...HHHHSSS.EEEEEEEE     320
EEEEEE.SSEEEEEEEEEEEEEE.

Note: In the "old style" HOW format for output, the sequence length occupies the first five positions of the header line, right adjusted. The new HOW format reserves the first six positions for the sequence length.
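The difference between the two header styles comes down to the width of the length field. The sketch below writes one entry and exposes that width as a parameter. The function name format_how_entry and the parameter length_width are hypothetical, and padding the name to 20 positions before a comment is an assumption. Optional data from position 81 onward, such as line numbering, is not written.

```python
def format_how_entry(name: str, sequence: str, assignment: str,
                     comment: str = "", length_width: int = 6) -> str:
    """Format one HOW entry; length_width=5 gives the old-style header."""
    if len(sequence) != len(assignment):
        raise ValueError("assignment must have one symbol per residue")
    if len(name) > 20:
        raise ValueError("sequence name is limited to 20 positions")
    # Right-adjusted length, exactly one space, then the name.
    header = f"{len(sequence):>{length_width}} {name}"
    if comment:
        # Assumption: pad the name to 20 positions before the comment.
        header = f"{len(sequence):>{length_width}} {name:<20} {comment}"
    lines = [header]
    for block in (sequence, assignment):
        # Lines of exactly 80 symbols; only the last one may be shorter.
        lines.extend(block[i:i + 80] for i in range(0, len(block), 80))
    return "\n".join(lines) + "\n"
```

Calling `format_how_entry("1abc.-", "ACDE", "HH..")` produces a header whose length field is six positions wide. Passing `length_width=5` produces the old-style header instead. The entry name here is made up for illustration.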

References

  1. Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers, 22(12):2577-2637. PMID: 6667333.

External links

  • STRING — Search Tool for the Retrieval of Interacting Proteins