HOW file format
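The HOW file format (or simply HOW file) is the input format of the how program. It stores nucleotide or amino acid sequences together with a per-residue assignment.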
Data format
Note: The following is taken directly from Søren Brunak's how man page.
For input data the how program uses a specially developed sequence format common to nucleotide and amino acid sequences. Traditionally, the files are given names with the extensions .seq or .how. The format is described below.
A HOW file consists of entries containing one sequence each. Each entry has the following three parts:
- Header line with the following syntax:
  - Sequence length, in 6 positions, right adjusted;
  - Exactly one SPACE;
  - Sequence name, in at most 20 positions, left adjusted;
  - Optional comment, separated from the sequence name by whitespace.
- Sequence data, in one-letter amino acid or nucleotide code in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols.
- Assignment data, in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols. Each assignment symbol corresponds to exactly one sequence residue, so the number of assignment symbols must equal the sequence length.
Both the sequence and assignment data lines may contain other data from position 81 onward (e.g., line numbering and comments). Such data is ignored by the how program.
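The layout described above can be read with a short parser. The following Python sketch is only an illustration and not part of the how program: the names HowEntry and parse_how are hypothetical, and it assumes the new-style six-position length field, a sequence name without embedded whitespace, and data symbols that are never blanks.

```python
from typing import List, NamedTuple, Tuple


class HowEntry(NamedTuple):
    length: int
    name: str
    comment: str
    sequence: str
    assignment: str


def _read_block(lines: List[str], start: int, length: int) -> Tuple[str, int]:
    """Collect `length` symbols from consecutive lines, starting at `start`.

    Only positions 1-80 of each line are used; anything from position 81
    onward (line numbering, comments) is ignored.
    """
    symbols = ""
    i = start
    while len(symbols) < length:
        if i >= len(lines):
            raise ValueError("file ended before the block was complete")
        # Assumption: data symbols are never blanks, so trailing
        # whitespace on a short last line can be stripped safely.
        symbols += lines[i][:80].rstrip()
        i += 1
    if len(symbols) != length:
        raise ValueError(f"expected {length} symbols, found {len(symbols)}")
    return symbols, i


def parse_how(text: str) -> List[HowEntry]:
    """Parse all entries of a new-style HOW file given as one string."""
    lines = text.splitlines()
    entries = []
    i = 0
    while i < len(lines):
        header = lines[i]
        if not header.strip():
            i += 1  # tolerate blank lines between entries (assumption)
            continue
        length = int(header[:6])          # six positions, right adjusted
        rest = header[7:].split(None, 1)  # position 7 is the single SPACE
        name = rest[0] if rest else ""
        comment = rest[1].strip() if len(rest) > 1 else ""
        sequence, i = _read_block(lines, i + 1, length)
        assignment, i = _read_block(lines, i, length)
        entries.append(HowEntry(length, name, comment, sequence, assignment))
    return entries
```

Calling parse_how on the contents of a file returns one HowEntry per sequence. A mismatch between the header length and the number of sequence or assignment symbols raises ValueError, which mirrors the requirement that both blocks match the sequence length.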
Secondary structure format (DSSP)
The secondary structure format uses the DSSP assignment[1]:
- G - 3-10 helix
- I - pi-helix
- H - alpha-helix
- E - extended beta-sheet
- B - beta-bridge
- T - hydrogen-bonded turn
- S - bend
- L - other/loop
- . - unassigned
Note: Prediction servers typically use just three categories: H, E, and L. Sometimes H, G, and I are merged into H, and sometimes E and B are merged into E. In every case L covers all remaining states, as on the PredictProtein server.
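The reduction to three categories can be sketched in Python as follows. The function name reduce_to_three_states and its two flags are hypothetical, chosen here only to expose the two optional merges mentioned in the note; individual servers may define the mapping differently.

```python
def reduce_to_three_states(assignment: str,
                           merge_helices: bool = True,
                           merge_bridges: bool = True) -> str:
    """Map a DSSP assignment string to the three categories H, E and L."""
    # Symbols counted as helix and strand; everything else becomes L.
    helix = set("HGI") if merge_helices else {"H"}
    strand = set("EB") if merge_bridges else {"E"}
    reduced = []
    for symbol in assignment:
        if symbol in helix:
            reduced.append("H")
        elif symbol in strand:
            reduced.append("E")
        else:
            reduced.append("L")
    return "".join(reduced)
```

The output has the same length as the input, so the one-symbol-per-residue correspondence with the sequence is preserved.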
Example HOW file
   178 1cdy.-
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKSPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSD 80
TYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCT 160
VLQNQKKVEFKIDIVVLA
.EEEEEETTS.EEE..B..SSSS..EEEEETTS.EEEEEETTEEEE.S.TTGGGEE..GGGGGGTB..EEE.S..GGG.E 80
EEEEEETTEEEEEEEEEEEEEE.S.SEEETT..EEEEEE..TT....B..B.TTS.B..BSSEEEESS..STT.EEEEEE 160
EEETTEEEEEEEEEEEE.
   405 1phb.-
NLAPLPPHVPEHLVFDFDMYNPSNLSAGVQEAWAVLQESNVPDLVWTRCNGGHWIATRGQLIREAYEDYRHFSSECPFIP 80
REAGEAYDFIPTSMDPPEQRQFRALANQVVGMPVVDKLENRIQELACSLIESLRPQGQCNFTEDYAEPFPIRIFMLLAGL 160
PEEDIPHLKYLTDQMTRPDGSMTFAEAKEALYDYLIPIIEQRRQKPGTDAISIVANGQVNGRPITSDEAKRMCGLLLVGG 240
LDTVVNFLSFSMEFLAKSPEHRQELIERPERIPAACEELLRRFSLVADGRILTSDYEFHGVQLKKGDQILLPQMLSGLDE 320
RENACPMHVDFSRQKVSHTTFGHGSHLCLGQHLARREIIVTLKEWLTRIPDFSIAPGAQIQHKSGIVSGVQALPLVWDPA 400
TTKAV
......TTS.GGGB....TTS.TTGGG.HHHHHGGGGSTTS.SEEEE.GGG.EEEE.SHHHHHHHHH.TTTEETTS.SSS 80
HHHHHH...TTTT..TTTHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHHHHHGGGSEEEHHHHTTTHHHHHHHHHHHT. 160
.GGGHHHHHHHHHHHHS..SSS.HHHHHHHHHHHHHHHHHHHHHS..SSHHHHHHT.EETTEE..HHHHHHHHHHHHHHH 240
HHHHHHHHHHHHHHHHH.HHHHHHHHH.GGGHHHHHHHHHHHT..B..EEEESS.EEETTEEE.TT.EEE..GGGTTT.T 320
TTSSSTTS..TT.S.....TT..GGG..TTHHHHHHHHHHHHHHHHHH....EE.TT....EE.SSB.EES..EEE..GG 400
G....
   344 1hle.A
MEQLSTANTHFAVDLFRALNESDPTGNIFISPLSISSALAMIFLGTRGNTAAQVSKALYFDTVEDIHSRFQSLNADINKP 80
GAPYILKLANRLYGEKTYNFLADFLASTQKMYGAELASVDFQQAPEDARKEINEWVKGQTEGKIPELLVKGMVDNMTKLV 160
LVNAIYFKGNWQQKFMKEATRDAPFRLNKKDTKTVKMMYQKKKFPYNYIEDLKCRVLELPYQGKELSMIILLPDDIEDES 240
TGLEKIEKQLTLDKLREWTKPENLYLAEVNVHLPRFKLEESYDLTSHLARLGVQDLFNRGKADLSGMSGARDLFVSKIIH 320
KSFVDLNEEGTEAAAATAGTILLA
.HHHHHHHHHHHHHHHHHHHHH.SSS.EEE.HHHHHHHHHHHHHT..HHHHHHHHHHHTGGGSTTHHHHHHHHHHHHT.S 80
S.SSEEEEEEEEEEETT....HHHHHHHHHHH..EEEEE.TTT.HHHHHHHHHHHHHHHTTTSSS.SS.TTSS.TTEEEE 160
EEEEEEEEEEBSS...GGG.EEEEEESSSS.EEEEEEEEEEEEEEEEEEGGGTEEEEEEEBTTSSEEEEEEEESS..SSS 240
SS.HHHHHT..HHHHHHHH.GGG.EEEEEEEEEE.EEEEEEEE.HHHHHHHT..GGG.TTT...HHHHSSS.EEEEEEEE 320
EEEEEE.SSEEEEEEEEEEEEEE.
Note: In the "old style" HOW format for output, the sequence length will occupy the first five positions of the header line, right adjusted. The new HOW format reserves the first six positions for the sequence length.
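The difference between the two header styles comes down to the width of the length field. The Python sketch below shows one way to write a header line in either style. The function format_header is hypothetical, and it assumes that the old style also separates the length from the name with exactly one SPACE, which the note above does not state explicitly.

```python
def format_header(length: int, name: str, comment: str = "",
                  old_style: bool = False) -> str:
    """Build a HOW header line in the new (6-position) or old (5-position) style."""
    width = 5 if old_style else 6
    if len(name) > 20:
        raise ValueError("sequence name may occupy at most 20 positions")
    if len(str(length)) > width:
        raise ValueError("sequence length does not fit the length field")
    # Right-adjust the length, then one SPACE, then the left-adjusted name.
    header = f"{length:>{width}} {name}"
    if comment:
        header += " " + comment
    return header
```

For example, format_header(178, "1cdy.-") pads the length to six positions, while format_header(178, "1cdy.-", old_style=True) pads it to five.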
References
- Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers, 22(12):2577-2637. PMID: 6667333.
External links
- STRING — Search Tool for the Retrieval of Interacting Proteins