HOW file format
Latest revision as of 03:07, 24 June 2007
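The HOW file format (or just HOW file) is a sequence format, common to nucleotide and amino acid sequences, used as input by the how program.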
Data format
Note: The following is taken directly from Søren Brunak's how man page.
For input data the how program uses a specially developed sequence format common to nucleotide and amino acid sequences. Traditionally, the files are given names with the extensions .seq or .how. The format is described below.
A HOW file consists of entries containing one sequence each. Each entry has the following three parts:
- Header line with the following syntax:
  - Sequence length, in 6 positions, right adjusted;
  - Exactly one SPACE;
  - Sequence name, in at most 20 positions, left adjusted;
  - Optional comment, separated from the sequence name by whitespace.
- Sequence data, in one-letter amino acid or nucleotide code in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols.
- Assignment data, in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols. Each assignment symbol corresponds to exactly one sequence residue, so the number of assignment symbols must equal the sequence length.
Both the sequence and assignment data lines may contain other data on the positions from 81 on (e.g., line numbering and comments). Such data is ignored by the how program.
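The entry layout described above can be read with a few lines of Python. This is a minimal sketch, not the how program's own reader: the names `HowEntry`, `parse_how`, and `_read_block` are hypothetical, and the header is split on whitespace (rather than on fixed columns) as an assumption, so that both five- and six-position length fields are accepted.

```python
import math
from typing import Iterator, List, NamedTuple


class HowEntry(NamedTuple):
    """One HOW entry (hypothetical container, not part of the how program)."""
    name: str
    comment: str
    sequence: str
    assignment: str


def _read_block(lines: List[str], start: int, length: int) -> str:
    """Join the data lines of one block, keeping only positions 1-80."""
    n_lines = math.ceil(length / 80)
    # Anything from position 81 on (line numbering, comments) is ignored.
    chunks = [line[:80].rstrip() for line in lines[start:start + n_lines]]
    block = "".join(chunks)
    if len(block) != length:
        raise ValueError(f"expected {length} symbols, found {len(block)}")
    return block


def parse_how(text: str) -> Iterator[HowEntry]:
    """Yield the entries of a HOW file given as one string."""
    # Assumption: blank lines carry no data and can be skipped.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        fields = lines[i].split(None, 2)  # length, name, optional comment
        if len(fields) < 2:
            raise ValueError(f"malformed header line: {lines[i]!r}")
        length = int(fields[0])
        name = fields[1]
        comment = fields[2].strip() if len(fields) > 2 else ""
        n_lines = math.ceil(length / 80)
        sequence = _read_block(lines, i + 1, length)
        assignment = _read_block(lines, i + 1 + n_lines, length)
        yield HowEntry(name, comment, sequence, assignment)
        i += 1 + 2 * n_lines
```

Calling `list(parse_how(open("example.how").read()))` on a file (the file name is only an illustration) would return one `HowEntry` per sequence. The length check in `_read_block` enforces the rule that the number of assignment symbols must equal the sequence length.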
Secondary structure format (DSSP)
The secondary structure format uses the DSSP assignment[1]:
G - 3-10 helix
I - pi-helix
H - alpha-helix
E - extended beta-sheet
B - beta-bridge
S - bend
L - other/loop
. - unassigned
Note: Prediction servers typically use just three categories: H, E, and L, where L covers everything else. Sometimes G and I are merged into H, and sometimes B is merged into E; L then means the rest, as on the PredictProtein server.
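The reduction to three categories mentioned in the note can be sketched as a simple symbol mapping. The function name `to_three_state` and its two flags are hypothetical; which merges a given server applies is not specified here, so both are left as options, and every symbol that is not a helix or strand symbol is mapped to L ("the rest").

```python
def to_three_state(assignment: str,
                   merge_helices: bool = True,
                   merge_strands: bool = True) -> str:
    """Reduce a DSSP assignment string to the three categories H, E, and L."""
    # Optionally fold G and I into H, and B into E.
    helix = set("HGI") if merge_helices else {"H"}
    strand = set("EB") if merge_strands else {"E"}
    return "".join(
        "H" if symbol in helix else "E" if symbol in strand else "L"
        for symbol in assignment
    )
```

The output has the same length as the input, so it can be used directly as the assignment data of a HOW entry.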
Example HOW file
178 1cdy.-
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKSPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSD 80
TYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCT 160
VLQNQKKVEFKIDIVVLA
.EEEEEETTS.EEE..B..SSSS..EEEEETTS.EEEEEETTEEEE.S.TTGGGEE..GGGGGGTB..EEE.S..GGG.E 80
EEEEEETTEEEEEEEEEEEEEE.S.SEEETT..EEEEEE..TT....B..B.TTS.B..BSSEEEESS..STT.EEEEEE 160
EEETTEEEEEEEEEEEE.
405 1phb.-
NLAPLPPHVPEHLVFDFDMYNPSNLSAGVQEAWAVLQESNVPDLVWTRCNGGHWIATRGQLIREAYEDYRHFSSECPFIP 80
REAGEAYDFIPTSMDPPEQRQFRALANQVVGMPVVDKLENRIQELACSLIESLRPQGQCNFTEDYAEPFPIRIFMLLAGL 160
PEEDIPHLKYLTDQMTRPDGSMTFAEAKEALYDYLIPIIEQRRQKPGTDAISIVANGQVNGRPITSDEAKRMCGLLLVGG 240
LDTVVNFLSFSMEFLAKSPEHRQELIERPERIPAACEELLRRFSLVADGRILTSDYEFHGVQLKKGDQILLPQMLSGLDE 320
RENACPMHVDFSRQKVSHTTFGHGSHLCLGQHLARREIIVTLKEWLTRIPDFSIAPGAQIQHKSGIVSGVQALPLVWDPA 400
TTKAV
......TTS.GGGB....TTS.TTGGG.HHHHHGGGGSTTS.SEEEE.GGG.EEEE.SHHHHHHHHH.TTTEETTS.SSS 80
HHHHHH...TTTT..TTTHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHHHHHGGGSEEEHHHHTTTHHHHHHHHHHHT. 160
.GGGHHHHHHHHHHHHS..SSS.HHHHHHHHHHHHHHHHHHHHHS..SSHHHHHHT.EETTEE..HHHHHHHHHHHHHHH 240
HHHHHHHHHHHHHHHHH.HHHHHHHHH.GGGHHHHHHHHHHHT..B..EEEESS.EEETTEEE.TT.EEE..GGGTTT.T 320
TTSSSTTS..TT.S.....TT..GGG..TTHHHHHHHHHHHHHHHHHH....EE.TT....EE.SSB.EES..EEE..GG 400
G....
344 1hle.A
MEQLSTANTHFAVDLFRALNESDPTGNIFISPLSISSALAMIFLGTRGNTAAQVSKALYFDTVEDIHSRFQSLNADINKP 80
GAPYILKLANRLYGEKTYNFLADFLASTQKMYGAELASVDFQQAPEDARKEINEWVKGQTEGKIPELLVKGMVDNMTKLV 160
LVNAIYFKGNWQQKFMKEATRDAPFRLNKKDTKTVKMMYQKKKFPYNYIEDLKCRVLELPYQGKELSMIILLPDDIEDES 240
TGLEKIEKQLTLDKLREWTKPENLYLAEVNVHLPRFKLEESYDLTSHLARLGVQDLFNRGKADLSGMSGARDLFVSKIIH 320
KSFVDLNEEGTEAAAATAGTILLA
.HHHHHHHHHHHHHHHHHHHHH.SSS.EEE.HHHHHHHHHHHHHT..HHHHHHHHHHHTGGGSTTHHHHHHHHHHHHT.S 80
S.SSEEEEEEEEEEETT....HHHHHHHHHHH..EEEEE.TTT.HHHHHHHHHHHHHHHTTTSSS.SS.TTSS.TTEEEE 160
EEEEEEEEEEBSS...GGG.EEEEEESSSS.EEEEEEEEEEEEEEEEEEGGGTEEEEEEEBTTSSEEEEEEEESS..SSS 240
SS.HHHHHT..HHHHHHHH.GGG.EEEEEEEEEE.EEEEEEEE.HHHHHHHT..GGG.TTT...HHHHSSS.EEEEEEEE 320
EEEEEE.SSEEEEEEEEEEEEEE.
Note: In the "old style" HOW format for output, the sequence length will occupy the first five positions of the header line, right adjusted. The new HOW format reserves the first six positions for the sequence length.
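The difference between the two header styles comes down to the width of the length field. The sketch below writes one entry in either style; `format_how_entry` and its `old_style` flag are hypothetical names, and padding the name to 20 positions before a comment is an assumption based on the header syntax given earlier.

```python
def format_how_entry(name: str, sequence: str, assignment: str,
                     comment: str = "", old_style: bool = False) -> str:
    """Format one HOW entry as text (illustrative sketch only)."""
    if len(sequence) != len(assignment):
        raise ValueError("assignment must have one symbol per residue")
    if len(name) > 20 or any(ch.isspace() for ch in name):
        raise ValueError("name must be at most 20 positions, no whitespace")

    # Old style: length in 5 positions; new style: 6 positions.
    width = 5 if old_style else 6
    header = f"{len(sequence):>{width}d} {name:<20}"
    if comment:
        header += " " + comment
    header = header.rstrip()

    def wrap(data: str) -> list:
        # Lines of exactly 80 symbols; only the last one may be shorter.
        return [data[i:i + 80] for i in range(0, len(data), 80)]

    return "\n".join([header, *wrap(sequence), *wrap(assignment)]) + "\n"
```

For example, `format_how_entry("1cdy.-", "KKV", ".EE")` produces a header with the length right adjusted in six positions, while passing `old_style=True` uses five.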
See also
- Tab file format
- Column file format
References
1. Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers, 22(12):2577-2637. PMID: 6667333.
External links
- STRING (http://string.embl.de/): Search Tool for the Retrieval of Interacting Proteins