Difference between revisions of "HOW file format"
Latest revision as of 03:07, 24 June 2007
The HOW file format (or just HOW file) is the sequence format read as input by the how program.
Data format
Note: The following is taken directly from Søren Brunak's how man page.
For input data, the how program uses a specially developed sequence format common to nucleotide and amino acid sequences. Traditionally, the files are given names with the extensions .seq or .how. The format is described below.
A HOW file consists of entries containing one sequence each. Each entry has the following three parts:
1. Header line with the following syntax:
   - Sequence length, in 6 positions, right adjusted;
   - Exactly one SPACE;
   - Sequence name, in at most 20 positions, left adjusted;
   - Optional comment, separated from the sequence name by whitespace.
2. Sequence data, in one-letter amino acid or nucleotide code, in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols.
3. Assignment data, in lines of exactly 80 symbols. Only the last line may be shorter than 80 symbols. Each symbol in the assignment data corresponds to exactly one sequence residue, so the number of assignment symbols must equal the sequence length.
Both the sequence and assignment data lines may contain other data from position 81 onward (e.g., line numbering and comments). Such data is ignored by the how program.
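The entry layout described above can be sketched as a small reader. This is only an illustration written from the description on this page, not the how program's own code; the names HowEntry, parse_how and _read_block are hypothetical. It assumes the new-style header (length in the first six positions) and takes symbols only from positions 1 to 80 of each data line.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HowEntry:
    """One entry of a HOW file (hypothetical container, not part of how)."""
    name: str
    comment: str
    sequence: str
    assignment: str


def _read_block(lines: List[str], start: int, length: int) -> Tuple[str, int]:
    """Collect `length` symbols from consecutive lines, at most 80 per line.

    Anything from position 81 onward (line numbering, comments) is dropped.
    Returns the collected symbols and the index of the next unread line.
    """
    chunks = []
    remaining = length
    i = start
    while remaining > 0:
        if i >= len(lines):
            raise ValueError("unexpected end of file inside an entry")
        take = min(80, remaining)
        chunk = lines[i][:take]
        if len(chunk) < take:
            raise ValueError(f"line {i + 1} is shorter than expected")
        chunks.append(chunk)
        remaining -= take
        i += 1
    return "".join(chunks), i


def parse_how(text: str) -> List[HowEntry]:
    """Parse new-style HOW text into a list of entries."""
    lines = text.splitlines()
    entries = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate blank lines between entries (an assumption)
            continue
        header = lines[i]
        length = int(header[:6])              # 6 positions, right adjusted
        fields = header[7:].split(None, 1)    # position 7 is the single SPACE
        name = fields[0] if fields else ""
        comment = fields[1].strip() if len(fields) > 1 else ""
        sequence, i = _read_block(lines, i + 1, length)
        assignment, i = _read_block(lines, i, length)
        entries.append(HowEntry(name, comment, sequence, assignment))
    return entries
```

Because each block is read by counting symbols against the header length, the sequence and assignment are guaranteed to have equal length whenever parsing succeeds; a file whose header length disagrees with its data raises an error or misreads the following lines, so a stricter reader might add further checks.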
Secondary structure format (DSSP)
The secondary structure format uses the DSSP assignment[1]:
G - 3-10 helix
I - pi-helix
H - alpha-helix
E - extended beta-sheet
B - beta-bridge
T - hydrogen-bonded turn
S - bend
L - other/loop
. - unassigned
Note: Prediction servers typically use just three categories, H, E, and L, where L covers everything else. Sometimes H, G, and I are merged into H, and sometimes E and B are merged into E; L then means the rest, as on the PredictProtein server (http://www.predictprotein.org/).
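The reduction to three categories mentioned in the note can be sketched as follows. The function name to_three_state and its two flags are hypothetical; they only illustrate the merging options the note describes and do not reproduce the exact rule of any particular server.

```python
def to_three_state(assignment: str,
                   merge_helices: bool = True,
                   merge_strands: bool = True) -> str:
    """Reduce a DSSP-style assignment string to the categories H, E and L.

    merge_helices: if True, G and I are counted as H; otherwise only H is.
    merge_strands: if True, B is counted as E; otherwise only E is.
    Every other symbol (including '.', S, T and L) becomes L.
    """
    helix = set("HGI") if merge_helices else {"H"}
    strand = set("EB") if merge_strands else {"E"}
    reduced = []
    for symbol in assignment:
        if symbol in helix:
            reduced.append("H")
        elif symbol in strand:
            reduced.append("E")
        else:
            reduced.append("L")
    return "".join(reduced)
```

The output always has the same length as the input, so a reduced assignment can replace the original one in a HOW entry without changing the sequence length in the header.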
Example HOW file
   178 1cdy.-
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKSPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSD 80
TYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCT 160
VLQNQKKVEFKIDIVVLA
.EEEEEETTS.EEE..B..SSSS..EEEEETTS.EEEEEETTEEEE.S.TTGGGEE..GGGGGGTB..EEE.S..GGG.E 80
EEEEEETTEEEEEEEEEEEEEE.S.SEEETT..EEEEEE..TT....B..B.TTS.B..BSSEEEESS..STT.EEEEEE 160
EEETTEEEEEEEEEEEE.
   405 1phb.-
NLAPLPPHVPEHLVFDFDMYNPSNLSAGVQEAWAVLQESNVPDLVWTRCNGGHWIATRGQLIREAYEDYRHFSSECPFIP 80
REAGEAYDFIPTSMDPPEQRQFRALANQVVGMPVVDKLENRIQELACSLIESLRPQGQCNFTEDYAEPFPIRIFMLLAGL 160
PEEDIPHLKYLTDQMTRPDGSMTFAEAKEALYDYLIPIIEQRRQKPGTDAISIVANGQVNGRPITSDEAKRMCGLLLVGG 240
LDTVVNFLSFSMEFLAKSPEHRQELIERPERIPAACEELLRRFSLVADGRILTSDYEFHGVQLKKGDQILLPQMLSGLDE 320
RENACPMHVDFSRQKVSHTTFGHGSHLCLGQHLARREIIVTLKEWLTRIPDFSIAPGAQIQHKSGIVSGVQALPLVWDPA 400
TTKAV
......TTS.GGGB....TTS.TTGGG.HHHHHGGGGSTTS.SEEEE.GGG.EEEE.SHHHHHHHHH.TTTEETTS.SSS 80
HHHHHH...TTTT..TTTHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHHHHHGGGSEEEHHHHTTTHHHHHHHHHHHT. 160
.GGGHHHHHHHHHHHHS..SSS.HHHHHHHHHHHHHHHHHHHHHS..SSHHHHHHT.EETTEE..HHHHHHHHHHHHHHH 240
HHHHHHHHHHHHHHHHH.HHHHHHHHH.GGGHHHHHHHHHHHT..B..EEEESS.EEETTEEE.TT.EEE..GGGTTT.T 320
TTSSSTTS..TT.S.....TT..GGG..TTHHHHHHHHHHHHHHHHHH....EE.TT....EE.SSB.EES..EEE..GG 400
G....
   344 1hle.A
MEQLSTANTHFAVDLFRALNESDPTGNIFISPLSISSALAMIFLGTRGNTAAQVSKALYFDTVEDIHSRFQSLNADINKP 80
GAPYILKLANRLYGEKTYNFLADFLASTQKMYGAELASVDFQQAPEDARKEINEWVKGQTEGKIPELLVKGMVDNMTKLV 160
LVNAIYFKGNWQQKFMKEATRDAPFRLNKKDTKTVKMMYQKKKFPYNYIEDLKCRVLELPYQGKELSMIILLPDDIEDES 240
TGLEKIEKQLTLDKLREWTKPENLYLAEVNVHLPRFKLEESYDLTSHLARLGVQDLFNRGKADLSGMSGARDLFVSKIIH 320
KSFVDLNEEGTEAAAATAGTILLA
.HHHHHHHHHHHHHHHHHHHHH.SSS.EEE.HHHHHHHHHHHHHT..HHHHHHHHHHHTGGGSTTHHHHHHHHHHHHT.S 80
S.SSEEEEEEEEEEETT....HHHHHHHHHHH..EEEEE.TTT.HHHHHHHHHHHHHHHTTTSSS.SS.TTSS.TTEEEE 160
EEEEEEEEEEBSS...GGG.EEEEEESSSS.EEEEEEEEEEEEEEEEEEGGGTEEEEEEEBTTSSEEEEEEEESS..SSS 240
SS.HHHHHT..HHHHHHHH.GGG.EEEEEEEEEE.EEEEEEEE.HHHHHHHT..GGG.TTT...HHHHSSS.EEEEEEEE 320
EEEEEE.SSEEEEEEEEEEEEEE.
Note: In the "old style" HOW format for output, the sequence length will occupy the first five positions of the header line, right adjusted. The new HOW format reserves the first six positions for the sequence length.
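The difference between the two header styles can be illustrated with a small formatter and a width-independent reader. The names format_header and parse_header are hypothetical, and the sketch assumes that an old-style header differs from a new-style one only in the width of the length field (five positions instead of six), which is all this page states about it.

```python
from typing import Tuple


def format_header(length: int, name: str, comment: str = "",
                  old_style: bool = False) -> str:
    """Build a HOW header line.

    The length is right adjusted in 6 positions (new style) or 5 positions
    (old style), followed by exactly one space and the sequence name.
    """
    width = 5 if old_style else 6
    if len(str(length)) > width:
        raise ValueError("sequence length does not fit in the length field")
    if len(name) > 20:
        raise ValueError("sequence name is longer than 20 positions")
    header = f"{length:>{width}} {name}"
    if comment:
        header += " " + comment
    return header


def parse_header(line: str) -> Tuple[int, str, str]:
    """Read length, name and optional comment from either header style.

    Splitting on whitespace avoids depending on the field width; this
    assumes the length is always followed by whitespace.
    """
    fields = line.split(None, 2)
    if len(fields) < 2:
        raise ValueError("header needs a length and a sequence name")
    length = int(fields[0])
    name = fields[1]
    comment = fields[2].strip() if len(fields) > 2 else ""
    return length, name, comment
```

Reading by whitespace rather than by fixed columns is a design choice for tolerance: it accepts both styles, at the cost of not detecting a header whose length field is misaligned.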
See also
- Tab file format
- Column file format
References
1. Kabsch W, Sander C (1983). "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features". Biopolymers, 22(12):2577-2637. PMID: 6667333.
External links
- STRING (http://string.embl.de/): Search Tool for the Retrieval of Interacting Proteins