SEGUID
The SEquence Globally Unique IDentifier (SEGUID) Proteome Database assigns each protein a unique sequence identifier based on the Secure Hash Algorithm (SHA-1) digest of its primary sequence. It was created because our bioinformatics, analytical, and high-throughput proteomics pipelines suffered from changing and disappearing protein identifiers. A SEGUID is stable for the lifetime of a protein and serves as the central identifier, while all other aliases are treated as dynamic properties. Anyone can derive the same SEGUID from the sequence alone, which makes data sharing easy. Using SEGUIDs keeps proteomics data resilient to changes in annotation databases, and the reports generated reflect the most recent annotations collected from sequence databases. The SEGUID website provides a number of web applications and web services, which are described in this manuscript. The FTP site provides pre-calculated data, FASTA files, alias tables, and sample programs that show how other applications can consume the web services.
SEGUID is meant to replace the 64-bit Cyclic Redundancy Check (CRC64).
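Because the identifier depends only on the sequence, it can be recomputed locally. The following Python sketch illustrates the idea under stated assumptions: the text above only says that a SEGUID is based on the SHA-1 digest, so the exact encoding used here (uppercase the sequence, base64-encode the 20-byte digest, strip the trailing "=" padding) is an assumption, as is the whitespace removal, and the function name `seguid` is chosen for illustration only.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a SEGUID-style identifier for a protein sequence.

    Assumed encoding (not spelled out in the text): uppercase the
    sequence, take its SHA-1 digest, base64-encode the 20 raw bytes,
    and strip the trailing "=" padding, leaving 27 characters.
    """
    # Assumption: remove whitespace so line-wrapped FASTA sequences
    # hash to the same value as the unwrapped sequence.
    cleaned = "".join(sequence.split()).upper()
    digest = hashlib.sha1(cleaned.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    # A single substitution yields an unrelated identifier.
    print(seguid("QSALTQPASVSGSPGQ"))
    print(seguid("QSALTQPASVSGSPGW"))
```

Under these assumptions, two parties holding the same sequence obtain the same string without consulting any database, which is the property that lets the SEGUID act as a stable key.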
Example
Below are two nearly identical immunoglobulin fragments; they differ only at the positions marked with "*". In each header line, the SEGUID is the third "|"-separated field after the ">" sign.
>gnl|sha|BpBeDdcNUYNsdk46JoJdw7Pd3BI|immunoglobulin lambda light chain variable region [Homo sapiens]
QSALTQPASVSGSPGQSITISCTGTSSDVGSYNLVSWYQQHPGKAPKLMIYEGSKRPSGV
SNRFSGSKSGNTASLTISGLQAEDEADYYCSSYAGSSTLVFGGGTKLTVL
                              *       *
>gnl|sha|X5XEaayob1nZLOc7eVT9qyczarY|immunoglobulin lambda light chain variable region [Homo sapiens]
QSALTQPASVSGSPGQSITISCTGTSSDVGSYNLVSWYQQHPGKAPKLMIYEGSKRPSGV
SNRFSGSKSGNTASLTISGLQAEDEADYYCCSYAGSSTWVFGGGTKLTVL
                              *       *
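The header layout used in the example can be split mechanically. The sketch below is illustrative only: the names `ShaHeader` and `parse_header` are hypothetical, and the four-field layout (database, tag, SEGUID, description) is inferred from the two headers shown above rather than from a formal specification.

```python
from typing import NamedTuple


class ShaHeader(NamedTuple):
    database: str      # first field, "gnl" in the example
    tag: str           # second field, "sha" in the example
    seguid: str        # third field, the SEGUID itself
    description: str   # remainder of the header line


def parse_header(line: str) -> ShaHeader:
    """Split a '>gnl|sha|<SEGUID>|<description>' FASTA header line."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Split at most three times so a "|" inside the description survives.
    fields = line[1:].rstrip().split("|", 3)
    if len(fields) != 4:
        raise ValueError("expected four '|'-separated fields")
    return ShaHeader(*fields)


if __name__ == "__main__":
    header = parse_header(
        ">gnl|sha|BpBeDdcNUYNsdk46JoJdw7Pd3BI|immunoglobulin lambda "
        "light chain variable region [Homo sapiens]"
    )
    print(header.seguid)
```

Splitting on "|" is safe for the identifier field if the SEGUID uses the standard base64 alphabet, since that alphabet contains "+" and "/" but not "|".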