Comparative genomics

From Christoph's Personal Wiki

Comparative genomics is the study of relationships between the genomes of different species or strains. It attempts to use the information provided by the signatures of selection to understand the function of genomes and the evolutionary processes that act on them. While it is still a young field, it holds great promise to yield insights into many aspects of the evolution of modern species. The sheer amount of information contained in modern genomes (several gigabytes in the case of humans) means that the methods of comparative genomics are mostly computational in nature. Gene finding is an important application of comparative genomics, as is the discovery of new, non-coding functional elements of the genome.

I have done quite a lot of research in this field.


Bacterial genomes come in various shapes (linear vs. circular) and number of molecules (up to three chromosomes can be present). In addition, plasmids may be present.

  • The first complete genome to be sequenced was not that of a bacterium, but rather of a bacterial virus: bacteriophage ΦX174 (pronounced 'fie ex one seven four'), a phage infecting E. coli [Accession #: J02482]. (Note: phage ΦX174 packages its DNA as single-stranded DNA (ssDNA) in virion particles, so the virion contains only the positive strand.) Its genome sequence was published in 1977 (Sanger et al., 1977),[1] with a revised sequence following in 1978 (Sanger et al., 1978). This major achievement was accomplished by subcloning [i.e., using restriction enzymes, including PstI, to cut the sequence into smaller pieces] and sequencing mapped fragments, after which the genome was pieced together. The sequence was produced using a method developed by Fred Sanger, based on the incorporation, into a synthetic DNA strand produced with DNA polymerase, of nucleotides with a missing 3'OH group (dideoxynucleotides). This meant that in a subset of products the next nucleotide could not be attached, resulting in termination of the product. Products were then separated by gel electrophoresis, and visualization of the bands allowed the sequence to be 'read'. By using a small amount of one dideoxynucleotide in each of four separate reactions (one per base), together with radioactive 32P labeling, it was possible to run four lanes on the gel and resolve fragments of various lengths. By tedious sequencing of subclones, the 5386 bp DNA sequence could eventually be pieced together. It took more than a year to sequence the ΦX174 genome. Sanger shared the 1980 Nobel Prize in Chemistry with Paul Berg and Walter Gilbert for this work; it was his second, the first having been awarded in 1958 for his work on protein structure.
Large collaborating teams made use of improved sequencing methodology to sequence within a year the Bacillus subtilis genome (Kunst et al., 1997) and the first Escherichia coli genome (Blattner et al., 1997), which were amongst the earliest bacterial genomes to be published.
  • To sequence the complete ΦX174 DNA, it was digested with restriction enzymes and the individual fragments were cloned (i.e., inserted in a vector replicating in E. coli) and sequenced; the use of overlapping fragments allowed the genome to be assembled. The term shotgun cloning was coined when a library of DNA fragments of varying (but defined) length and identity was cloned in individual vectors. In either case, colonies were selected after transformation; putative positive clones would produce a white color on the plates, allowing for relatively fast screening. Very large Petri plates were needed (at first cafeteria trays were used) to clone the fetal gamma-globin gene from the entire human genome (Blattner et al., 1978). Within a few years, the shotgun cloning method had been applied to sequencing, called shotgun DNA sequencing (Messing et al., 1981). Another breakthrough in genome sequencing was to produce a library of cloned fragments from a complete chromosome and sequence these "at random". The challenge was to assemble all these short sequences into a full chromosome, for which novel assembly programs had to be developed. This technology set the stage for sequencing larger DNA molecules, eventually including bacterial genomes.
  • A few years after the ΦX174 genome was published, scientists at the Los Alamos National Laboratory in New Mexico began discussing the sequencing of the human genome. At the time the plan was highly ambitious: if it took a thousand years to sequence a bacterial genome, it would take a million years to sequence the 3 billion bp human genome. Clearly the speed of sequencing was the limiting step. The U.S. Department of Energy decided to invest in technology to increase the speed of sequencing, with the goal of eventually being able to sequence the human genome in a practical time span. The Human Genome Sequencing project started in 1985, with a goal of investing $200,000,000 per year in technology to improve the speed of sequencing.
  • The first human genome cost $3 billion U.S. and took 15 years to finish; the second human genome (Craig Venter's genome, sequenced by Celera) cost a "mere" $100 million and took 9 months to finish; James Watson's genome took only 2 months and cost $900,000.
  • The first bacterial genome sequence to be published, in 1995, was that of Haemophilus influenzae, an opportunistic human pathogen (Fleischmann et al., 1995, U.S. patent number 6,528,289). This species has a relatively small genome of 1.8 megabases (Mb) and sequencing the shotgun clones took approximately a year to complete.
  • The year 1995 also saw the second bacterial genome published, that of Mycoplasma genitalium (Fraser et al., 1995, U.S. patent number 6,537,773), an intracellular human pathogen that is sexually transmitted. With this publication the field of comparative bacterial genomics was born: the two genome sequences were naturally compared and contrasted. With only 580,000 bp, the genome of M. genitalium is among the smaller bacterial genomes. However, once again, things look different now, since even smaller genomes have been sequenced since, such as that of Nanoarchaeum equitans (a parasitic archaeon living together with another archaeon at extremely high temperatures), which has only 490,000 bp, and that of Carsonella ruddii, which is only 160,000 bp.
  • The first archaeon was sequenced in 1996 (Methanocaldococcus jannaschii, a methane-producing thermophile) and nonpathogenic bacteria soon followed (as mentioned above, B. subtilis and E. coli K-12 were both published in 1997). A novelty was the publication of a second genome for one bacterial species, in 1999. The honor went to Helicobacter pylori, a pathogen living in the human stomach.
  • Like E. coli, most bacteria that we know of have a circular chromosome, but some have a linear chromosome (like Borrelia burgdorferi, the causative agent of Lyme disease). Vibrio cholerae (causing cholera) has two chromosomes, some Burkholderia species (marine bacteria) have three, and it is possible that there are bacteria out there with four or even more chromosomes.
  • Restriction enzymes and other enzymes used for DNA manipulation are usually named using a three-letter code, with the first letter of the genus name (thus it is upper case) and the next two for the species name. These letters are italic since the full name is also printed in italics, and can be followed by numbers or letters in roman text, e.g., EcoRI (derived from E. coli) or HindIII (from H. influenzae).

—Source: David Wayne Ussery, Trudy M. Wassenaar, Stefano Borini (2009). "Computing for Comparative Microbial Genomics: Bioinformatics for Microbiologists (Computational Biology)". Springer. ISBN 978-1-849-96763-1.
AT skew 
is a measure of the bias of A's towards one strand (and T's towards the other).
For some bacteria, the A's are biased towards the replication leading strand, but in other bacterial chromosomes, including E. coli, which this phage normally infects, the A's are biased towards the replication lagging strand.
intrinsic curvature, stacking energy, and position preference 
parameters that provide important insights into the physical and mechanical properties of the DNA molecule, which affect how the molecule is folded. This in turn can potentially influence gene expression, the likelihood of genome rearrangements and even the occurrence of evolutionary hotspots.
inverted repeat 
the same piece of DNA (read from 5’ to 3’) is repeated on the opposite strand. When the repeated sequences are found relatively close to each other (within 100 bp), they are called local repeats.
palindromic repeats (or palindromes) 
are also inverted repeats, but now the inverted repeat is in fact the complement of the original repeat unit. Palindromes are a special kind of local inverted repeats.
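The AT skew and palindrome definitions above can be illustrated with a short Python sketch. It assumes the commonly used skew formula (A - T) / (A + T) computed on a single strand, which the text does not spell out, and it treats a palindrome as a sequence that equals its own reverse complement. The function names are hypothetical and chosen only for this illustration.

```python
# Minimal sketch, not a reference implementation.
# Assumption: AT skew = (A - T) / (A + T), counted on the given strand.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def at_skew(seq: str) -> float:
    """Return the AT skew of seq; 0.0 if it contains no A or T."""
    seq = seq.upper()
    a = seq.count("A")
    t = seq.count("T")
    if a + t == 0:
        return 0.0
    return (a - t) / (a + t)


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindrome(seq: str) -> bool:
    """True if seq reads the same 5' to 3' on both strands."""
    return seq.upper() == reverse_complement(seq)


if __name__ == "__main__":
    print(at_skew("AAAT"))          # more A than T on this strand
    print(is_palindrome("GAATTC"))  # the EcoRI recognition site
```

A positive skew means A's outnumber T's on the strand supplied; computing the value in sliding windows along a chromosome is one way to see where the bias changes sign.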
accession number 
The GenBank accession number (often simply referred to as accession number) is a primary key that uniquely identifies a sequence entry in GenBank. Accession numbers are shared by EMBL and DDBJ, so they are truly unique and can be used for information retrieval in all three databases. They usually have the format of two letters followed by six digits (AB123456), although older accession numbers can be a bit shorter: one letter followed by five digits (A12345). Although a primary key should ideally not change, an accession number is often followed by a period and a version number (e.g., AB123456.1), so if a sequence is revised by the authors who submitted it, it is given a new version number but the rest of the accession number remains constant. Thus, for example, there are currently three different versions of the E. coli strain K-12, isolate MG1655, genome sequence: U00096.1 contains 4288 annotated genes, U00096.2 contains 4254 genes, and the most recent version, U00096.3, contains 4331 annotated genes.
RefSeq number
Some genomes in NCBI also have a special accession number called RefSeq number, and this is not the same as the GenBank accession number. To define RefSeq entries, sequences entered in GenBank are extracted, annotated and hand-curated, and put back in the database with the RefSeq number as a new identification key. Thus, every entry with a RefSeq number will have a GenBank accession number, but not the other way round. The format is two letters followed by an underscore and six digits (AB_123456); the first two letters refer to the type of sequence, using codes like NT for contigs, NM for cDNA sequences constructed from mRNA, NP for proteins, NC for chromosomes and plasmids. NZ is for shotgun unfinished sequences, which have a slightly different format: NZ followed by four letters and eight digits (e.g., NZ_ABCD12345678).

Genome Project ID (PID) 
each genome has been assigned a number. The PID overcomes the problem encountered with genomes containing more than one DNA segment (e.g., a chromosome and a plasmid, each with their own unique accession number). In addition, biological information is stored here to give more background on the strain that was sequenced.
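
The identifier formats described above can be checked with a few regular expressions. The following Python sketch covers only the formats named in the text (two letters plus six digits, one letter plus five digits, the RefSeq underscore form, and the NZ shotgun form); real databases accept further variants, so this is an illustration and not a validator. The names classify_accession and split_version are hypothetical.

```python
import re

# Patterns for the formats described in the text only (an assumption:
# newer, longer accession formats are deliberately not handled here).
_VERSION = r"(?:\.\d+)?"
_GENBANK = re.compile(r"^(?:[A-Z]{2}\d{6}|[A-Z]\d{5})" + _VERSION + "$")
_REFSEQ = re.compile(r"^[A-Z]{2}_\d{6}" + _VERSION + "$")
_REFSEQ_SHOTGUN = re.compile(r"^NZ_[A-Z]{4}\d{8}" + _VERSION + "$")


def classify_accession(acc: str):
    """Return 'genbank', 'refseq', 'refseq-shotgun', or None."""
    if _REFSEQ_SHOTGUN.match(acc):
        return "refseq-shotgun"
    if _REFSEQ.match(acc):
        return "refseq"
    if _GENBANK.match(acc):
        return "genbank"
    return None


def split_version(acc: str):
    """Split 'U00096.3' into ('U00096', 3); version is None if absent."""
    base, dot, version = acc.partition(".")
    return base, (int(version) if dot else None)


if __name__ == "__main__":
    for acc in ("U00096.3", "AB123456", "NC_000913.3", "NZ_ABCD12345678"):
        print(acc, classify_accession(acc), split_version(acc))
```

Keeping the version separate from the base identifier mirrors the point made above: the base stays constant across revisions, while the number after the period changes.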

Family tree


subspecies (ssp. / subsp.) 
the rank immediately subordinate to a species. It is equivalent to "race" in the biological sense.


Microbiology / Virology

A strain is a genetic variant or subtype of a virus, bacterium, or archaean. For example, a "flu strain" is a certain biological form of the influenza or "flu" virus.

biovar (bv.) 
a variant prokaryotic strain that differs physiologically and/or biochemically from other strains in a particular species.
pathovar (pv.) 
a bacterial strain or set of strains with the same or similar characteristics, that is differentiated at infrasubspecific level from other strains of the same species or subspecies on the basis of distinctive pathogenicity to one or more plant hosts.
those strains that differ physiologically.
serovar / serotype 
those strains that have antigenic properties that differ from other strains.


A strain is a group of plants with similar (but not identical) appearance and/or properties. The term has no official status.

cultivar 
a cultivated plant that has been selected and given a unique name because it has desirable characteristics (decorative or useful) that distinguish it from otherwise similar plants of the same species. When propagated it retains those characteristics.


A mouse or a rat strain is a group of animals that are genetically uniform. Strains are used in laboratory experiments. Mouse strains can be inbred, mutated, or genetically engineered, while rat strains are usually inbred.


  1. Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA, Slocombe PM, Smith M (1977-02-24). "Nucleotide sequence of bacteriophage phi X174 DNA". Nature, 265(5596):687–695. PMID 870828. DOI:10.1038/265687a0


  • Blattner FR, et al. (1978). "Cloning human fetal gamma globin and mouse alpha-type globin DNA: preparation and screening of shotgun collections". Science, 202:1279-1284. PMID 725603.
  • Blattner FR, et al. (1997). "The complete genome sequence of Escherichia coli K-12". Science, 277:1432-1434. PMID 9278502.
  • Fleischmann RD, et al. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd". Science, 269:496-512. PMID 7542800.
  • Fraser CM, et al. (1995). "The minimal gene complement of Mycoplasma genitalium". Science, 270:397-403. PMID 7569993.
  • Harrison A, et al. (2005). "Genomic sequence of an otitis media isolate of nontypeable Haemophilus influenzae: comparative study with H. influenzae serotype d, strain KW20". J Bacteriol, 187:4627-4636. PMID 15968074.
  • Kunst F, et al. (1997). "The complete genome sequence of the gram-positive bacterium Bacillus subtilis". Nature, 390:249-256. PMID 15289476.
  • Makino K, et al. (1999). "Complete nucleotide sequence of the prophage VT2-Sakai carrying the verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7 derived from the Sakai outbreak". Genes Genet Syst, 74:227-239. PMID 10734605.
  • Messing J, Crea R, Seeburg PH (1981). "A system for shotgun DNA sequencing". Nucleic Acids Res, 9:309-321. PMID 6259625.
  • Pedersen AG, Jensen LJ, Stærfeldt HH, Brunak S, Ussery DW (2000). "A DNA Structural Atlas for Escherichia coli". J Mol Biol, 299:907-930. PMID 10843847.
  • Sanger F, et al. (1978). "The nucleotide sequence of bacteriophage phiX174". J Mol Biol, 125:225-246. PMID 731693.

External links

  • Cases Database — A continuously-updated, freely-accessible case report database allowing users to interactively explore data from peer-reviewed case reports.
  • Nucleic Acids Research: Database Issue — The latest news on the front of databases is presented annually in the January issue.

Lists of recently sequenced microbial genomes

RNA databases

Protein databases

  • UniProt
    UniProtKB, the Knowledgebase. This is the central access point for curated protein information, including function, classification and cross-references.
    When searching UniProtKB, the hits are reported back as either ‘Swiss-Prot’ or ‘TrEMBL’, depending on which section they come from (the number of hits in Swiss-Prot is generally lower than in TrEMBL, because Swiss-Prot entries are manually reviewed whereas TrEMBL entries are annotated automatically).
    UniRef, the Reference Clusters database, which provides clustered groups of UniProtKB proteins with 100%, 90% or 50% sequence identity.
    UniParc, the Archive database. This stores the complete body of publicly available protein sequence data.
  • PEDANT (Protein Extraction, Description and ANalysis Tool) — by the Munich Information Center for Protein Sequences
An alternative to UniProt. Here, under ‘Bacteria’ you can select your genome of choice from an alphabetical name list (links to the NCBI genome project and taxonomy database are provided for each entry). One click on your species of choice will provide a list of all predicted proteins with a short description and the best BLAST hit. For incomplete genome sequences a list of contigs is available. Protein-encoding genes are separated from RNA genes (called ‘genetic elements’ at this web site); the list of RNA genes is sorted into rRNA, tRNA and ‘miscellaneous’. Stem-loop structures are also specifically listed. For each gene of interest a detailed list is produced of predicted function, localization, protein structure, and general properties. FASTA files can be exported as text files.
  • ProDom (Protein Domain database)
When you work with a novel protein gene for which you have little information, two other databases can be useful. If you want to predict a possible function of a query gene, it is worth searching ProDom (for Protein Domain database). ProDom has a good graphical interface. The database consists of an automatic compilation of homologous domains detected in the Swiss-Prot database using a specific algorithm. It was devised to analyze specific domain arrangements within proteins. ProDom will identify similarities in domains that may or may not have conserved function. In this type of analysis, domain boundaries should always be treated with caution. For some domain families, ProDom has used the opinion of experts to correct domain boundaries on the basis of sequence and structural protein information.
  • ExPASy PROSITE — database of protein domains, families and functional sites. Release 20.83, of 04-Jul-2012 (1647 documentation entries, 1308 patterns, 1036 profiles and 1039 ProRule)
An alternative to ProDom. PROSITE is not that different from ProDom, but it is based on conserved function. The database contains entries of biologically significant sites, patterns and profiles within well-characterized proteins that can help to reliably identify to which known protein family (if any) a new protein sequence belongs.