Difference between revisions of "Exome Project"

From Christoph's Personal Wiki
==See also==
*[[wikipedia:dbSNP]]
*[[wikipedia:1000 Genomes Project]]
*[http://bgiamericas.com/scientific-expertise/collaborative-projects/ List of Collaborative Genome Projects]
*[http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng seqinR] &mdash; a package for the [[R programming language|R]] environment that provides utilities to retrieve and analyse biological sequences.
*[[wikipedia:Exome sequencing]]
*[[wikipedia:Genome-wide association studies]] (GWASs)
*[[wikipedia:Copy-number variation]] (CNV)
*[[wikipedia:De novo transcriptome assembly]]
*[[wikipedia:International HapMap Project]]
*homozygosity mapping
*[[wikipedia:Indel]]
*[[wikipedia:Missense mutation]]
*minor allele frequency (MAF)
*[[wikipedia:Array-comparative genomic hybridization]] (array CGH)
*[[wikipedia:Fisher's exact test]]
*[[wikipedia:Segmental duplication]] / [[wikipedia:Low copy repeats]]
*whole genome assembly comparison (WGAC) and whole genome shotgun sequence detection (WSSD)
*[[wikipedia:Non allelic homologous recombination]]
*[[FASTQ format]]
*[http://genome.ucsc.edu/FAQ/FAQformat.html UCSC Data File Formats FAQ]
*[http://biopython.org/wiki/Multiple_Alignment_Format Multiple Alignment Format] @Biopython
*[http://samtools.sourceforge.net/ SAMtools] &mdash; SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.

 
==References==
*[http://www.nhlbi.nih.gov/resources/geneticsgenomics/programs/mendelian.htm Mendelian Exome Sequencing Project (Mendelian Exome)]
*[http://www.ncbi.nlm.nih.gov/omim Online Mendelian Inheritance in Man] (OMIM) &mdash; @NCBI
*[http://www.hgmd.cf.ac.uk/ac/index.php The Human Gene Mutation Database]
*[http://www.clcbio.com/index.php?id=1330 Manual @CLC Genomics Workbench]

[[Category:Bioinformatics]]

Latest revision as of 21:46, 17 July 2012

The Exome Project

The National Heart, Lung, and Blood Institute (NHLBI) and National Human Genome Research Institute (NHGRI) have funded a new program known as the Exome Project. The goal of this project is to develop cost-effective, high-throughput sequencing of the protein coding regions of the human genome for application in well-phenotyped populations. Three groups are currently funded to test and implement approaches in four key areas — sample preparation, target capture, sequencing, and data management and analysis — to generate an integrated resequencing pipeline with the potential to reduce the cost of exome analysis.[1]
ABSTRACT: Exome sequencing, the targeted sequencing of the subset of the human genome that is protein coding, is a powerful and cost-effective new tool for dissecting the genetic basis of diseases and traits that have proved to be intractable to conventional gene-discovery strategies. Over the past 2 years, experimental and analytical approaches relating to exome sequencing have established a rich framework for discovering the genes underlying unsolved Mendelian disorders. Additionally, exome sequencing is being adapted to explore the extent to which rare alleles explain the heritability of complex diseases and health-related traits. These advances also set the stage for applying exome and whole-genome sequencing to facilitate clinical diagnosis and personalized disease-risk profiling.[2]

Background

The exome is the part of the genome formed by exons, the coding portions of genes that are expressed. Because it provides the genetic blueprint used in the synthesis of proteins and other functional gene products, the exome is the most functionally relevant part of the genome and, therefore, the most likely to contribute to the phenotype of an organism. The exome of the human genome consists of roughly 180,000 exons, constituting about 1% of the total genome or about 30 megabases of DNA.[3] Though it comprises a very small fraction of the genome, the exome is thought to harbor about 85% of disease-causing mutations.[4] Exome sequencing has proved to be an efficient strategy for determining the genetic basis of more than two dozen Mendelian (single-gene) disorders.[2]

Examples of research projects and products using exome sequencing include the nonprofit Personal Genome Project (PGP), the NIH-funded Exome Project, the NHGRI-funded Mendelian Exome Project, the NHLBI Grand Opportunity Exome Sequencing Project, and the microarray-based Nimblegen SeqCap EZ Exome from Roche Applied Science.

Current Exome Project Participants

  • Broad Institute
    • Stacey Gabriel
    • Chad Nusbaum
  • Harvard Medical School
    • George Church
    • Jonathan Seidman
    • Kun Zhang
  • University of Washington
    • Deborah Nickerson
    • Jay Shendure
    • Phil Green
    • Evan Eichler
  • NHLBI
    • Weiniu Gan
    • Alan Michelson
    • Deborah Applebaum-Bowden
  • NHGRI
    • Lu Wang

Glossary

Mendelian disorders 
Phenotypes caused by a mutation (or mutations) in a single gene and inherited in a dominant, recessive or X-linked pattern.
Penetrance 
The proportion of individuals with a specific phenotype among carriers of a particular genotype.
Locus heterogeneity 
The appearance of phenotypically similar characteristics resulting from mutations at different genetic loci. Differences in effect size or in replication between studies and samples are often ascribed to different loci leading to the same disease.
Genome-wide association studies (GWASs)
Studies that search for a population association between a phenotype and a particular allele by screening loci (most commonly by genotyping SNPs) across the entire genome.
Complex traits 
Traits that are influenced by the environment and/or through a combination of variants in at least several genes, each of which has a small effect.
Heritability 
The proportion of the total phenotypic variation in a given characteristic that can be attributed to additive genetic effects.
Next-generation DNA sequencing 
Highly parallelized DNA-sequencing technologies that produce many hundreds of thousands or millions of short reads (25–500 bp) for a low cost and in a short time.
Exome 
The subset of a genome that is protein coding. In addition to the exome, commercially available capture probes target non-coding exons, sequences flanking exons and microRNAs.
Homozygosity mapping 
Narrowing down the location of a gene underlying a trait by searching for regions of the genome in which both chromosomal segments are inherited identically by descent.
sequence depth 
for a given genome, each base has on average been sequenced n number of times:
Coverage = (Nb of Reads)*(Read Length) / (Genome Size)
Sequencing depth represents the (often average) number of nucleotides contributing to a portion of an assembly. On a genome basis, it means that, on average, each base has been sequenced a certain number of times (10X, 20X, ...). For a specific nucleotide, it represents the number of reads that added information about that nucleotide. Depth varies considerably between genomic regions. Consequently, an average sequencing depth of 30X can leave many small portions of a genome unsequenced while others receive many more reads.
coverage 
is used with three distinct meanings:
  1. the theoretical "fold-coverage" of a shotgun sequencing experiment: number of reads * read length / target size
  2. the theoretical or empirical "breadth-of-coverage" of an assembly: assembly size / target size
  3. the empirical average "depth-of-coverage" of an assembly: number of reads * read length / assembly size
(1) and (3) are not the same because of sequencing error and un-clonable/un-mappable regions of the genome. Lander-Waterman theory deals with the relationship between (1) and (2).
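The relationship between meanings (1) and (2) can be sketched numerically. The following is a minimal sketch, not a reference implementation: it assumes the idealized Lander-Waterman setting in which reads land uniformly at random, so per-base depth is approximately Poisson-distributed with mean equal to the fold-coverage. The function names and the read count are hypothetical; the 30 Mb target size is the approximate exome size quoted in the Background section.

```python
import math


def fold_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Theoretical fold-coverage, meaning (1): reads * read length / target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return num_reads * read_length / target_size


def expected_breadth(coverage: float) -> float:
    """Expected fraction of the target covered by at least one read.

    Assumes reads are placed uniformly at random (idealized Poisson model),
    so the probability that a base has depth 0 is exp(-coverage).
    """
    return 1.0 - math.exp(-coverage)


# Hypothetical run: 6 million reads of 100 bp against a ~30 Mb exome target.
c = fold_coverage(num_reads=6_000_000, read_length=100, target_size=30_000_000)
print(f"fold-coverage: {c:.1f}X")  # fold-coverage: 20.0X
print(f"expected breadth at 1X: {expected_breadth(1.0):.3f}")  # 0.632
```

Real capture and sequencing data deviate from this model (capture efficiency, GC bias and unmappable regions all make depth less uniform), so the empirical breadth is typically lower than this estimate for a given fold-coverage.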
monogenic 
simple and rare diseases
multigenic 
complex and common diseases
aneusomy 
the condition in which an organism is made up of cells that contain different numbers of chromosomes.
RPKM 
Reads Per Kilobase of exon model per Million mapped reads. Defined as:[5]
RPKM = totalExonReads / [ mappedReads(millions) * exonLength(KB) ]
where,
totalExonReads 
the number in the column with header Total exon reads in the row for the gene. This is the number of reads that have been mapped to a region in which an exon is annotated for the gene or across the boundaries of two exons or an intron and an exon for an annotated transcript of the gene. For eukaryotes, exons and their internal relationships are defined by annotations of type mRNA.
mappedReads 
the sum of all the numbers in the column with header Total gene reads. The Total gene reads for a gene is the total number of reads that, after mapping, have been assigned to the region of the gene. This includes all the reads uniquely mapped to the region of the gene as well as those reads that match in more than one place and have been allocated to this gene's region. A gene's region comprises the flanking regions (if it was specified in figure 18.127), the exons, the introns and the exon-exon boundaries of all transcripts annotated for the gene. Thus, the sum of the total gene reads numbers is the number of mapped reads for the sample.
exonLength 
the number in the column with the header Exon length in the row for the gene, divided by 1000. This is calculated as the sum of the lengths of all exons annotated for the gene. Each exon is included only once in this sum, even if it is present in more than one annotated transcript for the gene. Partly overlapping exons each count with their full length, even though they share the same region.
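The RPKM definition above can be illustrated with a small worked example. This is only a sketch of the arithmetic: the function name, argument names and the input numbers are hypothetical, and it takes the three quantities as plain numbers rather than reading them from any particular tool's output table.

```python
def rpkm(total_exon_reads: int, mapped_reads: int, exon_length_bp: int) -> float:
    """RPKM = totalExonReads / (mappedReads in millions * exonLength in kb)."""
    if mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("mapped_reads and exon_length_bp must be positive")
    mapped_millions = mapped_reads / 1_000_000  # mappedReads(millions)
    exon_length_kb = exon_length_bp / 1_000     # exonLength(KB)
    return total_exon_reads / (mapped_millions * exon_length_kb)


# Hypothetical gene: 500 exon reads, 20 million mapped reads in the sample,
# 2,500 bp of annotated exon sequence.
print(rpkm(total_exon_reads=500, mapped_reads=20_000_000, exon_length_bp=2_500))  # 10.0
```

Dividing by both the sample's mapped-read count and the exon length is what makes values comparable across samples of different sequencing depth and across genes of different length.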

High-throughput sequencing / next-generation sequencing

Illumina (Solexa) sequencing

Solexa, now part of Illumina, developed a sequencing technology based on reversible dye-terminators. DNA molecules are first attached to primers on a slide and amplified so that local clonal colonies are formed (bridge amplification). Four types of reversible terminator bases (RT-bases) are added, and non-incorporated nucleotides are washed away. Unlike pyrosequencing, the DNA can only be extended one nucleotide at a time. A camera takes images of the fluorescently labeled nucleotides, then the dye along with the terminal 3' blocker is chemically removed from the DNA, allowing the next cycle.[6]

Sequencing by hybridization

see: wikipedia:Sequencing by hybridization

Sequencing by hybridization is a non-enzymatic method that uses a DNA microarray. A single pool of DNA whose sequence is to be determined is fluorescently labeled and hybridized to an array containing known sequences. A strong hybridization signal from a given spot on the array identifies that spot's sequence as present in the DNA being sequenced.[7] Mass spectrometry may be used to determine mass differences between DNA fragments produced in chain-termination reactions.[8]

See also

References

  1. The Exome Project — from the Genome Sciences Dept. at the University of Washington.
  2. Bamshad, MJ; Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J (27 September 2011). "Exome sequencing as a tool for Mendelian disease gene discovery". Nature Reviews Genetics, 12(11): 745-755. PMID 21946919. DOI:10.1038/nrg3031 .
  3. Ng, SB; Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J (10 September 2009). "Targeted capture and massively parallel sequencing of 12 human exomes". Nature, 461(7261): 272-276. DOI:10.1038/nature08250 .
  4. Choi, M; Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloğlu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, Lifton RP (10 November 2009). "Genetic diagnosis by whole exome capture and massively parallel DNA sequencing". PNAS, 106(45): 19096-19101. DOI:10.1073/pnas.0910672106 .
  5. Morin RD, O'Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M, Eaves CJ, Marra MA (2008). "Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells". Genome Res., 18(4):610-21. PMID 18285502. PMCID: PMC2279248.
  6. Mardis ER (2008). "Next-generation DNA sequencing methods". Annu Rev Genomics Hum Genet, 9: 387–402. PMID 18576944. DOI:10.1146/annurev.genom.9.081307.164359 .
  7. Hanna GJ, Johnson VA, Kuritzkes DR, Richman DD, Martinez-Picado J, Sutton L, Hazelwood JD, d'Aquila RT (1 July 2000). "Comparison of sequencing by hybridization and cycle sequencing for genotyping of Human Immunodeficiency Virus Type 1 Reverse Transcriptase". J. Clin. Microbiol., 38(7): 2715–2721. PMID 10878069.
  8. Edwards JR, Ruparel H, Ju J (2005). "Mass-spectrometry DNA sequencing". Mutation Research, 573(1–2): 3–12. PMID 15829234. DOI:10.1016/j.mrfmmm.2004.07.021 .

Further reading

  • Chakravarti A (2011). "Genomic contributions to Mendelian disease". Genome Res. 21: 643-644. DOI:10.1101/gr.123554.111 .

External links