Difference between revisions of "Exome Project"

From Christoph's Personal Wiki

Revision as of 21:17, 16 July 2012

The Exome Project

The National Heart, Lung, and Blood Institute (NHLBI) and National Human Genome Research Institute (NHGRI) have funded a new program known as the Exome Project. The goal of this project is to develop cost-effective, high-throughput sequencing of the protein coding regions of the human genome for application in well-phenotyped populations. Three groups are currently funded to test and implement approaches in four key areas — sample preparation, target capture, sequencing, and data management and analysis — to generate an integrated resequencing pipeline with the potential to reduce the cost of exome analysis.[1]
ABSTRACT: Exome sequencing, the targeted sequencing of the subset of the human genome that is protein coding, is a powerful and cost-effective new tool for dissecting the genetic basis of diseases and traits that have proved to be intractable to conventional gene-discovery strategies. Over the past 2 years, experimental and analytical approaches relating to exome sequencing have established a rich framework for discovering the genes underlying unsolved Mendelian disorders. Additionally, exome sequencing is being adapted to explore the extent to which rare alleles explain the heritability of complex diseases and health-related traits. These advances also set the stage for applying exome and whole-genome sequencing to facilitate clinical diagnosis and personalized disease-risk profiling.[2]

Background

The exome is the part of the genome formed by exons, the coding portions of genes that are expressed. Providing the genetic blueprint used in the synthesis of proteins and other functional gene products, the exome is the most functionally relevant part of the genome and, therefore, the most likely to contribute to the phenotype of an organism. The exome of the human genome consists of roughly 180,000 exons constituting about 1% of the total genome, or about 30 megabases of DNA.[3] Though it comprises a very small fraction of the genome, the exome is thought to harbor about 85% of disease-causing mutations.[4] Exome sequencing has proved to be an efficient strategy for determining the genetic basis of more than two dozen Mendelian or single-gene disorders.[2]

Examples of research projects using exome sequencing include the nonprofit Personal Genome Project (PGP), the NIH-funded Exome Project, the NHGRI-funded Mendelian Exome Project, the NHLBI Grand Opportunity Exome Sequencing Project and the microarray-based Nimblegen SeqCap EZ Exome from Roche Applied Science.

Current Exome Project Participants

  • Broad Institute
    • Stacey Gabriel
    • Chad Nusbaum
  • Harvard Medical School
    • George Church
    • Jonathan Seidman
    • Kun Zhang
  • University of Washington
    • Deborah Nickerson
    • Jay Shendure
    • Phil Green
    • Evan Eichler
  • NHLBI
    • Weiniu Gan
    • Alan Michelson
    • Deborah Applebaum-Bowden
  • NHGRI
    • Lu Wang

Glossary

Mendelian disorders 
Phenotypes caused by a mutation (or mutations) in a single gene and inherited in a dominant, recessive or X-linked pattern.
Penetrance 
The proportion of individuals with a specific phenotype among carriers of a particular genotype.
Locus heterogeneity 
The appearance of phenotypically similar characteristics resulting from mutations at different genetic loci. Differences in effect size or in replication between studies and samples are often ascribed to different loci leading to the same disease.
Genome-wide association studies (GWASs)
Studies that search for a population association between a phenotype and a particular allele by screening loci (most commonly by genotyping SNPs) across the entire genome.
Complex traits 
Traits that are influenced by the environment and/or through a combination of variants in at least several genes, each of which has a small effect.
Heritability 
The proportion of the total phenotypic variation in a given characteristic that can be attributed to additive genetic effects.
Next-generation DNA sequencing 
Highly parallelized DNA-sequencing technologies that produce many hundreds of thousands or millions of short reads (25–500 bp) for a low cost and in a short time.
Exome 
The subset of a genome that is protein coding. In addition to the exome, commercially available capture probes target non-coding exons, sequences flanking exons and microRNAs.
Homozygosity mapping 
Narrowing down the location of a gene underlying a trait by searching for regions of the genome in which both chromosomal segments are inherited identical-by-descent.
sequence depth 
For a given genome, the average number of times (n) that each base has been sequenced:
Coverage = (Number of Reads) * (Read Length) / (Genome Size)
Sequencing depth represents the (often average) number of nucleotides contributing to a portion of an assembly. On a genome basis, it means that, on average, each base has been sequenced a certain number of times (10X, 20X...). For a specific nucleotide, it represents the number of sequences that added information about that nucleotide. Depth varies considerably depending on the genomic region. As a consequence, an average sequencing depth of 30X leaves many small portions of a genome unsequenced, while others receive far more reads.
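The depth formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the example run (900 million reads of 100 bp against a 3 Gb genome) are assumptions chosen for round numbers, not values from any particular experiment.

```python
def average_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth: (number of reads * read length) / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical run: 900 million reads of 100 bp against a 3 Gb genome.
depth = average_depth(900_000_000, 100, 3_000_000_000)
print(f"{depth:.1f}X")  # prints 30.0X
```

Note that this is a genome-wide average only; it says nothing about how evenly the reads are distributed across regions.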
coverage 
appears to have 3 meanings:
  1. the theoretical "fold-coverage" of a shotgun sequencing experiment: number of reads * read length / target size
  2. the theoretical or empirical "breadth-of-coverage" of an assembly: assembly size / target size
  3. the empirical average "depth-of-coverage" of an assembly: number of reads * read length / assembly size
(1) and (3) are not the same because of sequencing error and un-clonable/un-mappable regions of the genome. Lander-Waterman theory deals with the relationship between (1) and (2).
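The three meanings of coverage listed above can be written out side by side as a rough Python sketch. All function names and the example numbers are hypothetical. The last function uses the Poisson approximation commonly associated with Lander-Waterman theory, under which the expected fraction of the target covered by at least one read is 1 - e^(-c) for fold-coverage c; it assumes reads land uniformly at random, which real data rarely satisfy.

```python
import math


def fold_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """(1) Theoretical fold-coverage of a shotgun sequencing experiment."""
    return num_reads * read_length / target_size


def breadth_of_coverage(assembly_size: int, target_size: int) -> float:
    """(2) Breadth-of-coverage: fraction of the target represented in the assembly."""
    return assembly_size / target_size


def depth_of_coverage(num_reads: int, read_length: int, assembly_size: int) -> float:
    """(3) Empirical average depth-of-coverage over the assembled bases."""
    return num_reads * read_length / assembly_size


def expected_breadth(fold: float) -> float:
    """Expected breadth for a given fold-coverage (Poisson approximation;
    assumes reads are placed uniformly at random)."""
    return 1.0 - math.exp(-fold)


# Hypothetical experiment: 1 million reads of 100 bp, a 10 Mb target, a 9 Mb assembly.
reads, length, target, assembly = 1_000_000, 100, 10_000_000, 9_000_000
print(fold_coverage(reads, length, target))        # (1)
print(breadth_of_coverage(assembly, target))       # (2)
print(depth_of_coverage(reads, length, assembly))  # (3)
print(expected_breadth(fold_coverage(reads, length, target)))
```

In this made-up example (3) comes out larger than (1), because the same number of sequenced bases is spread over a smaller assembly than the nominal target.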
monogenic 
simple and rare diseases
multigenic 
complex and common diseases
RPKM 
Reads Per Kilobase of exon model per Million mapped reads. Defined as:[5]
RPKM = totalExonReads / [ mappedReads (millions) * exonLength (kb) ]
where,
totalExonReads 
the number in the column with header Total exon reads in the row for the gene. This is the number of reads that have been mapped to a region in which an exon is annotated for the gene or across the boundaries of two exons or an intron and an exon for an annotated transcript of the gene. For eukaryotes, exons and their internal relationships are defined by annotations of type mRNA.
mappedReads 
the sum of all the numbers in the column with header Total gene reads. The Total gene reads for a gene is the total number of reads that, after mapping, have been mapped to the region of the gene. This includes all the reads uniquely mapped to the region of the gene as well as those reads that match in more than one place (below the limit set in the dialog in figure 18.127) and have been allocated to this gene's region. A gene's region comprises the flanking regions (if specified in figure 18.127), the exons, the introns and the exon-exon boundaries of all transcripts annotated for the gene. Thus, the sum of the Total gene reads numbers is the number of mapped reads for the sample. This number can be found in the RNA-seq report's table 3.1, in the 'Total' entry of the row 'Counted fragments'. (The term 'fragment' is used in place of the term 'read' because, if you analyze paired reads and have chosen the 'Default counting scheme', it is fragments that are counted rather than reads; two reads in a pair will be counted as one fragment.)
exonLength 
the number in the column with the header Exon length in the row for the gene, divided by 1000. This is calculated as the sum of the lengths of all exons annotated for the gene. Each exon is included only once in this sum, even if it is present in more than one annotated transcript for the gene. Partly overlapping exons each count with their full length, even though they share the same region.
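As a quick sanity check on the RPKM definition, here is a minimal Python sketch. The function and parameter names are hypothetical; it takes the raw mapped-read count and the exon length in base pairs and does the "millions" and "kb" scaling itself. The example values are invented.

```python
def rpkm(total_exon_reads: int, mapped_reads: int, exon_length_bp: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads.

    total_exon_reads: reads mapped to the gene's exons
    mapped_reads:     total mapped reads (or fragments) in the sample
    exon_length_bp:   summed exon length for the gene, in base pairs
    """
    if mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("mapped_reads and exon_length_bp must be positive")
    mapped_millions = mapped_reads / 1_000_000
    exon_length_kb = exon_length_bp / 1_000
    return total_exon_reads / (mapped_millions * exon_length_kb)


# Hypothetical gene: 500 exon reads, 20 million mapped reads, 2,500 bp of exons.
print(rpkm(500, 20_000_000, 2_500))  # prints 10.0
```

Dividing by both the sample's mapped reads and the gene's exon length is what makes values comparable across samples of different sequencing depth and across genes of different size.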

See also

References

  1. The Exome Project — from the Genome Sciences Dept. at the University of Washington.
  2. Bamshad, MJ; Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J (27 September 2011). "Exome sequencing as a tool for Mendelian disease gene discovery". Nature Reviews Genetics, 12(11): 745-755. PMID 21946919. DOI:10.1038/nrg3031 .
  3. Ng, SB; Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J (10 September 2009). "Targeted capture and massively parallel sequencing of 12 human exomes". Nature, 461(7261): 272-276. DOI:10.1038/nature08250 .
  4. Choi, M; Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloğlu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, Lifton RP (10 November 2009). "Genetic diagnosis by whole exome capture and massively parallel DNA sequencing". PNAS, 106(45): 19096-19101. DOI:10.1073/pnas.0910672106 .
  5. Morin RD, O'Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M, Eaves CJ, Marra MA (2008). "Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells". Genome Res., 18(4):610-21. PMID: 18285502. PMCID: PMC2279248.

Further reading

  • Chakravarti A (2011). "Genomic contributions to Mendelian disease". Genome Res. 21: 643-644. DOI:10.1101/gr.123554.111 .

External links