b783f9e658057de6ade55b8d8b8932c9e6d606c8 brianlee Fri Mar 11 14:13:58 2022 -0800 At b0b's request revising the Platinum Genomes Track Description page, no RM diff --git src/hg/makeDb/trackDb/human/platinumGenomes.html src/hg/makeDb/trackDb/human/platinumGenomes.html index 0f9efd4..da191f9 100644 --- src/hg/makeDb/trackDb/human/platinumGenomes.html +++ src/hg/makeDb/trackDb/human/platinumGenomes.html @@ -1,55 +1,73 @@
-Improvement of variant calling in next-generation sequence data requires -a comprehensive, genome-wide catalog of high-confidence variants called in -a set of genomes for use as a benchmark. We generated deep, whole-genome -sequence data of 17 individuals in a three-generation pedigree and called -variants in each genome using a range of currently available algorithms. -We used haplotype transmission information to create a phased "Platinum" -variant catalog of 4.7 million single-nucleotide variants (SNVs) -plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are -consistent with the pattern of inheritance in the parents and 11 children -of this pedigree. Platinum genotypes are highly concordant with the current -catalog of the National Institute of Standards and Technology for -both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog -that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that -were consistent between informatics pipelines yet inconsistent with haplotype -transmission ("nonplatinum") revealed that the majority of these variants -are de novo and cell-line mutations or reside within previously unidentified -duplications and deletions. The reference materials from this study are a -resource for objective assessment of the accuracy of variant calls -throughout genomes. -
- +These tracks show high-confidence "Platinum Genome" variant calls for two individuals, +NA12877 and NA12878, part of a sequenced 17-member pedigree for family number +1463, from the Centre d'Etude du Polymorphisme Humain (CEPH). The hybrid +track displays a merging of the NA12878 results with variant calls produced by Genome in a +Bottle, discussed further below. CEPH is an international genetic research center that provides +a resource of immortalised cell cultures used to map genetic markers, and pedigree 1463 +represents a family lineage from Utah of four grandparents, two parents, and 11 children. +The whole pedigree was sequenced to 50x depth on a HiSeq 2000 Illumina system, which is +considered a platinum standard, where platinum refers to the quality and completeness of +the resulting assembly, such as providing full chromosome scaffolds with phasing and +haplotypes resolved across the entire genome. +-The 'hybrid' truthsets were generated by merging Genome in a Bottle -high confidence calls (hg001, v3.3.2) with those from the Platinum -Genomes truthset for the same sample (NA12878, v2017-1.0). Merged -records were validated by performing a k-mer test on alignments from -the lower pedigree CEPH 1463 (11 children). Records with k-mer support -via haplotype inheritance were added to the hybrid truthset. -
+This figure depicts the pedigree of the family sequenced for this study, where the ID for each +sample is defined by adding the prefix NA128 to each numbered individual, so that 77 = NA12877 +and 78 = NA12878, corresponding to the VCF tracks available in this track set. The dark orange +individuals indicate sequences used in the analysis methods, whereas the blue individuals represent the +founder generations (grandparents), which were also sequenced and used in validation steps. +The genomes of the parent-child trio on the top right side, 91-92-78, were also sequenced +during Phase I of the 1000 Genomes Project. ++These tracks represent a comprehensive genome-wide set of phased small variants that has been +validated to high confidence. Sequencing and phasing a larger pedigree, beyond the two parents +and one child, increases the ability to detect errors and assess the accuracy of more of the +variants compared to a standard trio analysis. The genetic inheritance data enables creating a more +comprehensive catalog of "platinum variants" that reflects both high accuracy and +completeness. These results are significant because a comprehensive set of valid +single-nucleotide variants (SNVs) and insertions and deletions (indels), +in both the easy and difficult parts of the genome, provides a vital resource for software +developers creating the next generation of variant callers, since these are the areas where +current methods most need training data to improve. Since every one of the +variants in this catalog is phased, this data set provides a resource to better assess emerging +technologies designed to generate valid phasing information. To generate the calls, six analysis +pipelines were used to call SNVs and indels, and the results were merged into one catalog, where the sensitivity of +the genetic inheritance information helped to detect genotyping errors and maximize the chance of +including only true variants, some of which might otherwise be removed by suboptimal filtering. 
Read more +about the detailed methods in the referenced paper, further describing this variant catalog +of 4.7 million SNVs plus 0.7 million small (1-50 bp) indels, that are all consistent with +the pattern of inheritance in the parents and 11 children of this pedigree.
++The hybrid track in this set extends the characterisation of NA12878 +by incorporating high-confidence calls produced by Genome in a Bottle analysis. +The resulting merged files contain more comprehensive coverage of variation than either +set independently; for instance, the hg19 version contains over 80,000 more indels than +either input set. Read more about the hybrid methods at the following link: +https://github.com/Illumina/PlatinumGenomes/wiki/Hybrid-truthset
The VCF files for this track can be obtained from the download server:
https://hgdownload.soe.ucsc.edu/gbdb/$db/platinumGenomes/.
These files were obtained from the Platinum Genomes source archive:
https://s3.eu-central-1.amazonaws.com/platinum-genomes/2017-1.0/ReleaseNotes.txt.
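Since every variant in this catalog is phased, a downloaded record can be checked for phasing by reading its GT field. The following is a minimal Python sketch, not part of the track's own tooling. It assumes the standard VCF layout (tab-separated columns, with FORMAT in column 9 and the first sample in column 10) and the convention that "|" separates phased alleles and "/" separates unphased ones. The function names and the example record are hypothetical, and the sketch does not assume any particular file name on the download server.

```python
# URL template quoted from the text above; "$db" stands for an assembly name such as hg19.
URL_TEMPLATE = "https://hgdownload.soe.ucsc.edu/gbdb/$db/platinumGenomes/"


def download_dir(db):
    """Return the download directory URL for one assembly, e.g. "hg19"."""
    return URL_TEMPLATE.replace("$db", db)


def parse_genotype(vcf_line, sample_index=0):
    """Return (is_phased, allele_bases) for one sample of a VCF data line.

    Hypothetical helper: assumes a standard tab-separated VCF record with a
    FORMAT column that contains a GT key.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected at least 10 tab-separated VCF columns")

    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    gt = sample_values[format_keys.index("GT")]

    # "|" marks a phased genotype, "/" an unphased one.
    is_phased = "|" in gt
    separator = "|" if is_phased else "/"

    # Allele index 0 is REF; indexes 1..n follow the comma-separated ALT list.
    alleles = [fields[3]] + fields[4].split(",")
    bases = [None if a == "." else alleles[int(a)] for a in gt.split(separator)]
    return is_phased, bases


if __name__ == "__main__":
    # Made-up record for illustration only; not taken from the Platinum Genomes files.
    record = "chr1\t12345\t.\tA\tG\t.\tPASS\t.\tGT\t0|1"
    print(download_dir("hg19"))
    print(parse_genotype(record))  # (True, ['A', 'G'])
```

For whole files, an established VCF library is likely a better choice than hand parsing. The sketch is only meant to show how phasing appears in an individual record.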
Eberle MA, Fritzilas E, Krusche P, Källberg M, Moore BL, Bekritsky MA, Iqbal Z, Chuang HY, Humphray SJ, Halpern AL et al. A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 2017 Jan;27(1):157-164. PMID: 27903644; PMC: PMC5204340