b783f9e658057de6ade55b8d8b8932c9e6d606c8 brianlee Fri Mar 11 14:13:58 2022 -0800
At b0b's request revising the Platinum Genomes Track Description page, no RM

diff --git src/hg/makeDb/trackDb/human/platinumGenomes.html src/hg/makeDb/trackDb/human/platinumGenomes.html
index 0f9efd4..da191f9 100644
--- src/hg/makeDb/trackDb/human/platinumGenomes.html
+++ src/hg/makeDb/trackDb/human/platinumGenomes.html
@@ -1,55 +1,73 @@
 <h2>Abstract</h2>
-
 <p>
-Improvement of variant calling in next-generation sequence data requires
-a comprehensive, genome-wide catalog of high-confidence variants called in
-a set of genomes for use as a benchmark. We generated deep, whole-genome
-sequence data of 17 individuals in a three-generation pedigree and called
-variants in each genome using a range of currently available algorithms.
-We used haplotype transmission information to create a phased "Platinum"
-variant catalog of 4.7 million single-nucleotide variants (SNVs)
-plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are
-consistent with the pattern of inheritance in the parents and 11 children
-of this pedigree. Platinum genotypes are highly concordant with the current
-catalog of the National Institute of Standards and Technology for
-both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog
-that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that
-were consistent between informatics pipelines yet inconsistent with haplotype
-transmission ("nonplatinum") revealed that the majority of these variants
-are de novo and cell-line mutations or reside within previously unidentified
-duplications and deletions. The reference materials from this study are a
-resource for objective assessment of the accuracy of variant calls
-throughout genomes.
-</p>
-
+These tracks show high-confidence "Platinum Genome" variant calls for two individuals,
+NA12877 and NA12878, part of a sequenced 17-member pedigree for family number
+<a href="https://catalog.coriell.org/0/Sections/Collections/NIGMS/CEPHFamiliesDetail.aspx?PgId=441&fam=1463"
+target="_blank">1463</a>, from the Centre d'Etude du Polymorphisme Humain (CEPH). The hybrid
+track displays a merging of the NA12878 results with variant calls produced by Genome in a
+Bottle, discussed further below. CEPH is an international genetic research center that provides
+a resource of immortalised cell cultures used to map genetic markers, and pedigree 1463
+represents a family lineage from Utah of four grandparents, two parents, and 11 children.
+The whole pedigree was sequenced to 50x depth on a HiSeq 2000 Illumina system, which is
+considered a platinum standard, where platinum refers to the quality and completeness of
+the resulting assembly, such as providing full chromosome scaffolds with phasing and
+haplotypes resolved across the entire genome.</p>
+<p><img class="text-center" src="/images/platinumTree.jpg" width="400px"></p>
 <p>
-The 'hybrid' truthsets were generated by merging Genome in a Bottle
-high confidence calls (hg001, v3.3.2) with those from the Platinum
-Genomes truthset for the same sample (NA12878, v2017-1.0). Merged
-records were validated by performing a k-mer test on alignments from
-the lower pedigree CEPH 1463 (11 children). Records with k-mer support
-via haplotype inheritance were added to the hybrid truthset.
-</p>
+This figure depicts the pedigree of the family sequenced for this study, where the ID for each
+sample is defined by adding the prefix NA128 to each numbered individual, so that 77 = NA12877
+and 78 = NA12878, corresponding to the VCF tracks available in this track set.
The dark orange
+individuals indicate sequences used in the analysis methods, whereas the blue individuals represent the
+founder generations (grandparents), which were also sequenced and used in validation steps.
+The genomes of the parent-child trio on the top right side, 91-92-78, were also sequenced
+during Phase I of the 1000 Genomes Project.</p>
+<p>
+These tracks represent a comprehensive genome-wide set of phased small variants that has been
+validated to high confidence. Sequencing and phasing a larger pedigree, beyond the two parents
+and one child, increases the ability to detect errors and assess the accuracy of more of the
+variants compared to a standard trio analysis. The genetic inheritance data enables creating a more
+comprehensive catalog of "platinum variants" that reflects both high accuracy and
+completeness. These results are significant because a comprehensive set of valid
+single-nucleotide variants (SNVs) and insertions and deletions (indels),
+in both the easy and difficult parts of the genome, provides a vital resource for software
+developers creating the next generation of variant callers, because these are the areas where
+current methods most need training data to improve. Since every one of the
+variants in this catalog is phased, this data set provides a resource to better assess emerging
+technologies designed to generate valid phasing information. To generate the calls, six analysis
+pipelines were used to call SNVs and indels, and the results were merged into one catalog, where
+the genetic inheritance information helped detect genotyping errors and maximize the chance of
+including only true variants, some of which might otherwise be removed by suboptimal filtering.
Read more
+about the detailed methods in the referenced paper, which further describes this variant catalog
+of 4.7 million SNVs plus 0.7 million small (1-50 bp) indels that are all consistent with
+the pattern of inheritance in the parents and 11 children of this pedigree.</p>
+<p>
+The hybrid track in this set extends the characterisation of NA12878
+by incorporating high-confidence calls produced by Genome in a Bottle analysis.
+The resulting merged files contain more comprehensive coverage of variation than either
+set independently; for instance, the hg19 version contains over 80,000 more indels than
+either input set. Read more about the hybrid methods at the following link:
+<a href="https://github.com/Illumina/PlatinumGenomes/wiki/Hybrid-truthset"
+target="_blank">https://github.com/Illumina/PlatinumGenomes/wiki/Hybrid-truthset</a></p>
 <h2>Data Access</h2>
 <p>
 The VCF files for this track can be obtained from the download server:
 <a href="https://hgdownload.soe.ucsc.edu/gbdb/$db/platinumGenomes/" target=_blank>
 https://hgdownload.soe.ucsc.edu/gbdb/$db/platinumGenomes/</a>.<br>
 These files were obtained from the Platinum genomes source archive:
 <a href="https://s3.eu-central-1.amazonaws.com/platinum-genomes/2017-1.0/ReleaseNotes.txt"
 target=_blank>https://s3.eu-central-1.amazonaws.com/platinum-genomes/2017-1.0/ReleaseNotes.txt</a>.
 </p>
 <h2>Reference</h2>
 <p>
 Eberle MA, Fritzilas E, Krusche P, Källberg M, Moore BL, Bekritsky MA, Iqbal Z, Chuang HY,
 Humphray SJ, Halpern AL <em>et al</em>.
 <a href="https://genome.cshlp.org/content/27/1/157" target="_blank">
 A reference data set of 5.4 million phased human variants validated by genetic inheritance from
 sequencing a three-generation 17-member pedigree</a>.
 <em>Genome Res</em>. 2017 Jan;27(1):157-164.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/27903644" target="_blank">27903644</a>; PMC: <a
 href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5204340/" target="_blank">PMC5204340</a>
 </p>