9a31f94233618b5b6c16814d550efddc226d5340
kuhn
Thu Oct 28 16:29:22 2021 -0700
minor wording changes

diff --git src/hg/makeDb/trackDb/human/dbSnp153Composite.html src/hg/makeDb/trackDb/human/dbSnp153Composite.html
index 09a717b..5398446 100644
--- src/hg/makeDb/trackDb/human/dbSnp153Composite.html
+++ src/hg/makeDb/trackDb/human/dbSnp153Composite.html
@@ -1,34 +1,34 @@
 <h2>Description</h2>
 <p>
 This track shows short genetic variants
 (up to approximately 50 base pairs) from
 <A HREF="https://www.ncbi.nlm.nih.gov/SNP/" target=_blank>dbSNP</A>
 build 153:
 single-nucleotide variants (SNVs),
-small insertions, deletions, and complex deletion/insertions,
+small insertions, deletions, and complex deletion/insertions (indels),
 relative to the reference genome assembly.
 Most variants in dbSNP are rare, not true polymorphisms,
 and some variants are known to be pathogenic.
 </p><p>
 For hg38 (GRCh38), approximately 667 million distinct variants
 (RefSNP clusters with rs# ids)
-have been mapped to over 702 million genomic locations
+have been mapped to more than 702 million genomic locations
 including alternate haplotype and fix patch sequences.
 dbSNP remapped variants from hg38 to hg19 (GRCh37);
 approximately 658 million distinct variants were mapped to
-over 683 million genomic locations
+more than 683 million genomic locations
 including alternate haplotype and fix patch sequences (not
 all of which are included in UCSC's hg19).
 </p>
 <p>
 This track includes four subtracks of variants:
 <ul>
 <li><b>All dbSNP (153)</b>: the entire set (683 million for hg19, 702 million for hg38)
 </li>
 <li><b>Common dbSNP (153)</b>: approximately 15 million variants with a minor allele
 frequency (MAF) of at least 1% (0.01) in the 1000 Genomes Phase 3 dataset.
 Variants in the Mult. subset (below) are excluded.
 </li>
 <li><b>ClinVar dbSNP (153)</b>: approximately 455,000 variants mentioned in ClinVar.
 <b>Note:</b> that includes both benign and pathogenic (as well as uncertain) variants.
 Variants in the Mult. subset (below) are excluded.
@@ -65,37 +65,37 @@
 <p>
 SNVs and pure deletions are displayed as boxes covering the affected base(s).
 Pure insertions are drawn as single-pixel tickmarks between
 the base before and the base after the insertion.
 </p><p>
 Insertions and/or deletions in repetitive regions may be represented by a half-height box
 showing uncertainty in placement, followed by a full-height box showing the number of deleted bases,
 or a full-height tickmark to indicate an insertion.
 When an insertion or deletion falls in a repetitive region, the placement may be ambiguous.
 For example, if the reference genome contains "TAAAG" but some
 individuals have "TAAG" at the same location, then the variant is a deletion of a single
 A relative to the reference genome.
 However, which A was deleted? There is no way to tell whether the first, second or third A
 was removed.
 Different variant mapping tools may place the deletion at different bases in the reference genome.
-In order to reduce errors in merging variant calls made with different left vs. right biases,
+To reduce errors in merging variant calls made with different left vs. right biases,
 dbSNP made a major change in its representation of deletion/insertion variants in build 152.
 Now, instead of assigning a single-base genomic location at one of the A's,
 dbSNP expands the coordinates to encompass the whole repetitive region,
 so the variant is represented as a deletion of 3 A's combined with an insertion of 2 A's.
 In the track display, there will be a half-height box covering the first two A's,
-followed by a full-height box covering the third A, in order to show a net loss of one base
+followed by a full-height box covering the third A, to show a net loss of one base
 but an uncertain placement within the three A's.
 </p>
 <p>
 Variants are colored according to functional effect on genes annotated by dbSNP:
 </p>
 <p><b><font color=red>Protein-altering variants and splice site variants are red</font></b>.
 <br><b><font color=green>Synonymous codon variants are green</font></b>.
 <br><b><font color=blue>
 Non-coding transcript or Untranslated Region (UTR) variants are blue</font></b>.
 </p>
 <p>
@@ -106,31 +106,31 @@
 major/minor alleles (when available) and minor allele frequency (when available).
 Allele frequencies are reported independently by twelve projects
 (some of which may have overlapping sets of samples):
 <ul>
 <li><a href="https://www.internationalgenome.org/" target=_blank>1000Genomes</a>:
 The 1000 Genomes Phase 3 dataset contains data for 2,504 individuals from 26 populations.
 </li>
 <li><a href="https://gnomad.broadinstitute.org/" target=_blank>GnomAD exomes</a>:
 The gnomAD
 <a href="https://macarthurlab.org/2018/10/17/gnomad-v2-1/" target=_blank>v2.1</a>
 exome dataset comprises a total of 16 million SNVs and 1.2 million indels from 125,748 exomes
 in 14 populations.
 </li>
 <li><a href="https://www.nhlbiwgs.org/" target=_blank>TOPMED</a>:
- The TOPMED dataset contains phase 3 data from freeze 5 panel that include over 60,000
+ The TOPMED dataset contains phase 3 data from freeze 5 panel that include more than 60,000
 individuals. The approximate ethnic breakdown is European(52%), African (31%),
 Hispanic or Latino (10%), and East Asian (7%) ancestry.
 </li>
 <li><a href="http://exac.broadinstitute.org/" target=_blank>ExAC</a>:
 The Exome Aggregation Consortium (ExAC) dataset contains 60,706 unrelated individuals
 sequenced as part of various disease-specific and population genetic studies.
 Individuals affected by severe pediatric disease have been removed.
 </li>
 <li><a href="https://www.pagestudy.org/" target=_blank>PAGE STUDY</a>:
 The PAGE Study: How Genetic Diversity Improves Our Understanding of the Architecture of
 Complex Traits.
 </li>
 <li><a href="https://gnomad.broadinstitute.org/" target=_blank>GnomAD genomes</a>:
 The gnomAD
 <a href="https://macarthurlab.org/2018/10/17/gnomad-v2-1/" target=_blank>v2.1</a>
@@ -394,43 +394,43 @@
 <td>At least one other mapping of this variant has erroneous coordinates.
 The mapping(s) with erroneous coordinates are excluded from this track
 and are included in the Map Err subtrack. Sometimes despite this mapping
 having legal coordinates, there may still be an issue with this mapping's
 coordinates and alleles; you may want to click through to dbSNP to compare
 the initial submission's coordinates and alleles.
 In hg19, 55454 distinct rsIDs are affected; in hg38, 86636.
 </tr>
 </table>
 <h2>Data Sources and Methods</h2>
 <p>
 dbSNP has collected genetic variant reports from researchers worldwide for
 <a href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/10/07/dbsnp-celebrates-20-years/"
- target=_blank>over 20 years</a>.
+ target=_blank>more than 20 years</a>.
 Since the advent of next-generation sequencing methods and the population sequencing efforts
 that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline,
 web infrastructure, and download files.
 (Holmes <em>et al.</em>)
 The same challenges of exponential growth affected UCSC's presentation of dbSNP variants,
 so we have taken the opportunity to change our internal representation and import pipeline.
 Most notably, flanking sequences are no longer provided by dbSNP,
-since most submissions have been genomic variant calls in VCF format as opposed to
+because most submissions have been genomic variant calls in VCF format as opposed to
 independent sequences.
 </p>
 <p>
-We downloaded dbSNP's JSON files available from
+We downloaded JSON files available from dbSNP at
 <a href="ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b153/JSON/"
 target=_blank>ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b153/JSON/</a>,
 extracted a subset of the information about each variant, and collated
 it into a bigBed file using the
 <a href="https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/bigDbSnp.as"
 target=_blank>bigDbSnp.as</a> schema with the information necessary for filtering and
 displaying the variants,
 as well as a separate file containing more detailed information to be displayed on each
 variant's details page
 (<a href="https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/dbSnpDetails.as"
 target=_blank>dbSnpDetails.as</a> schema).
 <h2>Data Access</h2>
 <p>
 Since dbSNP has grown to include approximately 700 million variants, the size of the All dbSNP (153)
@@ -498,31 +498,31 @@
 </tr>
 <tr>
 <td colspan=3>
 <a href="http://hgdownload.soe.ucsc.edu/gbdb/hgFixed/dbSnp/dbSnp153Details.tab.gz"
 target=_blank>dbSnp153Details.tab.gz</a>
 </td>
 <td>gzip-compressed tab-separated text</td>
 <td>Detailed variant properties, independent of genome assembly version</td>
 </tr>
 </table>
 </p>
 <p>
 Several utilities for working with bigBed-formatted binary files can be downloaded
 <a href="http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads" target=_blank>here</a>.
-Run a utility with no arguments in order to see a brief description of the utility and its options.
+Run a utility with no arguments to see a brief description of the utility and its options.
 <ul>
 <li><b>bigBedInfo</b> provides summary statistics about a bigBed file including the number of items
 in the file. With the <b>-as</b> option, the output includes an autoSql definition of data columns,
 useful for interpreting the column values.</li>
 <li><b>bigBedToBed</b> converts the binary bigBed data to tab-separated text.
 Output can be restricted to a particular region by using the -chrom, -start and -end options.</li>
 <li><b>bigBedNamedItems</b> extracts rows for one or more rs# IDs.</li>
 </ul>
 </p>
 <h4>Example: retrieve all variants in the region chr1:200001-200400</h4>
 <pre><tt>bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp153.bb -chrom=chr1 -start=200000 -end=200400 stdout</tt></pre>
@@ -553,31 +553,31 @@
 <li><a href="https://esp.gs.washington.edu/" target=_blank>GoESP</a></li>
 <li><a href="https://www.geenivaramu.ee/en" target=_blank>Estonian</a></li>
 <li><a href="http://www.bris.ac.uk/alspac/participants/genome/" target=_blank>ALSPAC</a></li>
 <li><a href="https://twinsuk.ac.uk/" target=_blank>TWINSUK</a></li>
 <li><a href="https://swefreq.nbis.se/dataset/SweGen" target=_blank>NorthernSweden</a></li>
 <li><a href="https://genomes.vn" target=_blank>Vietnamese</a></li>
 </ol>
 </p><p>
 UCSC also has an
 <a href="../goldenPath/help/api.html" target=_blank>API</a>
 that can be used to retrieve values from a particular chromosome range.
 </p><p>
 A list of rs# IDs can be pasted/uploaded in the
 <a href="hgVai" target=_blank>Variant Annotation Integrator</a>
-tool in order to find out which genes (if any) the variants are located in,
+tool to find out which genes (if any) the variants are located in,
 as well as functional effect such as intron, coding-synonymous, missense, frameshift, etc.
 </p><p>
 Please refer to our searchable
 <A HREF="https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!search/download+snps"
 target=_blank>mailing list archives</a>
 for more questions and example queries, or our
 <a HREF="../FAQ/FAQdownloads.html#download36" target=_blank>Data Access FAQ</a>
 for more information.
 </p>
 <h2>References</h2>
 <p>
 Holmes JB, Moyer E, Phan L, Maglott D, Kattman B.
 <a href="https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btz856"