895644af226dd57285d4b0469da87dc34df76e5e angie Thu Oct 24 13:55:25 2019 -0700 trackDb and descriptions for new tracks dbSnp152* and dbSnp153*. tagTypes changes for bigDbSnp type (and subsequent big* types). refs #23283 diff --git src/hg/makeDb/trackDb/human/dbSnp152.shared.html src/hg/makeDb/trackDb/human/dbSnp152.shared.html new file mode 100644 index 0000000..e88a07f --- /dev/null +++ src/hg/makeDb/trackDb/human/dbSnp152.shared.html @@ -0,0 +1,324 @@ +<h2>Interpreting and Configuring the Graphical Display</h2> +<p> +SNVs and pure deletions are displayed as boxes covering the affected base(s). +Pure insertions are drawn as single-pixel tickmarks between +the base before and the base after the insertion. +</p><p> +Insertions and/or deletions in repetitive regions may be represented by a half-height box +showing uncertainty in placement, followed by a full-height box showing the number of deleted +bases, or a full-height tickmark to indicate an insertion. +When an insertion or deletion falls in a repetitive region, the placement may be ambiguous. +For example, if the reference genome contains "TAAAG" but some +individuals have "TAAG" at the same location, then the variant is a deletion of a single +A relative to the reference genome. +However, which A was deleted? There is no way to tell whether the first, second or third A +was removed. +Different variant mapping tools may place the deletion at different bases in the reference genome. +In order to reduce errors in merging variant calls made with different left vs. right biases, +dbSNP made a major change in its representation of deletion/insertion variants in build 152. +Now, instead of assigning a single-base genomic location at one of the A's, +dbSNP expands the coordinates to encompass the whole repetitive region, +so the variant is represented as a deletion of 3 A's combined with an insertion of 2 A's. 
+In the track display, there will be a half-height box covering the first two A's, +followed by a full-height box covering the third A, in order to show a net loss of one base +but an uncertain placement within the three A's. +</p> +<p> +Variants are colored according to functional effect on genes annotated by dbSNP. +Protein-altering variants and splice site variants are +red, +synonymous codon variants are +green, +and non-coding transcript or Untranslated Region (UTR) variants are +blue. +</p> +<p> +On the track controls page, several variant properties can be included or excluded from +the item labels: +rs# identifier assigned by dbSNP, +reference/alternate alleles, +major/minor alleles (when available) and +minor allele frequency (when available). +Allele frequencies are reported independently by nine projects, as described by dbSNP: + <ul> + <li><a href="https://www.internationalgenome.org/" target=_blank>1000Genomes</a>: + The 1000 Genomes dataset contains data for 2,504 individuals from 26 populations. + </li> + <li><a href="https://gnomad.broadinstitute.org/" target=_blank>GnomAD_exomes</a>: + The GnomAD exome data set (release v2.1). + </li> + <li><a href="https://www.nhlbiwgs.org/" target=_blank>TOPMED</a>: + The TOPMED dataset contains phase 3 data from the freeze 5 panel, which includes over 60,000 + individuals. The approximate ethnic breakdown is European (52%), African (31%), + Hispanic or Latino (10%), and East Asian (7%) ancestry. + </li> + <li><a href="https://gnomad.broadinstitute.org/" target=_blank>GnomAD</a>: + gnomAD v2.1 comprises a total of 16 million SNVs and 1.2 million indels from 125,748 exomes, + and 229 million SNVs and 33 million indels from 15,708 genomes. In addition to the 7 populations + already present in gnomAD 2.0.2, this release now breaks down the non-Finnish European + and East Asian populations further into sub-populations.
+ </li> + <li><a href="http://exac.broadinstitute.org/" target=_blank>ExAC</a>: + The Exome Aggregation Consortium (ExAC) dataset contains 60,706 unrelated individuals + sequenced as part of various disease-specific and population genetic studies. + Individuals affected by severe pediatric disease have been removed. + </li> + <li><a href="https://esp.gs.washington.edu/" target=_blank>GoESP</a>: + The NHLBI Grand Opportunity Exome Sequencing Project (GO-ESP) dataset contains 6503 samples + drawn from multiple ESP cohorts and represents all of the ESP exome variant data. + </li> + <li><a href="http://www.bris.ac.uk/alspac/participants/genome/" target=_blank>ALSPAC</a>: + The UK10K - Avon Longitudinal Study of Parents and Children project contains 1927 samples, + including individuals obtained from the + <a href="http://www.bristol.ac.uk/alspac/" target=_blank>ALSPAC population</a>. + This population contains more than 14,000 mothers enrolled during pregnancy in 1991 and 1992. + </li> + <li><a href="https://twinsuk.ac.uk/" target=_blank>TWINSUK</a>: + The UK10K - TwinsUK project contains 1854 samples from the + <a href="http://www.twinsuk.ac.uk/" target=_blank>Department of Twin Research and + Genetic Epidemiology (DTR)</a>. + The dataset contains data obtained from the 11,000 identical and non-identical twins + between the ages of 16 and 85 years old. + </li> + <li><a href="https://www.geenivaramu.ee/en" target=_blank>Estonian</a>: + Genetic variation in the Estonian population: pharmacogenomics study of + adverse drug effects using electronic health records. + </li> + </ul> +The project from which to take allele frequency data defaults to 1000 Genomes +but can be set to any of these projects. +</p> +<p> +Using the track controls, variants can be filtered by + + <ul> + <li>minimum minor allele frequency (MAF) + </li> + <li>variation class/type (e.g. SNV, insertion, deletion) + </li> + <li>functional effect on a gene (e.g.
synonymous, frameshift, intron, upstream) + </li> + <li>assorted features and anomalies noted by UCSC during processing of dbSNP's data. + </li> + </ul> +</p> + +<a name="ucscNotes"></a> +<h3>Interesting and anomalous conditions noted by UCSC</h3> +<p> +While processing the information downloaded from dbSNP, +UCSC annotates some properties of interest. +These are noted on the item details page, +and may be useful as filters to include or exclude affected variants. +Some are purely informational: +</p> +<table class="descTbl"> + <tr><th>keyword in data file (dbSnp152.bb)</th> + <th># in hg19</th><th># in hg38</th><th>description</th></tr> + <tr> + <td>clinvar</td> + <td class="number">409132</td> + <td class="number">408665</td> + <td>Variant is in ClinVar.</td> + </tr> + <tr> + <td>commonAll</td> + <td class="number">12757487</td> + <td class="number">13027110</td> + <td>Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in all + projects reporting frequencies.</td> + </tr> + <tr> + <td>commonSome</td> + <td class="number">18901486</td> + <td class="number">19258751</td> + <td>Variant is "common", i.e.
has a Minor Allele Frequency of at least 1% in some, but not all, + projects reporting frequencies.</td> + </tr> + <tr> + <td>diffMajor</td> + <td class="number">823424</td> + <td class="number">836327</td> + <td>Different frequency sources have different major alleles.</td> + </tr> + <tr> + <td>overlapDiffClass</td> + <td class="number">99618012</td> + <td class="number">102260850</td> + <td>This variant overlaps another variant with a different type/class.</td> + </tr> + <tr> + <td>overlapSameClass</td> + <td class="number">14790469</td> + <td class="number">15075710</td> + <td>This variant overlaps another with the same type/class but different start/end.</td> + </tr> + <tr> + <td>revStrand</td> + <td class="number">3761191</td> + <td class="number">4439534</td> + <td>The orientation of the currently viewed reference genome sequence is different from + the orientation of dbSNP's preferred top-level assembly sequence; + alleles are presented on the forward strand of the currently viewed reference sequence.</td> + </tr> +</table> +<p> +while others may indicate that the reference genome contains a rare variant or sequencing issue: +</p> +<table class="descTbl"> + <tr><th>keyword in data file (dbSnp152.bb)</th> + <th># in hg19</th><th># in hg38</th><th>description</th></tr> + <tr> + <td>refIsAmbiguous</td> + <td class="number">101</td> + <td class="number">110</td> + <td>The reference genome allele contains an IUPAC ambiguous base + (e.g. 'R' for 'A or G', or 'N' for 'any base').</td> + </tr> + <tr> + <td>refIsMinor</td> + <td class="number">2933684</td> + <td class="number">3033691</td> + <td>The reference genome allele is not the major allele in at least one project.</td> + </tr> + <tr> + <td>refIsRare</td> + <td class="number">150892</td> + <td class="number">189809</td> + <td>The reference genome allele is rare (i.e. 
allele frequency &lt; 1%).</td> + </tr> + <tr> + <td>refIsSingleton</td> + <td class="number">45618</td> + <td class="number">63804</td> + <td>The reference genome allele has never been observed in a population sequencing project + reporting frequencies.</td> + </tr> + <tr> + <td>refMismatch</td> + <td class="number">4</td> + <td class="number">33</td> + <td>The reference genome allele reported by dbSNP differs from the GenBank assembly sequence. + This is very rare; in all cases observed so far, the GenBank assembly has an 'N' + while the RefSeq assembly used by dbSNP has a less ambiguous character such as 'R'.</td> + </tr> +</table> +<p> +and others may indicate an anomaly or problem with the variant data: +</p> +<table class="descTbl"> + <tr><th>keyword in data file (dbSnp152.bb)</th> + <th># in hg19</th><th># in hg38</th><th>description</th></tr> + <tr> + <td>altIsAmbiguous</td> + <td class="number">10680</td> + <td class="number">10807</td> + <td>At least one alternate allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G').
+ For alleles containing more than one ambiguous base, this may create a + combinatorial explosion of possible alleles.</td> + </tr> + <tr> + <td>classMismatch</td> + <td class="number">4808</td> + <td class="number">5103</td> + <td>Variation class/type is inconsistent with alleles mapped to this genome assembly.</td> + </tr> + <tr> + <td>clusterError</td> + <td class="number">106941</td> + <td class="number">94310</td> + <td>This variant has the same start, end and class as another variant; + they probably should have been merged into one variant.</td> + </tr> + <tr> + <td>freqIsAmbiguous</td> + <td class="number">7635</td> + <td class="number">7636</td> + <td>At least one allele reported by at least one project that reports frequencies + contains an IUPAC ambiguous base.</td> + </tr> + <tr> + <td>freqNotRefAlt</td> + <td class="number">23027</td> + <td class="number">36306</td> + <td>At least one allele reported by at least one project that reports frequencies + does not match any of the reference or alternate alleles listed by dbSNP.</td> + </tr> + <tr> + <td>multiMap</td> + <td class="number">555144</td> + <td class="number">130175</td> + <td>This variant has been mapped to more than one distinct genomic location.</td> + </tr> +</table> + + +<h2>Data Sources and Methods</h2> +<p> +dbSNP has collected genetic variant reports from researchers worldwide for +<a href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/10/07/dbsnp-celebrates-20-years/" + target=_blank>over 20 years</a>. +Since the advent of next-generation sequencing methods and the population sequencing efforts +that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline, +web infrastructure, and download files +(Holmes <em>et al.</em>). +The same challenges of exponential growth affected UCSC's presentation of dbSNP variants, +so we have taken the opportunity to change our internal representation and import pipeline.
+Most notably, flanking sequences are no longer provided by dbSNP, +since most submissions have been genomic variant calls in VCF format as opposed to +independent sequences. +</p> +<p> +We downloaded dbSNP's JSON files available from +<a href="ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b152/JSON/" +target=_blank>ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b152/JSON/</a>, +extracted a subset of the information about each variant, and collated +it into a bigBed file using the +<a href="https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/bigDbSnp.as" +target=_blank>bigDbSnp.as</a> schema with the information +necessary for filtering and displaying the variants, +and into a separate file containing more detailed information to be +displayed on each variant's details page +(<a href="https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/dbSnpDetails.as" +target=_blank>dbSnpDetails.as</a> schema).</p> + +<h2>Data Access</h2> +<p> +The raw data underlying the UCSC Genome Browser track can be explored interactively with the +<a href="../../cgi-bin/hgTables" target=_blank>Table Browser</a>, +<a href="../../cgi-bin/hgIntegrator" target=_blank>Data Integrator</a>, +or <a href="../../cgi-bin/hgVai" target=_blank>Variant Annotation Integrator</a>. +For automated analysis, the track data files can be downloaded from the downloads server for +<a href="http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/" target=_blank>hg38</a> and +<a href="http://hgdownload.soe.ucsc.edu/gbdb/hg19/snp/" target=_blank>hg19</a> +(dbSnp152.bb); the detailed variant properties can be downloaded from +<a href="http://hgdownload.soe.ucsc.edu/gbdb/hgFixed/dbSnp/" target=_blank>hgFixed</a> +(dbSnp152Details.tab.gz).
+</p><p> +Please refer to our +<a href="https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!search/download+snps" +target=_blank>mailing list archives</a> +for questions and example queries, or our +<a href="../FAQ/FAQdownloads.html#download36" target=_blank>Data Access FAQ</a> +for more information. +</p> + +<h2>References</h2> + +<p> +<a href="https://www.biorxiv.org/content/10.1101/537449v3.full" target="_blank">https://www.biorxiv.org/content/10.1101/537449v3.full</a> +</p> + +<p> +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/30395293" target="_blank">30395293</a>; +PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323993/" target="_blank">PMC6323993</a> +</p> + +<p> +Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. +<a href="https://academic.oup.com/nar/article/29/1/308/1116004/dbSNP-the-NCBI-database-of- +genetic-variation" target="_blank">dbSNP: the NCBI database of genetic variation</a>. +<em>Nucleic Acids Res</em>. 2001 Jan 1;29(1):308-11. +PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/11125122" target="_blank">11125122</a>; +PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29783/" target="_blank">PMC29783</a> +</p>