src/hg/makeDb/trackDb/human/dbSnp153Composite.html b23bfd1da3c9926158cd96b7f73bf6b889c3d424

b23bfd1da3c9926158cd96b7f73bf6b889c3d424
angie
  Thu Oct 31 14:06:35 2019 -0700
Minor wording tweak, refs #23283

diff --git src/hg/makeDb/trackDb/human/dbSnp153Composite.html src/hg/makeDb/trackDb/human/dbSnp153Composite.html
index d74342c..1666146 100644
--- src/hg/makeDb/trackDb/human/dbSnp153Composite.html
+++ src/hg/makeDb/trackDb/human/dbSnp153Composite.html
@@ -1,424 +1,423 @@
 <h2>Description</h2>
 <p>
 This track shows short genetic variants
 (up to approximately 50 base pairs) from
 <A HREF="https://www.ncbi.nlm.nih.gov/SNP/" target=_blank>dbSNP</A>
 build 153:
 single-nucleotide variants (SNVs),
 small insertions, deletions, and complex deletion/insertions,
 relative to the reference genome assembly.
 Most variants in dbSNP are rare, not true polymorphisms,
 and some variants are known to be pathogenic.
 </p><p>
 For hg38 (GRCh38), approximately 667 million distinct variants
 (RefSNP clusters with rs# ids)
 have been mapped to over 702 million genomic locations
 including alternate haplotype and fix patch sequences.
 dbSNP remapped variants from hg38 to hg19 (GRCh37);
 approximately 658 million distinct variants were mapped to
 over 683 million genomic locations
 including alternate haplotype and fix patch sequences (not
 all of which are included in UCSC's hg19).
 </p>
 <p>
 This track includes four subtracks:
   <ul>
     <li><b>All dbSNP (153)</b>: the entire set (683 million for hg19, 702 million for hg38)
     </li>
     <li><b>Common dbSNP (153)</b>: approximately 15 million variants with a minor allele
       frequency (MAF) of at least 1% (0.01) in the 1000 Genomes Phase 3 dataset.
       Variants in the Mult. subset (below) are excluded.
     </li>
     <li><b>ClinVar dbSNP (153)</b>: approximately 455 thousand variants mentioned in ClinVar.
       <b>Note:</b> this includes both benign and pathogenic (as well as uncertain) variants.
       Variants in the Mult. subset (below) are excluded.
     </li>
     <li><b>Mult. dbSNP (153)</b>: variants that have been mapped to multiple chromosomes,
       for example chr1 and chr2,
       raising the question of whether the variant is really a variant or just a difference
       between duplicated sequences.
       There are some exceptions in which a variant is mapped to more than one reference
       sequence, but not culled into this set:
       <ul>
         <li>A variant may appear in both X and Y
           pseudo-autosomal regions (PARs) without being included in this set.
         </li>
         <li>A variant may also appear in a main chromosome as well as an alternate haplotype
           or fix patch sequence assigned to that chromosome.
         </li>
       </ul>
     </li>
   </ul>
 </p>
 
 <h2>Interpreting and Configuring the Graphical Display</h2>
 <p>
 SNVs and pure deletions are displayed as boxes covering the affected base(s).
 Pure insertions are drawn as single-pixel tickmarks between
 the base before and the base after the insertion.
 </p><p>
 Insertions and/or deletions in repetitive regions may be represented by a half-height box
 showing uncertainty in placement, followed by a full-height box showing the number of deleted
 bases, or a full-height tickmark to indicate an insertion.
 When an insertion or deletion falls in a repetitive region, the placement may be ambiguous.
 For example, if the reference genome contains "TAAAG" but some
 individuals have "TAAG" at the same location, then the variant is a deletion of a single
 A relative to the reference genome.
 However, which A was deleted?  There is no way to tell whether the first, second or third A
 was removed.
 Different variant mapping tools may place the deletion at different bases in the reference genome.
 In order to reduce errors in merging variant calls made with different left vs. right biases,
 dbSNP made a major change in its representation of deletion/insertion variants in build 152.
 Now, instead of assigning a single-base genomic location at one of the A's,
 dbSNP expands the coordinates to encompass the whole repetitive region,
 so the variant is represented as a deletion of 3 A's combined with an insertion of 2 A's.
 In the track display, there will be a half-height box covering the first two A's,
 followed by a full-height box covering the third A, in order to show a net loss of one base
 but an uncertain placement within the three A's.
 </p>
 <p>
 Variants are colored according to functional effect on genes annotated by dbSNP.
 Protein-altering variants and splice site variants are
 red,
 synonymous codon variants are
 green,
 and non-coding transcript or Untranslated Region (UTR) variants are
 blue.
 </p>
 <p>
 On the track controls page, several variant properties can be included or excluded from
 the item labels:
 rs# identifier assigned by dbSNP,
 reference/alternate alleles,
 major/minor alleles (when available) and
 minor allele frequency (when available).
 Allele frequencies are reported independently by twelve projects, as described by dbSNP:
   <ul>
     <li><a href="https://www.internationalgenome.org/" target=_blank>1000Genomes</a>:
       The 1000 Genomes dataset contains data for 2,504 individuals from 26 populations.
     </li>
     <li><a href="https://gnomad.broadinstitute.org/" target=_blank>GnomAD_exomes</a>:
       The GnomAD exome data set (release v2.1).
     </li>
     <li><a href="https://www.nhlbiwgs.org/" target=_blank>TOPMED</a>:
       The TOPMED dataset contains phase 3 data from the freeze 5 panel, which includes over 60,000
       individuals. The approximate ethnic breakdown is European (52%), African (31%),
       Hispanic or Latino (10%), and East Asian (7%) ancestry.
     </li>
     <li><a href="http://exac.broadinstitute.org/" target=_blank>ExAC</a>:
       The Exome Aggregation Consortium (ExAC) dataset contains 60,706 unrelated individuals
       sequenced as part of various disease-specific and population genetic studies.
       Individuals affected by severe pediatric disease have been removed.
     </li>
     <li><a href="https://www.pagestudy.org/" target=_blank>PAGE_STUDY</a>:
       The PAGE Study: How Genetic Diversity Improves Our Understanding of the Architecture of
       Complex Traits.
     </li>
     <li><a href="https://gnomad.broadinstitute.org/" target=_blank>GnomAD</a>:
       gnomAD v2.1 comprises a total of 16 million SNVs and 1.2 million indels from 125,748 exomes,
       and 229 million SNVs and 33 million indels from 15,708 genomes. In addition to the 7 populations
       already present in gnomAD 2.0.2, this release now breaks down the non-Finnish Europeans
       and East Asian populations further into sub-populations.
     </li>
     <li><a href="https://esp.gs.washington.edu/" target=_blank>GoESP</a>:
       The NHLBI Grand Opportunity Exome Sequencing Project (GO-ESP) dataset contains 6503 samples
       drawn from multiple ESP cohorts and represents all of the ESP exome variant data.
     </li>
     <li><a href="https://www.geenivaramu.ee/en" target=_blank>Estonian</a>:
       Genetic variation in the Estonian population: pharmacogenomics study of
       adverse drug effects using electronic health records.
     </li>
     <li><a href="http://www.bris.ac.uk/alspac/participants/genome/" target=_blank>ALSPAC</a>:
       The UK10K - Avon Longitudinal Study of Parents and Children project contains 1927 samples
       including individuals obtained from the
       <a href="http://www.bristol.ac.uk/alspac/" target=_blank>ALSPAC population</a>.
       This population contains more than 14,000 mothers enrolled during pregnancy in 1991 and 1992.
     </li>
     <li><a href="https://twinsuk.ac.uk/" target=_blank>TWINSUK</a>:
       The UK10K - TwinsUK project contains 1854 samples from the
       <a href="http://www.twinsuk.ac.uk/" target=_blank>Department of Twin Research and
       Genetic Epidemiology (DTR)</a>.
       The dataset contains data obtained from the 11,000 identical and non-identical twins
       between the ages of 16 and 85 years old.
     </li>
     <li><a href="https://swefreq.nbis.se/dataset/SweGen" target=_blank>NorthernSweden</a>:
       Whole-genome sequenced control population in northern Sweden reveals subregional
       genetic differences.  This population consists of 300 whole genome sequenced human samples
       selected from the county of Vasterbotten in northern Sweden. To be selected for inclusion
       into the population, the individuals had to have reached at least 80 years of age and have
       no diagnosed cancer.
     </li>
     <li><a href="https://genomes.vn" target=_blank>Vietnamese</a>:
       A Vietnamese Genetic Variation Database.
     </li>
   </ul>
 The project from which to take allele frequency data defaults to 1000 Genomes
 but can be set to any of those projects.
 </p>
 <p>
 Using the track controls, variants can be filtered by
 
   <ul>
     <li>minimum minor allele frequency (MAF)
     </li>
     <li>variation class/type (e.g. SNV, insertion, deletion)
     </li>
     <li>functional effect on a gene (e.g. synonymous, frameshift, intron, upstream)
     </li>
     <li>assorted features and anomalies noted by UCSC during processing of dbSNP's data.
     </li>
   </ul>
 </p>
 
 <a name="ucscNotes"></a>
 <h3>Interesting and anomalous conditions noted by UCSC</h3>
 <p>
 While processing the information downloaded from dbSNP,
 UCSC annotates some properties of interest.
 These are noted on the item details page,
 and may be useful to include or exclude affected variants.
 Some are purely informational:
 </p>
 <table class="descTbl">
   <tr><th>keyword in data file (dbSnp153.bb)</th>
     <th># in hg19</th><th># in hg38</th><th>description</th></tr>
   <tr>
     <td>clinvar</td>
     <td class="number">454656</td>
     <td class="number">453954</td>
     <td>Variant is in ClinVar.</td>
   </tr>
   <tr>
     <td>clinvarBenign</td>
     <td class="number">143844</td>
     <td class="number">143696</td>
     <td>Variant is in ClinVar with clinical significance of benign and/or likely benign.</td>
   </tr>
   <tr>
     <td>clinvarConflicting</td>
     <td class="number">7932</td>
     <td class="number">7950</td>
     <td>Variant is in ClinVar with reports of both benign and pathogenic significance.</td>
   </tr>
   <tr>
     <td>clinvarPathogenic</td>
     <td class="number">96242</td>
     <td class="number">95262</td>
     <td>Variant is in ClinVar with clinical significance of pathogenic and/or likely pathogenic.</td>
   </tr>
   <tr>
     <td>commonAll</td>
     <td class="number">12178426</td>
     <td class="number">12430253</td>
     <td>Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in all
       projects reporting frequencies.</td>
   </tr>
   <tr>
     <td>commonSome</td>
     <td class="number">20534330</td>
     <td class="number">20893174</td>
     <td>Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in some, but not all,
       projects reporting frequencies.</td>
   </tr>
   <tr>
     <td>diffMajor</td>
     <td class="number">3522349</td>
     <td class="number">3573503</td>
     <td>Different frequency sources have different major alleles.</td>
   </tr>
   <tr>
     <td>overlapDiffClass</td>
     <td class="number">106940656</td>
     <td class="number">109838613</td>
     <td>This variant overlaps another variant with a different type/class.</td>
   </tr>
   <tr>
     <td>overlapSameClass</td>
     <td class="number">16890303</td>
     <td class="number">17228657</td>
     <td>This variant overlaps another with the same type/class but different start/end.</td>
   </tr>
   <tr>
     <td>rareAll</td>
     <td class="number">662571654</td>
     <td class="number">681626796</td>
     <td>Variant is "rare", i.e. has a Minor Allele Frequency of less than 1%
-      in all projects reporting frequencies, or has been reported without frequency data.</td>
+      in all projects reporting frequencies, or has no frequency data.</td>
   </tr>
   <tr>
     <td>rareSome</td>
     <td class="number">670927558</td>
     <td class="number">690089717</td>
     <td>Variant is "rare", i.e. has a Minor Allele Frequency of less than 1%
-      in some, but not all, projects reporting frequencies, or has been reported without
-      frequency data.</td>
+      in some, but not all, projects reporting frequencies, or has no frequency data.</td>
   </tr>
   <tr>
     <td>revStrand</td>
     <td class="number">3813390</td>
     <td class="number">4512600</td>
     <td>The orientation of the currently viewed reference genome sequence is different from
       the orientation of dbSNP's preferred top-level assembly sequence;
       alleles are presented on the forward strand of the currently viewed reference sequence.</td>
   </tr>
 </table>
 <p>
 while others may indicate that the reference genome contains a rare variant or sequencing issue:
 </p>
 <table class="descTbl">
   <tr><th>keyword in data file (dbSnp153.bb)</th>
     <th># in hg19</th><th># in hg38</th><th>description</th></tr>
   <tr>
     <td>refIsAmbiguous</td>
     <td class="number">101</td>
     <td class="number">111</td>
     <td>The reference genome allele contains an IUPAC ambiguous base
       (e.g. 'R' for 'A or G', or 'N' for 'any base').</td>
   </tr>
   <tr>
     <td>refIsMinor</td>
     <td class="number">16032028</td>
     <td class="number">16277729</td>
     <td>The reference genome allele is not the major allele in at least one project.</td>
   </tr>
   <tr>
     <td>refIsRare</td>
     <td class="number">142937</td>
     <td class="number">166192</td>
     <td>The reference genome allele is rare (i.e. allele frequency &lt; 1%).</td>
   </tr>
   <tr>
     <td>refIsSingleton</td>
     <td class="number">44382</td>
     <td class="number">56491</td>
     <td>The reference genome allele has never been observed in a population sequencing project
       reporting frequencies.</td>
   </tr>
   <tr>
     <td>refMismatch</td>
     <td class="number">4</td>
     <td class="number">33</td>
     <td>The reference genome allele reported by dbSNP differs from the GenBank assembly sequence.
       This is very rare and in all cases observed so far, the GenBank assembly has an 'N'
       while the RefSeq assembly used by dbSNP has a less ambiguous character such as 'R'.</td>
   </tr>
 </table>
 <p>
 and others may indicate an anomaly or problem with the variant data:
 </p>
 <table class="descTbl">
   <tr><th>keyword in data file (dbSnp153.bb)</th>
     <th># in hg19</th><th># in hg38</th><th>description</th></tr>
   <tr>
     <td>altIsAmbiguous</td>
     <td class="number">10747</td>
     <td class="number">10873</td>
     <td>At least one alternate allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G').
       For alleles containing more than one ambiguous base, this may create a
       combinatoric explosion of possible alleles.</td>
   </tr>
   <tr>
     <td>classMismatch</td>
     <td class="number">5701</td>
     <td class="number">5864</td>
     <td>Variation class/type is inconsistent with alleles mapped to this genome assembly.</td>
   </tr>
   <tr>
     <td>clusterError</td>
     <td class="number">113678</td>
     <td class="number">126973</td>
     <td>This variant has the same start, end and class as another variant;
       they probably should have been merged into one variant.</td>
   </tr>
   <tr>
     <td>freqIsAmbiguous</td>
     <td class="number">7649</td>
     <td class="number">7749</td>
     <td>At least one allele reported by at least one project that reports frequencies
       contains an IUPAC ambiguous base.</td>
   </tr>
   <tr>
     <td>freqNotRefAlt</td>
     <td class="number">25413</td>
     <td class="number">39038</td>
     <td>At least one allele reported by at least one project that reports frequencies
       does not match any of the reference or alternate alleles listed by dbSNP.</td>
   </tr>
   <tr>
     <td>multiMap</td>
     <td class="number">561309</td>
     <td class="number">132015</td>
     <td>This variant has been mapped to more than one distinct genomic location.</td>
   </tr>
 </table>
 
 
 <h2>Data Sources and Methods</h2>
 <p>
 dbSNP has collected genetic variant reports from researchers worldwide for 
 <a href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/10/07/dbsnp-celebrates-20-years/"
    target=_blank>over 20 years</a>.
 Since the advent of next-generation sequencing methods and the population sequencing efforts
 that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline,
 web infrastructure, and download files
 (Holmes <em>et al.</em>).
 The same challenges of exponential growth affected UCSC's presentation of dbSNP variants,
 so we have taken the opportunity to change our internal representation and import pipeline.
 Most notably, flanking sequences are no longer provided by dbSNP,
 since most submissions have been genomic variant calls in VCF format as opposed to
 independent sequences.
 </p>
 <p>
 We downloaded dbSNP's JSON files available from
 <a href="ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b153/JSON/"
 target=_blank>ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b153/JSON/</a>,
 extracted a subset of the information about each variant, and collated
 it into a bigBed file using the
 <a href="https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/bigDbSnp.as"
 target=_blank>bigDbSnp.as</a> schema with the information
 necessary for filtering and displaying the variants,
 as well as a separate file containing more detailed information to be
 displayed on each variant's details page
 (<a href="https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/lib/dbSnpDetails.as"
 target=_blank>dbSnpDetails.as</a> schema).</p>
 
 <h2>Data Access</h2>
 <p>
 The raw data underlying the UCSC Genome Browser track can be explored interactively with the
 <a href="../../cgi-bin/hgTables" target=_blank>Table Browser</a> or
 <a href="../../cgi-bin/hgIntegrator" target=_blank>Data Integrator</a>.
 For automated analysis, the track data files can be downloaded from the downloads server for
 <a href="http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/" target=_blank>hg38</a> and
 <a href="http://hgdownload.soe.ucsc.edu/gbdb/hg19/snp/" target=_blank>hg19</a>
 (dbSnp153.bb); the detailed variant properties can be downloaded from
 <a href="http://hgdownload.soe.ucsc.edu/gbdb/hgFixed/dbSnp/" target=_blank>hgFixed</a>
 (dbSnp153Details.tab.gz).
 </p><p>
 Please refer to our
 <A HREF="https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&amp;fromgroups#!search/download+snps"
 target=_blank>mailing list archives</a>
 for questions and example queries, or our
 <a HREF="../FAQ/FAQdownloads.html#download36" target=_blank>Data Access FAQ</a>
 for more information.
 </p>
 
 <h2>References</h2>
 
 <p>
 <a href="https://www.biorxiv.org/content/10.1101/537449v3.full" target="_blank">https://www.biorxiv.org/content/10.1101/537449v3.full</a>
 </p>
 
 <p>
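 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/30395293" target="_blank">30395293</a>;
 PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6323993/" target="_blank">PMC6323993</a>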
 </p>
 
 <p>
 Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K.
 <a HREF="https://academic.oup.com/nar/article/29/1/308/1116004/dbSNP-the-NCBI-database-of-genetic-variation"
 target="_blank">dbSNP: the NCBI database of genetic variation</a>.
 <em>Nucleic Acids Res</em>. 2001 Jan 1;29(1):308-11.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/11125122" target="_blank">11125122</a>;
 PMC: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29783/" target="_blank">PMC29783</a>
 </p>
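
Reviewer note on the "Interpreting and Configuring the Graphical Display" text in this file: the TAAAG/TAAG example can be checked with a small sketch. This is not dbSNP's or UCSC's pipeline code. The function name `expand_deletion` and the 0-based, half-open coordinates are assumptions made for illustration. It slides a deletion left and right while the resulting sequence stays the same, then reports the whole ambiguous region together with the deleted and inserted alleles.

```python
def expand_deletion(ref: str, start: int, length: int):
    """Expand a deletion of ref[start:start+length] to its full ambiguous region.

    Hypothetical helper; coordinates are 0-based, half-open (an assumption).
    Returns (region_start, region_end, ref_allele, alt_allele).
    """
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie within the reference sequence")

    # Shift left while deleting one base earlier yields the same sequence.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1

    # Shift right while deleting one base later yields the same sequence.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1

    region_start, region_end = left, right + length
    ref_allele = ref[region_start:region_end]
    # Removing `length` bases anywhere in the region gives the same result,
    # so the alternate allele is the region minus its last `length` bases.
    alt_allele = ref_allele[: len(ref_allele) - length]
    return region_start, region_end, ref_allele, alt_allele


def box_widths(region_start: int, region_end: int, length: int):
    """Return (half_height_bases, full_height_bases) for the display described above."""
    return (region_end - region_start - length, length)


if __name__ == "__main__":
    s, e, ref_allele, alt_allele = expand_deletion("TAAAG", 2, 1)
    print(s, e, ref_allele, alt_allele, box_widths(s, e, 1))
```

For the reference "TAAAG" with any one of the three A's deleted, the sketch yields the region covering all three A's, a deleted allele of "AAA" and an inserted allele of "AA", with two bases of uncertain placement and one base of net loss. That matches the half-height and full-height boxes described in the text.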