895644af226dd57285d4b0469da87dc34df76e5e angie Thu Oct 24 13:55:25 2019 -0700 trackDb and descriptions for new tracks dbSnp152* and dbSnp153*. tagTypes changes for bigDbSnp type (and subsequent big* types). refs #23283 diff --git src/hg/makeDb/trackDb/human/dbSnp153Composite.html src/hg/makeDb/trackDb/human/dbSnp153Composite.html new file mode 100644 index 0000000..e34d5b8 --- /dev/null +++ src/hg/makeDb/trackDb/human/dbSnp153Composite.html @@ -0,0 +1,392 @@ +
+This track shows short genetic variants +(up to approximately 50 base pairs) from +dbSNP +build 153: +single-nucleotide variants (SNVs), +small insertions, deletions, and complex deletion/insertions, +relative to the reference genome assembly. +Most variants in dbSNP are rare, not true polymorphisms, +and some variants are known to be pathogenic. +
+For hg38 (GRCh38), approximately 667 million distinct variants +(RefSNP clusters with rs# ids) +have been mapped to over 702 million genomic locations +including alternate haplotype and fix patch sequences. +dbSNP remapped variants from hg38 to hg19 (GRCh37); +approximately 658 million distinct variants were mapped to +over 683 million genomic locations +including alternate haplotype and fix patch sequences (not +all of which are included in UCSC's hg19). +
++This track includes four subtracks: +
+SNVs and pure deletions are displayed as boxes covering the affected base(s). +Pure insertions are drawn as single-pixel tickmarks between +the base before and the base after the insertion. +
+Insertions and/or deletions in repetitive regions may be represented by a half-height box +showing uncertainty in placement, followed by a full-height box showing the number of deleted +bases, or a full-height tickmark to indicate an insertion. +When an insertion or deletion falls in a repetitive region, the placement may be ambiguous. +For example, if the reference genome contains "TAAAG" but some +individuals have "TAAG" at the same location, then the variant is a deletion of a single +A relative to the reference genome. +However, which A was deleted? There is no way to tell whether the first, second or third A +was removed. +Different variant mapping tools may place the deletion at different bases in the reference genome. +In order to reduce errors in merging variant calls made with different left vs. right biases, +dbSNP made a major change in its representation of deletion/insertion variants in build 152. +Now, instead of assigning a single-base genomic location at one of the A's, +dbSNP expands the coordinates to encompass the whole repetitive region, +so the variant is represented as a deletion of 3 A's combined with an insertion of 2 A's. +In the track display, there will be a half-height box covering the first two A's, +followed by a full-height box covering the third A, in order to show a net loss of one base +but an uncertain placement within the three A's. +
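The "TAAAG" example above can be worked through in a few lines of Python. This is only an illustrative sketch of the ambiguity, not dbSNP's or UCSC's actual code; the function names and the 0-based, half-open coordinates are assumptions made for the example.

```python
from typing import Tuple


def ambiguous_deletion_span(ref: str, start: int, length: int) -> Tuple[int, int]:
    """Return the 0-based half-open span [lo, hi) of reference bases over which
    a deletion of ref[start:start+length] can slide without changing the result.
    (Hypothetical helper, for illustration only.)"""
    # Slide left: deleting ref[lo-1:hi-1] is equivalent when the base entering
    # on the left equals the base leaving on the right.
    lo, hi = start, start + length
    while lo > 0 and ref[lo - 1] == ref[hi - 1]:
        lo -= 1
        hi -= 1
    leftmost = lo
    # Slide right from the original placement by the mirrored rule.
    lo, hi = start, start + length
    while hi < len(ref) and ref[lo] == ref[hi]:
        lo += 1
        hi += 1
    return leftmost, hi


def expanded_representation(ref: str, start: int, length: int) -> Tuple[int, int, str, str]:
    """Expand a deletion to the whole ambiguous region, returning
    (lo, hi, reference allele, alternate allele)."""
    lo, hi = ambiguous_deletion_span(ref, start, length)
    ref_allele = ref[lo:hi]
    # The alternate allele is the region with `length` bases removed.
    alt_allele = ref_allele[: len(ref_allele) - length]
    return lo, hi, ref_allele, alt_allele


if __name__ == "__main__":
    # Deleting any one of the three A's in TAAAG gives the same expanded variant.
    for start in (1, 2, 3):
        print(start, expanded_representation("TAAAG", start, 1))
```

Whichever A is chosen as the starting placement, the expanded variant covers the same three bases: the number of bases in the region minus the number deleted (here 2) corresponds to the half-height box, and the number deleted (here 1) to the full-height box.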
++Variants are colored according to functional effect on genes annotated by dbSNP. +Protein-altering variants and splice site variants are +red, +synonymous codon variants are +green, +and non-coding transcript or Untranslated Region (UTR) variants are +blue. +
++On the track controls page, several variant properties can be included or excluded from +the item labels: +rs# identifier assigned by dbSNP, +reference/alternate alleles, +major/minor alleles (when available) and +minor allele frequency (when available). +Allele frequencies are reported independently by twelve projects, as described by dbSNP: +
+Using the track controls, variants can be filtered by + +
+While processing the information downloaded from dbSNP, +UCSC annotates some properties of interest. +These are noted on the item details page, +and may be useful to include or exclude affected variants. +Some are purely informational: +
+keyword in data file (dbSnp152.bb) | +# in hg19 | # in hg38 | description |
---|---|---|---|
clinvar | +454656 | +453954 | +Variant is in ClinVar. | +
commonAll | +12178426 | +12430253 | +Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in all + projects reporting frequencies. | +
commonSome | +20534330 | +20893174 | +Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in some, but not all, + projects reporting frequencies. | +
diffMajor | +3522349 | +3573503 | +Different frequency sources have different major alleles. | +
overlapDiffClass | +106940656 | +109838613 | +This variant overlaps another variant with a different type/class. | +
overlapSameClass | +16890303 | +17228657 | +This variant overlaps another with the same type/class but different start/end. | +
revStrand | +3813390 | +4512600 | +The orientation of the currently viewed reference genome sequence is different from + the orientation of dbSNP's preferred top-level assembly sequence; + alleles are presented on the forward strand of the currently viewed reference sequence. | +
+while others may indicate that the reference genome contains a rare variant or sequencing issue: +
+keyword in data file (dbSnp152.bb) | +# in hg19 | # in hg38 | description |
---|---|---|---|
refIsAmbiguous | +101 | +111 | +The reference genome allele contains an IUPAC ambiguous base + (e.g. 'R' for 'A or G', or 'N' for 'any base'). | +
refIsMinor | +16032028 | +16277729 | +The reference genome allele is not the major allele in at least one project. | +
refIsRare | +142937 | +166192 | +The reference genome allele is rare (i.e. allele frequency < 1%). | +
refIsSingleton | +44382 | +56491 | +The reference genome allele has never been observed in a population sequencing project + reporting frequencies. | +
refMismatch | +4 | +33 | +The reference genome allele reported by dbSNP differs from the GenBank assembly sequence. + This is very rare and in all cases observed so far, the GenBank assembly has an 'N' + while the RefSeq assembly used by dbSNP has a less ambiguous character such as 'R'. | +
+and others may indicate an anomaly or problem with the variant data: +
+keyword in data file (dbSnp152.bb) | +# in hg19 | # in hg38 | description |
---|---|---|---|
altIsAmbiguous | +10747 | +10873 | +At least one alternate allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G'). + For alleles containing more than one ambiguous base, this may create a + combinatoric explosion of possible alleles. | +
classMismatch | +5701 | +5864 | +Variation class/type is inconsistent with alleles mapped to this genome assembly. | +
clusterError | +113678 | +126973 | +This variant has the same start, end and class as another variant; + they probably should have been merged into one variant. | +
freqIsAmbiguous | +7649 | +7749 | +At least one allele reported by at least one project that reports frequencies + contains an IUPAC ambiguous base. | +
freqNotRefAlt | +25413 | +39038 | +At least one allele reported by at least one project that reports frequencies + does not match any of the reference or alternate alleles listed by dbSNP. | +
multiMap | +561309 | +132015 | +This variant has been mapped to more than one distinct genomic location. | +
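As a rough illustration of how the frequency-based keywords in the tables above relate to per-project allele frequencies, the following sketch derives commonAll, commonSome, diffMajor and refIsMinor from a small dictionary. It is not UCSC's pipeline code: the function name, the input layout, the project names, and the choice to treat the second-highest allele frequency as the Minor Allele Frequency are all assumptions made for the example.

```python
from typing import Dict, Set


def frequency_notes(ref: str,
                    project_freqs: Dict[str, Dict[str, float]],
                    threshold: float = 0.01) -> Set[str]:
    """Derive a few frequency-based keywords for one variant.

    project_freqs maps a (hypothetical) project name to {allele: frequency}.
    """
    notes: Set[str] = set()
    if not project_freqs:
        return notes
    majors = set()
    is_common = []
    for alleles in project_freqs.values():
        # Rank alleles by frequency, highest first.
        ranked = sorted(alleles.items(), key=lambda kv: kv[1], reverse=True)
        majors.add(ranked[0][0])
        # Assumption: MAF is the second-highest allele frequency in the project.
        maf = ranked[1][1] if len(ranked) > 1 else 0.0
        is_common.append(maf >= threshold)
    if all(is_common):
        notes.add("commonAll")
    elif any(is_common):
        notes.add("commonSome")
    if len(majors) > 1:
        notes.add("diffMajor")
    if any(major != ref for major in majors):
        notes.add("refIsMinor")
    return notes


if __name__ == "__main__":
    example = {
        "projectA": {"A": 0.4, "G": 0.6},
        "projectB": {"A": 0.8, "G": 0.2},
    }
    print(sorted(frequency_notes("A", example)))
```

Ties between allele frequencies are not handled specially in this sketch.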
+dbSNP has collected genetic variant reports from researchers worldwide for +over 20 years. +Since the advent of next-generation sequencing methods and the population sequencing efforts +that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline, +web infrastructure, and download files +(Holmes et al.). +The same challenges of exponential growth affected UCSC's presentation of dbSNP variants, +so we have taken the opportunity to change our internal representation and import pipeline. +Most notably, flanking sequences are no longer provided by dbSNP, +since most submissions have been genomic variant calls in VCF format as opposed to +independent sequences. +
++We downloaded dbSNP's JSON files available from +ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b153/JSON/, +extracted a subset of the information about each variant, and collated +it into a bigBed file using the +bigDbSnp.as schema with the information +necessary for filtering and displaying the variants, +as well as a separate file containing more detailed information to be +displayed on each variant's details page +(dbSnpDetails.as schema). + +
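The first step described above, reading the JSON download and pulling out a subset of each record, might look roughly like the sketch below. It is not the pipeline UCSC used. It assumes that each line of a downloaded file holds one JSON object with a top-level "refsnp_id" key and that the files are bzip2-compressed; the function names are hypothetical.

```python
import bz2
import json
from typing import Iterable, Iterator


def iter_refsnp_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield rs# identifiers from an iterable of JSON lines.

    Assumes one JSON object per line with a top-level "refsnp_id" key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        refsnp_id = record.get("refsnp_id")
        if refsnp_id is not None:
            yield f"rs{refsnp_id}"


def iter_refsnp_ids_from_file(path: str) -> Iterator[str]:
    """Stream identifiers from a bzip2-compressed file without loading it whole."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_refsnp_ids(handle)


if __name__ == "__main__":
    sample = ['{"refsnp_id": "123"}', "", '{"refsnp_id": "456"}']
    print(list(iter_refsnp_ids(sample)))
```

Processing the file line by line keeps memory use flat, which matters for downloads of this size.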
+The raw data underlying the UCSC Genome Browser track can be explored interactively with the +Table Browser, +Data Integrator, +or Variant Annotation Integrator. +For automated analysis, the track data files can be downloaded from the downloads server for +hg38 and +hg19 +(dbSnp153.bb); the detailed variant properties can be downloaded from +hgFixed +(dbSnp153Details.tab.gz). +
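For automated analysis of the downloaded bigBed file, one possible approach is to convert a region to tab-separated text (for example with UCSC's bigBedToBed utility) and then filter rows by the keywords described above. The sketch below covers only the filtering step. The column position of the comma-separated notes field (index 14, 0-based) is an assumption based on the bigDbSnp.as schema and should be checked against that file; the function name and the sample row are hypothetical.

```python
from typing import Iterable, Iterator, Set


def filter_by_notes(rows: Iterable[str],
                    wanted: Set[str],
                    notes_column: int = 14) -> Iterator[str]:
    """Yield tab-separated rows whose notes field contains any wanted keyword.

    notes_column is an assumed 0-based position of the notes field.
    """
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) <= notes_column:
            continue  # malformed or truncated row
        # The notes field is treated as a comma-separated list; drop empty entries.
        notes = {note for note in fields[notes_column].split(",") if note}
        if notes & wanted:
            yield row


if __name__ == "__main__":
    # Made-up row with 17 columns, for illustration only.
    row = "\t".join(["chr1", "100", "101", "rs1", "A", "1", "G,", "0", "0",
                     "", "", "", "0", "snv", "clinvar,commonAll,", "0", "0"])
    print(list(filter_by_notes([row], {"clinvar"})))
```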
+Please refer to our +mailing list archives +for questions and example queries, or our +Data Access FAQ +for more information. +
+ ++https://www.biorxiv.org/content/10.1101/537449v3.full +
+ ++https://www.ncbi.nlm.nih.gov/pubmed/30395293 +PMC: PMC6323993 +
+ ++Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. +dbSNP: the NCBI database of genetic variation. +Nucleic Acids Res. 2001 Jan 1;29(1):308-11. +PMID: 11125122; +PMC: PMC29783 +
+