895644af226dd57285d4b0469da87dc34df76e5e
angie
Thu Oct 24 13:55:25 2019 -0700
trackDb and descriptions for new tracks dbSnp152* and dbSnp153*. tagTypes changes for bigDbSnp type (and subsequent big* types). refs #23283

diff --git src/hg/makeDb/trackDb/human/dbSnp152.shared.html src/hg/makeDb/trackDb/human/dbSnp152.shared.html
new file mode 100644
index 0000000..e88a07f
--- /dev/null
+++ src/hg/makeDb/trackDb/human/dbSnp152.shared.html
@@ -0,0 +1,324 @@
+
+SNVs and pure deletions are displayed as boxes covering the affected base(s). +Pure insertions are drawn as single-pixel tickmarks between +the base before and the base after the insertion. +
+Insertions and/or deletions in repetitive regions may be represented by a half-height box +showing uncertainty in placement, followed by a full-height box showing the number of deleted +bases, or a full-height tickmark to indicate an insertion. +When an insertion or deletion falls in a repetitive region, the placement may be ambiguous. +For example, if the reference genome contains "TAAAG" but some +individuals have "TAAG" at the same location, then the variant is a deletion of a single +A relative to the reference genome. +However, which A was deleted? There is no way to tell whether the first, second or third A +was removed. +Different variant mapping tools may place the deletion at different bases in the reference genome. +In order to reduce errors in merging variant calls made with different left vs. right biases, +dbSNP made a major change in its representation of deletion/insertion variants in build 152. +Now, instead of assigning a single-base genomic location at one of the A's, +dbSNP expands the coordinates to encompass the whole repetitive region, +so the variant is represented as a deletion of 3 A's combined with an insertion of 2 A's. +In the track display, there will be a half-height box covering the first two A's, +followed by a full-height box covering the third A, in order to show a net loss of one base +but an uncertain placement within the three A's. +
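The placement ambiguity described above can be illustrated with a short sketch. It is not dbSNP's or UCSC's actual code; the function names `deletion_span` and `box_widths` are hypothetical, and coordinates are assumed to be 0-based, half-open. The idea is to slide the deletion left and right for as long as the resulting sequence stays the same, then report the whole region in which it could be placed.

```python
from typing import Tuple


def deletion_span(ref: str, start: int, end: int) -> Tuple[int, int]:
    """Return the half-open range covering every equivalent placement
    of a deletion of ref[start:end] (0-based, half-open coordinates)."""
    lo_start, lo_end = start, end
    # Shift left while the base entering on the left equals the base leaving on the right.
    while lo_start > 0 and ref[lo_start - 1] == ref[lo_end - 1]:
        lo_start -= 1
        lo_end -= 1
    hi_start, hi_end = start, end
    # Shift right while the base leaving on the left equals the base entering on the right.
    while hi_end < len(ref) and ref[hi_start] == ref[hi_end]:
        hi_start += 1
        hi_end += 1
    return lo_start, hi_end


def box_widths(ref: str, start: int, end: int) -> Tuple[int, int]:
    """Return (half_height_bases, full_height_bases) for the display described above:
    the uncertain part of the span, then the number of deleted bases."""
    lo, hi = deletion_span(ref, start, end)
    deleted = end - start
    return (hi - lo) - deleted, deleted


# The example from the text: reference "TAAAG", one A deleted (here the middle one).
print(deletion_span("TAAAG", 2, 3))  # (1, 4): the three A's
print(box_widths("TAAAG", 2, 3))     # (2, 1): two half-height bases, one full-height base
```

Whichever of the three A's is given as the starting placement, the computed span is the same, which is why expanding the coordinates removes the left vs. right bias.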
++Variants are colored according to functional effect on genes annotated by dbSNP. +Protein-altering variants and splice site variants are +red, +synonymous codon variants are +green, +and non-coding transcript or Untranslated Region (UTR) variants are +blue. +
++On the track controls page, several variant properties can be included or excluded from +the item labels: +rs# identifier assigned by dbSNP, +reference/alternate alleles, +major/minor alleles (when available) and +minor allele frequency (when available). +Allele frequencies are reported independently by nine projects, as described by dbSNP: +
+Using the track controls, variants can be filtered by + +
+While processing the information downloaded from dbSNP, +UCSC annotates some properties of interest. +These are noted on the item details page, +and can be used to include or exclude affected variants. +Some are purely informational: +
keyword in data file (dbSnp152.bb) | # in hg19 | # in hg38 | description |
---|---|---|---|
clinvar | 409132 | 408665 | Variant is in ClinVar. |
commonAll | 12757487 | 13027110 | Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in all projects reporting frequencies. |
commonSome | 18901486 | 19258751 | Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in some, but not all, projects reporting frequencies. |
diffMajor | 823424 | 836327 | Different frequency sources have different major alleles. |
overlapDiffClass | 99618012 | 102260850 | This variant overlaps another variant with a different type/class. |
overlapSameClass | 14790469 | 15075710 | This variant overlaps another with the same type/class but different start/end. |
revStrand | 3761191 | 4439534 | The orientation of the currently viewed reference genome sequence is different from the orientation of dbSNP's preferred top-level assembly sequence; alleles are presented on the forward strand of the currently viewed reference sequence. |
+while others may indicate that the reference genome contains a rare variant or sequencing issue: +
keyword in data file (dbSnp152.bb) | # in hg19 | # in hg38 | description |
---|---|---|---|
refIsAmbiguous | 101 | 110 | The reference genome allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G', or 'N' for 'any base'). |
refIsMinor | 2933684 | 3033691 | The reference genome allele is not the major allele in at least one project. |
refIsRare | 150892 | 189809 | The reference genome allele is rare (i.e. allele frequency < 1%). |
refIsSingleton | 45618 | 63804 | The reference genome allele has never been observed in a population sequencing project reporting frequencies. |
refMismatch | 4 | 33 | The reference genome allele reported by dbSNP differs from the GenBank assembly sequence. This is very rare and in all cases observed so far, the GenBank assembly has an 'N' while the RefSeq assembly used by dbSNP has a less ambiguous character such as 'R'. |
+and others may indicate an anomaly or problem with the variant data: +
keyword in data file (dbSnp152.bb) | # in hg19 | # in hg38 | description |
---|---|---|---|
altIsAmbiguous | 10680 | 10807 | At least one alternate allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G'). For alleles containing more than one ambiguous base, this may create a combinatoric explosion of possible alleles. |
classMismatch | 4808 | 5103 | Variation class/type is inconsistent with alleles mapped to this genome assembly. |
clusterError | 106941 | 94310 | This variant has the same start, end and class as another variant; they probably should have been merged into one variant. |
freqIsAmbiguous | 7635 | 7636 | At least one allele reported by at least one project that reports frequencies contains an IUPAC ambiguous base. |
freqNotRefAlt | 23027 | 36306 | At least one allele reported by at least one project that reports frequencies does not match any of the reference or alternate alleles listed by dbSNP. |
multiMap | 555144 | 130175 | This variant has been mapped to more than one distinct genomic location. |
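The commonAll and commonSome definitions in the first table can be made concrete with a small sketch. This is only an illustration of the wording in the table, not UCSC's pipeline code; the function name `common_note`, the constant `COMMON_MAF` and the project names in the usage lines are hypothetical.

```python
from typing import Mapping, Optional

# Threshold taken from the table: "common" means Minor Allele Frequency of at least 1%.
COMMON_MAF = 0.01


def common_note(maf_by_project: Mapping[str, float]) -> Optional[str]:
    """Classify a variant from the minor allele frequencies reported for it.

    maf_by_project holds only the projects that report a frequency for this variant.
    Returns "commonAll", "commonSome", or None when no project meets the threshold
    (or when no project reports a frequency).
    """
    if not maf_by_project:
        return None
    n_common = sum(1 for maf in maf_by_project.values() if maf >= COMMON_MAF)
    if n_common == len(maf_by_project):
        return "commonAll"
    if n_common > 0:
        return "commonSome"
    return None


# Hypothetical project names and frequencies, for illustration only.
print(common_note({"projectA": 0.12, "projectB": 0.03}))    # commonAll
print(common_note({"projectA": 0.12, "projectB": 0.002}))   # commonSome
print(common_note({"projectA": 0.001, "projectB": 0.002}))  # None
```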
+dbSNP has collected genetic variant reports from researchers worldwide for +over 20 years. +Since the advent of next-generation sequencing methods and the population sequencing efforts +that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline, +web infrastructure, and download files +(Holmes et al.). +The same challenges of exponential growth affected UCSC's presentation of dbSNP variants, +so we have taken the opportunity to change our internal representation and import pipeline. +Most notably, flanking sequences are no longer provided by dbSNP, +since most submissions have been genomic variant calls in VCF format as opposed to +independent sequences. +
++We downloaded dbSNP's JSON files available from +ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b152/JSON/, +extracted a subset of the information about each variant, and collated +it into a bigBed file using the +bigDbSnp.as schema with the information +necessary for filtering and displaying the variants, +as well as a separate file containing more detailed information to be +displayed on each variant's details page +(dbSnpDetails.as schema). + +
+The raw data underlying the UCSC Genome Browser track can be explored interactively with the +Table Browser, +Data Integrator, +or Variant Annotation Integrator. +For automated analysis, the track data files can be downloaded from the downloads server for +hg38 and +hg19 +(dbSnp152.bb); the detailed variant properties can be downloaded from +hgFixed +(dbSnp152Details.tab.gz). +
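For automated analysis of a downloaded file, one possible approach is to convert dbSnp152.bb to tab-separated text (for example with UCSC's bigBedToBed utility) and then keep only the rows carrying one of the UCSC annotation keywords listed in the tables above. The sketch below assumes the keywords are stored as a comma-separated list in a single column. The column index is passed in by the caller because the real position should be read from bigDbSnp.as. The function name `rows_with_note` is hypothetical, and the rows in the usage example are invented placeholders, not real dbSNP records.

```python
from typing import Iterable, Iterator, List


def rows_with_note(lines: Iterable[str], keyword: str, notes_col: int) -> Iterator[List[str]]:
    """Yield the fields of each tab-separated row whose notes column contains keyword.

    notes_col is the 0-based index of the comma-separated keyword column
    (an assumption here; check bigDbSnp.as for the actual layout).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= notes_col:
            continue  # skip short or malformed rows
        notes = [n for n in fields[notes_col].split(",") if n]
        if keyword in notes:
            yield fields


# Invented placeholder rows: chrom, start, end, name, notes.
example = [
    "chr1\t100\t101\trsEXAMPLE1\tcommonAll,revStrand,\n",
    "chr1\t200\t201\trsEXAMPLE2\tclusterError,\n",
    "chr1\t300\t301\trsEXAMPLE3\t\n",
]
for row in rows_with_note(example, "revStrand", notes_col=4):
    print(row[3])  # rsEXAMPLE1
```

Matching on whole keywords after splitting, rather than on substrings, avoids accidental hits when one keyword is a prefix of another.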
+Please refer to our +mailing list archives +for questions and example queries, or our +Data Access FAQ +for more information. +
+ ++https://www.biorxiv.org/content/10.1101/537449v3.full +
+ ++https://www.ncbi.nlm.nih.gov/pubmed/30395293 +PMC: PMC6323993 +
+ ++Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. +dbSNP: the NCBI database of genetic variation. +Nucleic Acids Res. 2001 Jan 1;29(1):308-11. +PMID: 11125122; +PMC: PMC29783 +