895644af226dd57285d4b0469da87dc34df76e5e angie Thu Oct 24 13:55:25 2019 -0700
trackDb and descriptions for new tracks dbSnp152* and dbSnp153*. tagTypes changes for bigDbSnp type (and subsequent big* types). refs #23283

diff --git src/hg/makeDb/trackDb/human/dbSnp152.shared.html src/hg/makeDb/trackDb/human/dbSnp152.shared.html
new file mode 100644
index 0000000..e88a07f
--- /dev/null
+++ src/hg/makeDb/trackDb/human/dbSnp152.shared.html
@@ -0,0 +1,324 @@
+

Interpreting and Configuring the Graphical Display

+

+SNVs and pure deletions are displayed as boxes covering the affected base(s).
+Pure insertions are drawn as single-pixel tickmarks between
+the base before and the base after the insertion.
+

+Insertions and/or deletions in repetitive regions may be represented by a half-height box
+showing uncertainty in placement, followed by a full-height box showing the number of deleted
+bases, or a full-height tickmark to indicate an insertion.
+When an insertion or deletion falls in a repetitive region, the placement may be ambiguous.
+For example, if the reference genome contains "TAAAG" but some
+individuals have "TAAG" at the same location, then the variant is a deletion of a single
+A relative to the reference genome.
+However, which A was deleted? There is no way to tell whether the first, second or third A
+was removed.
+Different variant mapping tools may place the deletion at different bases in the reference genome.
+In order to reduce errors in merging variant calls made with different left vs. right biases,
+dbSNP made a major change in its representation of deletion/insertion variants in build 152.
+Now, instead of assigning a single-base genomic location at one of the A's,
+dbSNP expands the coordinates to encompass the whole repetitive region,
+so the variant is represented as a deletion of 3 A's combined with an insertion of 2 A's.
+In the track display, there will be a half-height box covering the first two A's,
+followed by a full-height box covering the third A, in order to show a net loss of one base
+but an uncertain placement within the three A's.
+
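The "TAAAG" example above can be checked with a short sketch. This is an illustration only, not dbSNP's or UCSC's code: the function names and the `shift_bases` label are hypothetical, coordinates are 0-based half-open, and only pure deletions are handled.

```python
def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return ref with `length` bases removed starting at 0-based `start`."""
    return ref[:start] + ref[start + length:]


def expand_deletion(ref: str, start: int, length: int):
    """Expand a deletion to the full region over which it can slide.

    Returns (region_start, region_end, ref_allele, alt_allele, shift_bases)
    using 0-based half-open coordinates.
    """
    # Slide left: deleting ref[s-1:e-1] gives the same sequence as deleting
    # ref[s:e] exactly when ref[s-1] == ref[e-1].
    left_start, left_end = start, start + length
    while left_start > 0 and ref[left_start - 1] == ref[left_end - 1]:
        left_start -= 1
        left_end -= 1

    # Slide right: deleting ref[s+1:e+1] is equivalent when ref[s] == ref[e].
    right_start, right_end = start, start + length
    while right_end < len(ref) and ref[right_start] == ref[right_end]:
        right_start += 1
        right_end += 1

    ref_allele = ref[left_start:right_end]              # whole ambiguous region
    alt_allele = ref_allele[:len(ref_allele) - length]  # region minus deleted bases
    shift_bases = len(ref_allele) - length              # bases of placement uncertainty
    return left_start, right_end, ref_allele, alt_allele, shift_bases


if __name__ == "__main__":
    reference = "TAAAG"
    for pos in (1, 2, 3):  # delete the first, second or third A
        print(pos, apply_deletion(reference, pos, 1), expand_deletion(reference, pos, 1))
```

Deleting any one of the three A's gives "TAAG", and each starting position expands to the same record: region "AAA" replaced by "AA", with two bases of uncertainty (the half-height box) followed by one net deleted base (the full-height box).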

+

+Variants are colored according to functional effect on genes annotated by dbSNP.
+Protein-altering variants and splice site variants are
+red,
+synonymous codon variants are
+green,
+and non-coding transcript or Untranslated Region (UTR) variants are
+blue.
+

+

+On the track controls page, several variant properties can be included or excluded from
+the item labels:
+rs# identifier assigned by dbSNP,
+reference/alternate alleles,
+major/minor alleles (when available) and
+minor allele frequency (when available).
+Allele frequencies are reported independently by nine projects, as described by dbSNP:
+

+The project from which to take allele frequency data defaults to 1000 Genomes
+but can be set to any of those projects.
+

+

+Using the track controls, variants can be filtered by + +

+

+ + +

Interesting and anomalous conditions noted by UCSC

+

+While processing the information downloaded from dbSNP,
+UCSC annotates some properties of interest.
+These are noted on the item details page,
+and may be useful to include or exclude affected variants.
+Some are purely informational:
+

+keyword in data file (dbSnp152.bb)  # in hg19  # in hg38  description
+clinvar                             409132     408665     Variant is in ClinVar.
+commonAll                           12757487   13027110   Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in all projects reporting frequencies.
+commonSome                          18901486   19258751   Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in some, but not all, projects reporting frequencies.
+diffMajor                           823424     836327     Different frequency sources have different major alleles.
+overlapDiffClass                    99618012   102260850  This variant overlaps another variant with a different type/class.
+overlapSameClass                    14790469   15075710   This variant overlaps another with the same type/class but different start/end.
+revStrand                           3761191    4439534    The orientation of the currently viewed reference genome sequence is different from the orientation of dbSNP's preferred top-level assembly sequence; alleles are presented on the forward strand of the currently viewed reference sequence.
+

+while others may indicate that the reference genome contains a rare variant or sequencing issue: +

+keyword in data file (dbSnp152.bb)  # in hg19  # in hg38  description
+refIsAmbiguous                      101        110        The reference genome allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G', or 'N' for 'any base').
+refIsMinor                          2933684    3033691    The reference genome allele is not the major allele in at least one project.
+refIsRare                           150892     189809     The reference genome allele is rare (i.e. allele frequency < 1%).
+refIsSingleton                      45618      63804      The reference genome allele has never been observed in a population sequencing project reporting frequencies.
+refMismatch                         4          33         The reference genome allele reported by dbSNP differs from the GenBank assembly sequence. This is very rare and in all cases observed so far, the GenBank assembly has an 'N' while the RefSeq assembly used by dbSNP has a less ambiguous character such as 'R'.
+

+and others may indicate an anomaly or problem with the variant data: +

+keyword in data file (dbSnp152.bb)  # in hg19  # in hg38  description
+altIsAmbiguous                      10680      10807      At least one alternate allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G'). For alleles containing more than one ambiguous base, this may create a combinatoric explosion of possible alleles.
+classMismatch                       4808       5103       Variation class/type is inconsistent with alleles mapped to this genome assembly.
+clusterError                        106941     94310      This variant has the same start, end and class as another variant; they probably should have been merged into one variant.
+freqIsAmbiguous                     7635       7636       At least one allele reported by at least one project that reports frequencies contains an IUPAC ambiguous base.
+freqNotRefAlt                       23027      36306      At least one allele reported by at least one project that reports frequencies does not match any of the reference or alternate alleles listed by dbSNP.
+multiMap                            555144     130175     This variant has been mapped to more than one distinct genomic location.
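The "combinatoric explosion" mentioned for altIsAmbiguous can be made concrete with a small sketch. It uses the standard IUPAC nucleotide codes; the function names are hypothetical, and nothing here is taken from the UCSC pipeline.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_expansions(allele: str) -> int:
    """Number of concrete alleles an IUPAC allele stands for."""
    return prod(len(IUPAC[base]) for base in allele.upper())


def expand_iupac(allele: str) -> list:
    """List every concrete allele represented by an IUPAC allele."""
    choices = (IUPAC[base] for base in allele.upper())
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(expand_iupac("AR"))            # one ambiguous base
    print(count_expansions("NNNNNNNN"))  # eight ambiguous bases
```

The count multiplies per base, so an allele with eight N's already stands for 4^8 = 65536 concrete sequences. Counting before expanding lets a caller skip alleles that would be too large to enumerate.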
+ + +
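The frequency-based keywords in the tables above (commonAll, commonSome, diffMajor, refIsMinor) follow from per-project allele frequencies. The sketch below is one reading of those descriptions, not UCSC's implementation. It assumes that the minor allele of a project is its second most frequent allele, that ties are broken alphabetically, and that frequencies arrive as a plain dictionary. The project names are hypothetical.

```python
def frequency_notes(ref, freqs_by_project, threshold=0.01):
    """Derive frequency-based keywords for one variant.

    freqs_by_project maps a project name to {allele: frequency}.
    Assumption: the minor allele is the second most frequent allele.
    """
    notes = set()
    major_alleles = set()
    is_common = []
    for freqs in freqs_by_project.values():
        # Rank alleles by descending frequency, then alphabetically for ties.
        ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
        major = ranked[0][0]
        minor_freq = ranked[1][1] if len(ranked) > 1 else 0.0
        major_alleles.add(major)
        is_common.append(minor_freq >= threshold)
        if major != ref:
            notes.add("refIsMinor")  # ref is not the major allele somewhere
    if is_common and all(is_common):
        notes.add("commonAll")
    elif any(is_common):
        notes.add("commonSome")
    if len(major_alleles) > 1:
        notes.add("diffMajor")
    return notes


if __name__ == "__main__":
    example = {
        "projectA": {"C": 0.60, "T": 0.40},  # hypothetical project names
        "projectB": {"C": 0.30, "T": 0.70},
    }
    print(sorted(frequency_notes("C", example)))
```

In this example both projects report a minor allele frequency above 1%, but they disagree on the major allele, so the variant gets commonAll, diffMajor and refIsMinor.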

Data Sources and Methods

+

+dbSNP has collected genetic variant reports from researchers worldwide for
+over 20 years.
+Since the advent of next-generation sequencing methods and the population sequencing efforts
+that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline,
+web infrastructure, and download files
+(Holmes et al.).
+The same challenges of exponential growth affected UCSC's presentation of dbSNP variants,
+so we have taken the opportunity to change our internal representation and import pipeline.
+Most notably, flanking sequences are no longer provided by dbSNP,
+since most submissions have been genomic variant calls in VCF format as opposed to
+independent sequences.
+

+

+We downloaded dbSNP's JSON files available from
+ftp://ftp.ncbi.nlm.nih.gov/snp/archive/b152/JSON/,
+extracted a subset of the information about each variant, and collated
+it into a bigBed file using the
+bigDbSnp.as schema with the information
+necessary for filtering and displaying the variants,
+as well as a separate file containing more detailed information to be
+displayed on each variant's details page
+(dbSnpDetails.as schema).
+
+

Data Access

+

+The raw data underlying the UCSC Genome Browser track can be explored interactively with the
+Table Browser,
+Data Integrator,
+or Variant Annotation Integrator.
+For automated analysis, the track data files can be downloaded from the downloads server for
+hg38 and
+hg19
+(dbSnp152.bb); the detailed variant properties can be downloaded from
+hgFixed
+(dbSnp152Details.tab.gz).
+

+Please refer to our
+mailing list archives
+for questions and example queries, or our
+Data Access FAQ
+for more information.
+
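For automated analysis, one possible workflow is to convert dbSnp152.bb to tab-separated text (for example with UCSC's bigBedToBed utility) and then filter rows on the UCSC keywords described above. The sketch below assumes the keywords are stored as a comma-separated list in a single column. The column index is a caller-supplied parameter because its position should be checked against bigDbSnp.as. The rows and rs identifiers in the demo are made up.

```python
def parse_notes(field):
    """Split a comma-separated keyword field, ignoring empty entries."""
    return {note for note in field.split(",") if note}


def filter_by_notes(lines, notes_col, require=(), exclude=()):
    """Yield column lists for rows that pass the keyword filters.

    lines     : iterable of tab-separated text lines
    notes_col : 0-based index of the keyword column (verify against bigDbSnp.as)
    require   : keywords that must all be present
    exclude   : keywords that must all be absent
    """
    required, excluded = set(require), set(exclude)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        notes = parse_notes(cols[notes_col])
        if required <= notes and not (excluded & notes):
            yield cols


if __name__ == "__main__":
    # Toy rows with a shortened, hypothetical column layout:
    # chrom, start, end, name, keywords
    rows = [
        "chr1\t100\t101\trs1\tclinvar,commonAll,",
        "chr1\t200\t201\trs2\tcommonSome,multiMap,",
        "chr1\t300\t301\trs3\t",
    ]
    for cols in filter_by_notes(rows, notes_col=4, exclude={"multiMap"}):
        print(cols[3])
```

Passing the column index explicitly keeps the filter usable if the schema differs from what the caller expects. The same function can be applied to a file handle opened on the converted text.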

+ +

References

+ +

+https://www.biorxiv.org/content/10.1101/537449v3.full +

+ +

+https://www.ncbi.nlm.nih.gov/pubmed/30395293
+PMC: PMC6323993
+

+ +

+Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K.
+dbSNP: the NCBI database of genetic variation.
+Nucleic Acids Res. 2001 Jan 1;29(1):308-11.
+PMID: 11125122;
+PMC: PMC29783
+