b49e61be4ad54a46e01be904fa8a8985e9850f0d angie Tue Nov 12 12:27:30 2019 -0800
dbSnp153: add a bigBed4 subtrack of coordinate ranges for mappings that we dropped due to inconsistent SPDI. refs #23283

Overall counts increased because we used to bail on an entire variant when we discovered an inconsistent SPDI, losing some valid mappings. Now we go through all mappings, and the bad ones are stored instead of dropped.

diff --git src/hg/makeDb/trackDb/human/dbSnp153Composite.html src/hg/makeDb/trackDb/human/dbSnp153Composite.html
index 4df4a35..b5f89b0 100644
--- src/hg/makeDb/trackDb/human/dbSnp153Composite.html
+++ src/hg/makeDb/trackDb/human/dbSnp153Composite.html
@@ -9,59 +9,66 @@
relative to the reference genome assembly. Most variants in dbSNP are rare, not true polymorphisms, and some variants are known to be pathogenic.

For hg38 (GRCh38), approximately 667 million distinct variants (RefSNP clusters with rs# ids) have been mapped to over 702 million genomic locations including alternate haplotype and fix patch sequences. dbSNP remapped variants from hg38 to hg19 (GRCh37); approximately 658 million distinct variants were mapped to over 683 million genomic locations including alternate haplotype and fix patch sequences (not all of which are included in UCSC's hg19).

-This track includes four subtracks:
+This track includes four subtracks of variants:

+

+A fifth subtrack highlights coordinate ranges to which dbSNP mapped a variant but with genomic
+coordinates that are not self-consistent, i.e. different coordinate ranges were provided when describing different alleles, which can occur due to a bug with mapping variants from one assembly sequence to another when there is an indel difference between the assembly sequences:
+

+
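
The self-consistency condition described above, together with the commit message's change from dropping a whole variant to storing only its bad mappings, can be sketched roughly as follows. This is a minimal illustration, not UCSC's pipeline code: the function name, the `(start, deleted_length)` tuple layout (loosely modeled on the SPDI position and deletion fields), and the choice to record the union of the conflicting ranges are all assumptions.

```python
from typing import List, Tuple

# Hypothetical representation: one (start, deleted_length) pair per allele
# of a single mapping of a variant to an assembly sequence.
AlleleCoords = Tuple[int, int]
Range = Tuple[int, int]


def split_mappings(
    mappings: List[List[AlleleCoords]],
) -> Tuple[List[Range], List[Range]]:
    """Partition a variant's mappings into consistent and inconsistent ones.

    Every mapping is examined. An inconsistent mapping is recorded as a
    coordinate range instead of causing the whole variant to be abandoned.
    """
    good: List[Range] = []
    bad_ranges: List[Range] = []
    for alleles in mappings:
        # Half-open [start, end) range implied by each allele.
        ranges = {(start, start + length) for start, length in alleles}
        if len(ranges) == 1:
            good.append(next(iter(ranges)))
        else:
            # Assumption: store the span covering all conflicting ranges.
            lo = min(start for start, _ in ranges)
            hi = max(end for _, end in ranges)
            bad_ranges.append((lo, hi))
    return good, bad_ranges
```

Under these assumptions, a variant with one consistent mapping and one inconsistent mapping keeps the first and contributes a single range to the fifth subtrack for the second.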

Interpreting and Configuring the Graphical Display

SNVs and pure deletions are displayed as boxes covering the affected base(s). Pure insertions are drawn as single-pixel tickmarks between the base before and the base after the insertion.

Insertions and/or deletions in repetitive regions may be represented by a half-height box showing uncertainty in placement, followed by a full-height box showing the number of deleted bases, or a full-height tickmark to indicate an insertion. When an insertion or deletion falls in a repetitive region, the placement may be ambiguous. For example, if the reference genome contains "TAAAG" but some individuals have "TAAG" at the same location, then the variant is a deletion of a single A relative to the reference genome. However, which A was deleted? There is no way to tell whether the first, second or third A
@@ -178,194 +185,194 @@

Interesting and anomalous conditions noted by UCSC

While processing the information downloaded from dbSNP, UCSC annotates some properties of interest. These are noted on the item details page, and may be useful to include or exclude affected variants.

Some are purely informational:

keyword in data file (dbSnp153.bb) | # in hg19 (before -> after) | # in hg38 (before -> after) | description
clinvar | 454656 -> 454674 | 453954 -> 453990 | Variant is in ClinVar.
clinvarBenign | 143844 -> 143860 | 143696 -> 143730 | Variant is in ClinVar with clinical significance of benign and/or likely benign.
clinvarConflicting | 7932 | 7950 | Variant is in ClinVar with reports of both benign and pathogenic significance.
clinvarPathogenic | 96242 | 95262 | Variant is in ClinVar with clinical significance of pathogenic and/or likely pathogenic.
commonAll | 12178426 -> 12184226 | 12430253 -> 12438325 | Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in all projects reporting frequencies.
commonSome | 20534330 -> 20540882 | 20893174 -> 20902602 | Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in some, but not all, projects reporting frequencies.
diffMajor | 1377402 -> 1377817 | 1398591 -> 1399094 | Different frequency sources have different major alleles.
overlapDiffClass | 106940656 -> 107003090 | 109838613 -> 109991096 | This variant overlaps another variant with a different type/class.
overlapSameClass | 16890303 -> 16910407 | 17228657 -> 17281744 | This variant overlaps another with the same type/class but different start/end.
rareAll | 662571654 -> 662595470 | 681626796 -> 681685476 | Variant is "rare", i.e. has a Minor Allele Frequency of less than 1% in all projects reporting frequencies, or has no frequency data.
rareSome | 670927558 -> 670952126 | 690089717 -> 690149753 | Variant is "rare", i.e. has a Minor Allele Frequency of less than 1% in some, but not all, projects reporting frequencies, or has no frequency data.
revStrand | 3813390 -> 3813467 | 4512600 -> 4532270 | Alleles are displayed on the + strand at the current position. dbSNP's alleles are displayed on the + strand of a different assembly sequence, so dbSNP's variant page shows alleles that are reverse-complemented with respect to the alleles displayed above.
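
The common/rare keywords above amount to a threshold test on per-project minor allele frequencies. The sketch below follows the table descriptions literally, including "some, but not all" and the treatment of variants with no frequency data; the function name and the input dictionary are hypothetical, and the actual assignment in dbSnp153.bb may differ in detail.

```python
from typing import Dict, Set


def frequency_keywords(
    maf_by_project: Dict[str, float], threshold: float = 0.01
) -> Set[str]:
    """Return common/rare keywords for one variant.

    maf_by_project maps a (hypothetical) project name to the variant's
    minor allele frequency in that project.
    """
    if not maf_by_project:
        # Both rare descriptions include "or has no frequency data".
        return {"rareAll", "rareSome"}
    is_common = [maf >= threshold for maf in maf_by_project.values()]
    if all(is_common):
        return {"commonAll"}
    if not any(is_common):
        return {"rareAll"}
    # Common in some projects and rare in others.
    return {"commonSome", "rareSome"}
```

For example, a variant with a frequency of 20% in one project and 0.1% in another would receive both `commonSome` and `rareSome` under this reading.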

while others may indicate that the reference genome contains a rare variant or sequencing issue:

keyword in data file (dbSnp153.bb) | # in hg19 (before -> after) | # in hg38 (before -> after) | description
refIsAmbiguous | 101 | 111 | The reference genome allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G', or 'N' for 'any base').
refIsMinor | 3269451 -> 3271878 | 3356557 -> 3360159 | The reference genome allele is not the major allele in at least one project.
refIsRare | 135265 -> 136452 | 158562 -> 160723 | The reference genome allele is rare (i.e. allele frequency < 1%).
refIsSingleton | 36709 -> 37783 | 48859 -> 50865 | The reference genome allele has never been observed in a population sequencing project reporting frequencies.
refMismatch | 4 | 33 | The reference genome allele reported by dbSNP differs from the GenBank assembly sequence. This is very rare and in all cases observed so far, the GenBank assembly has an 'N' while the RefSeq assembly used by dbSNP has a less ambiguous character such as 'R'.

and others may indicate an anomaly or problem with the variant data:

keyword in data file (dbSnp153.bb) | # in hg19 (before -> after) | # in hg38 (before -> after) | description
altIsAmbiguous | 10747 -> 10754 | 10873 -> 10880 | At least one alternate allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G'). For alleles containing more than one ambiguous base, this may create a combinatoric explosion of possible alleles.
classMismatch | 5701 -> 5995 | 5864 -> 6206 | Variation class/type is inconsistent with alleles mapped to this genome assembly.
clusterError | 113678 -> 114685 | 126973 -> 128109 | This variant has the same start, end and class as another variant; they probably should have been merged into one variant.
freqIsAmbiguous | 7649 -> 7656 | 7749 -> 7756 | At least one allele reported by at least one project that reports frequencies contains an IUPAC ambiguous base.
freqNotRefAlt | 16950 -> 17684 | 30615 -> 32150 | At least one allele reported by at least one project that reports frequencies does not match any of the reference or alternate alleles listed by dbSNP.
multiMap | 561309 -> 562157 | 132015 -> 132051 | This variant has been mapped to more than one distinct genomic location.
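
The combinatoric growth mentioned for altIsAmbiguous can be seen by expanding an allele's IUPAC codes into every concrete sequence they permit. The sketch below uses the standard IUPAC nucleotide codes; the function name is hypothetical and the code is only an illustration, not part of the UCSC processing.

```python
from itertools import product
from typing import List

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguous(allele: str) -> List[str]:
    """List every unambiguous sequence an IUPAC-coded allele can represent."""
    choices = (IUPAC[base] for base in allele.upper())
    return ["".join(combo) for combo in product(*choices)]
```

The number of expansions is the product of the choices at each position, so an allele of four 'N' bases already stands for 4^4 = 256 distinct sequences.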

Data Sources and Methods

dbSNP has collected genetic variant reports from researchers worldwide for over 20 years. Since the advent of next-generation sequencing methods and the population sequencing efforts that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline, web infrastructure, and download files. (Holmes et al.) The same challenges of exponential growth affected UCSC's presentation of dbSNP variants,