8c2f7318d8d821de9b2a25750586a94ab5e8c1bb lrnassar Fri Nov 15 18:50:19 2024 -0800 Giving the UI link cronjob some love by fixing all the 301 redirects. These are the bulk of the items listed on the cron. No RM. diff --git src/hg/makeDb/trackDb/human/dbSnp155Composite.html src/hg/makeDb/trackDb/human/dbSnp155Composite.html index a622dfe..376e91b 100644 --- src/hg/makeDb/trackDb/human/dbSnp155Composite.html +++ src/hg/makeDb/trackDb/human/dbSnp155Composite.html @@ -1,784 +1,784 @@

Description

This track shows short genetic variants (up to approximately 50 base pairs) from dbSNP build 155: single-nucleotide variants (SNVs), small insertions, deletions, and complex deletion/insertions (indels), relative to the reference genome assembly. Most variants in dbSNP are rare, not true polymorphisms, and some variants are known to be pathogenic.

For hg38 (GRCh38), approximately 998 million distinct variants (RefSNP clusters with rs# ids) have been mapped to more than 1.06 billion genomic locations including alternate haplotype and fix patch sequences. dbSNP remapped variants from hg38 to hg19 (GRCh37); approximately 981 million distinct variants were mapped to more than 1.02 billion genomic locations including alternate haplotype and fix patch sequences (not all of which are included in UCSC's hg19).

This track includes four subtracks of variants:

A fifth subtrack highlights coordinate ranges to which dbSNP mapped a variant but with genomic coordinates that are not internally consistent, i.e. different coordinate ranges were provided when describing different alleles. This can occur due to a bug with mapping variants from one assembly sequence to another when there is an indel difference between the assembly sequences:

Interpreting and Configuring the Graphical Display

SNVs and pure deletions are displayed as boxes covering the affected base(s). Pure insertions are drawn as single-pixel tickmarks between the base before and the base after the insertion.

Insertions and/or deletions in repetitive regions may be represented by a half-height box showing uncertainty in placement, followed by a full-height box showing the number of deleted bases, or a full-height tickmark to indicate an insertion. When an insertion or deletion falls in a repetitive region, the placement may be ambiguous. For example, if the reference genome contains "TAAAG" but some individuals have "TAAG" at the same location, then the variant is a deletion of a single A relative to the reference genome. However, which A was deleted? There is no way to tell whether the first, second or third A was removed. Different variant mapping tools may place the deletion at different bases in the reference genome. To reduce errors in merging variant calls made with different left vs. right biases, dbSNP made a major change in its representation of deletion/insertion variants in build 152. Now, instead of assigning a single-base genomic location at one of the A's, dbSNP expands the coordinates to encompass the whole repetitive region, so the variant is represented as a deletion of 3 A's combined with an insertion of 2 A's. In the track display, there will be a half-height box covering the first two A's, followed by a full-height box covering the third A, to show a net loss of one base but an uncertain placement within the three A's.
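The "TAAAG" example above can be checked with a small sketch. It is only an illustration of the idea, not dbSNP's or UCSC's actual code; the function names are hypothetical and coordinates are assumed to be 0-based, half-open.

```python
from __future__ import annotations


def ambiguous_deletion_span(ref: str, start: int, length: int) -> tuple[int, int]:
    """Return the 0-based half-open span of reference bases within which a
    deletion of `length` bases starting at `start` can be shifted without
    changing the resulting sequence."""
    left = start
    # Shifting the deleted window one base left gives the same result when
    # the base entering the window equals the base leaving it.
    while left > 0 and ref[left - 1] == ref[left - 1 + length]:
        left -= 1
    right = start + length
    # Same test for shifting the window one base to the right.
    while right < len(ref) and ref[right] == ref[right - length]:
        right += 1
    return left, right


def display_boxes(span: tuple[int, int], length: int):
    """Split the expanded span into the half-height part (uncertain placement)
    and the full-height part (number of deleted bases), as described above."""
    span_start, span_end = span
    boundary = span_end - length
    return (span_start, boundary), (boundary, span_end)


ref = "TAAAG"
# Deleting any one of the three A's yields the same sequence.
results = {ref[:i] + ref[i + 1:] for i in (1, 2, 3)}
span = ambiguous_deletion_span(ref, 2, 1)
half_height, full_height = display_boxes(span, 1)
print(results, span, half_height, full_height)
```

Whichever A is chosen as the starting point, the expanded span covers all three A's, which is why the display can show the net loss of one base without committing to a single position.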

Variants are colored according to functional effect on genes annotated by dbSNP:

Protein-altering variants and splice site variants are red.
Synonymous codon variants are green.
Non-coding transcript or Untranslated Region (UTR) variants are blue.

On the track controls page, several variant properties can be included or excluded from the item labels: rs# identifier assigned by dbSNP, reference/alternate alleles, major/minor alleles (when available) and minor allele frequency (when available). Allele frequencies are reported independently by each project (some of which may have overlapping sets of samples):

The project from which to take allele frequency data defaults to 1000 Genomes but can be set to any of those projects.

Using the track controls, variants can be filtered by

Interesting and anomalous conditions noted by UCSC

While processing the information downloaded from dbSNP, UCSC annotates some properties of interest. These are noted on the item details page, and may be useful to include or exclude affected variants.

Some are purely informational:

keyword in data file (dbSnp155.bb) # in hg19 # in hg38 description
clinvar 627817 630503 Variant is in ClinVar.
clinvarBenign 275541 276409 Variant is in ClinVar with clinical significance of benign and/or likely benign.
clinvarConflicting 16925 16834 Variant is in ClinVar with reports of both benign and pathogenic significance.
clinvarPathogenic 56373 56475 Variant is in ClinVar with clinical significance of pathogenic and/or likely pathogenic.
commonAll 14904503 15862783 Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in all projects reporting frequencies.
commonSome 59633864 62095091 Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in some, but not all, projects reporting frequencies.
diffMajor 12748733 13073288 Different frequency sources have different major alleles.
overlapDiffClass 198945442 207101421 This variant overlaps another variant with a different type/class.
overlapSameClass 29281958 30301090 This variant overlaps another with the same type/class but different start/end.
rareAll 906113910 938985356 Variant is "rare", i.e. has a Minor Allele Frequency of less than 1% in all projects reporting frequencies, or has no frequency data.
rareSome 950843271 985217664 Variant is "rare", i.e. has a Minor Allele Frequency of less than 1% in some, but not all, projects reporting frequencies, or has no frequency data.
revStrand 5540864 6770772 Alleles are displayed on the + strand at the current position. dbSNP's alleles are displayed on the + strand of a different assembly sequence, so dbSNP's variant page shows alleles that are reverse-complemented with respect to the alleles displayed above.

while others may indicate that the reference genome contains a rare variant or sequencing issue:

keyword in data file (dbSnp155.bb) # in hg19 # in hg38 description
refIsAmbiguous 19 41 The reference genome allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G', or 'N' for 'any base').
refIsMinor 14950212 15386394 The reference genome allele is not the major allele in at least one project.
refIsRare 793081 822757 The reference genome allele is rare (i.e. allele frequency < 1%).
refIsSingleton 694310 712794 The reference genome allele has never been observed in a population sequencing project reporting frequencies.
refMismatch 1 18 The reference genome allele reported by dbSNP differs from the GenBank assembly sequence. This is very rare and in all cases observed so far, the GenBank assembly has an 'N' while the RefSeq assembly used by dbSNP has a less ambiguous character such as 'R'.

and others may indicate an anomaly or problem with the variant data:

keyword in data file (dbSnp155.bb) # in hg19 # in hg38 description
altIsAmbiguous 5294 5361 At least one alternate allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G'). For alleles containing more than one ambiguous base, this may create a combinatoric explosion of possible alleles.
classMismatch 13289 18475 Variation class/type is inconsistent with alleles mapped to this genome assembly.
clusterError 373258 459130 This variant has the same start, end and class as another variant; they probably should have been merged into one variant.
freqIncomplete 0 0 At least one project reported counts for only one allele which implies that at least one allele is missing from the report; that project's frequency data are ignored.
freqIsAmbiguous 4332 4399 At least one allele reported by at least one project that reports frequencies contains an IUPAC ambiguous base.
freqNotMapped 1149972 1141935 At least one project reported allele frequencies relative to a different assembly; however, dbSNP does not include a mapping of this variant to that assembly, which implies a problem with mapping the variant across assemblies. The mapping on this assembly may have an issue; evaluate carefully vs. original submissions, which you can view by clicking through to dbSNP above.
freqNotRefAlt 74139 110646 At least one allele reported by at least one project that reports frequencies does not match any of the reference or alternate alleles listed by dbSNP.
multiMap 799777 286666 This variant has been mapped to more than one distinct genomic location.
otherMapErr 91260 195051 At least one other mapping of this variant has erroneous coordinates. The mapping(s) with erroneous coordinates are excluded from this track and are included in the Map Err subtrack. Sometimes despite this mapping having legal coordinates, there may still be an issue with this mapping's coordinates and alleles; you may want to click through to dbSNP to compare the initial submission's coordinates and alleles. In hg19, 55454 distinct rsIDs are affected; in hg38, 86636.
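The keywords in the three tables above can also be used to include or exclude variants in downloaded data. The following is a minimal sketch, assuming the keywords are stored as a comma-separated list in a column named ucscNotes at 0-based index 14 of each tab-separated row printed by bigBedToBed; that column position is an assumption and should be checked against bigDbSnp.as. The helper names and the sample rows are made up for illustration.

```python
from __future__ import annotations

# Assumption: ucscNotes is the 15th column (0-based index 14) of a bigDbSnp
# row as printed by bigBedToBed; confirm against bigDbSnp.as before use.
UCSC_NOTES_COLUMN = 14


def notes_of(bed_line: str) -> set[str]:
    """Return the set of keywords found on one tab-separated row."""
    fields = bed_line.rstrip("\n").split("\t")
    # Drop empty strings so a trailing comma in the list is harmless.
    return {note for note in fields[UCSC_NOTES_COLUMN].split(",") if note}


def filter_by_notes(lines, include=(), exclude=()):
    """Yield rows carrying every keyword in `include` and none in `exclude`."""
    for line in lines:
        notes = notes_of(line)
        if all(k in notes for k in include) and not any(k in notes for k in exclude):
            yield line


def fake_row(name: str, notes: str) -> str:
    """Build a made-up 17-column row, for demonstration only."""
    fields = ["chr1", "0", "1", name] + ["."] * 13
    fields[UCSC_NOTES_COLUMN] = notes
    return "\t".join(fields)


rows = [
    fake_row("rs1", "clinvar,clinvarPathogenic,rareAll,"),
    fake_row("rs2", "commonAll,"),
    fake_row("rs3", "clinvar,clusterError,"),
]
kept = list(filter_by_notes(rows, include=["clinvar"], exclude=["clusterError"]))
print([row.split("\t")[3] for row in kept])
```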

Data Sources and Methods

dbSNP has collected genetic variant reports from researchers worldwide for more than 20 years. Since the advent of next-generation sequencing methods and the population sequencing efforts that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline, web infrastructure, and download files (Holmes et al.). The same challenges of exponential growth affected UCSC's presentation of dbSNP variants, so we have taken the opportunity to change our internal representation and import pipeline. Most notably, flanking sequences are no longer provided by dbSNP, because most submissions have been genomic variant calls in VCF format as opposed to independent sequences.

We downloaded JSON files available from dbSNP at -http://ftp.ncbi.nlm.nih.gov/snp/archive/b155/JSON/, +https://ftp.ncbi.nlm.nih.gov/snp/archive/b155/JSON/, extracted a subset of the information about each variant, and collated it into a bigBed file using the bigDbSnp.as schema with the information necessary for filtering and displaying the variants, as well as a separate file containing more detailed information to be displayed on each variant's details page (dbSnpDetails.as schema).

Data Access

Note: It is not recommended to use LiftOver to convert SNPs between assemblies. More information about how to convert SNPs between assemblies can be found in the following FAQ entry.

Since dbSNP has grown to include over 1 billion variants, the size of the All dbSNP (155) subtrack can cause the Table Browser and Data Integrator to time out, leading to a blank page or truncated output, unless queries are restricted to a chromosomal region, to particular defined regions, to a specific set of rs# IDs (which can be pasted/uploaded into the Table Browser), or to one of the subset tracks such as Common (~15 million variants) or ClinVar (~0.8 million variants).

For automated analysis, the track data files can be downloaded from the downloads server for hg19 and hg38.
file format subtrack
dbSnp155.bb hg19 hg38 bigDbSnp (bigBed4+13) All dbSNP (155)
dbSnp155ClinVar.bb hg19 hg38 bigDbSnp (bigBed4+13) ClinVar dbSNP (155)
dbSnp155Common.bb hg19 hg38 bigDbSnp (bigBed4+13) Common dbSNP (155)
dbSnp155Mult.bb hg19 hg38 bigDbSnp (bigBed4+13) Mult. dbSNP (155)
dbSnp155BadCoords.bb hg19 hg38 bigBed4 Map Err (155)
dbSnp155Details.tab.gz gzip-compressed tab-separated text Detailed variant properties, independent of genome assembly version

Several utilities for working with bigBed-formatted binary files can be downloaded here. Run a utility with no arguments to see a brief description of the utility and its options.

Example: retrieve all variants in the region chr1:200001-200400

bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp155.bb -chrom=chr1 -start=200000 -end=200400 stdout
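Note that the region chr1:200001-200400 is written with a 1-based start, while the command uses -start=200000, a 0-based start with an unchanged end. The sketch below shows that conversion; the function name is hypothetical and the output is only the three options used in the example command.

```python
import re


def region_to_bigbed_args(region: str) -> list:
    """Convert a 1-based, fully closed region such as 'chr1:200001-200400'
    into the 0-based, half-open -chrom/-start/-end options used above."""
    match = re.fullmatch(r"(\w+):(\d[\d,]*)-(\d[\d,]*)", region.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {region!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in region: {region!r}")
    # Only the start shifts by one; the end is the same in both conventions.
    return [f"-chrom={chrom}", f"-start={start - 1}", f"-end={end}"]


print(" ".join(region_to_bigbed_args("chr1:200001-200400")))
```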

Example: retrieve variant rs6657048

bigBedNamedItems dbSnp155.bb rs6657048 stdout

Example: retrieve all variants with rs# IDs in a file (myIds.txt) and output to another file (dbSnp155.myIds.bed)

bigBedNamedItems -nameFile dbSnp155.bb myIds.txt dbSnp155.myIds.bed
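A file such as myIds.txt can be prepared with a short script. This is a sketch under the assumption that the name file holds one rs# ID per line; the function names are hypothetical, and rs123 in the usage line is a made-up ID.

```python
import re

RS_ID = re.compile(r"rs[0-9]+")


def clean_rs_ids(raw_lines):
    """Strip whitespace, skip blank lines, reject malformed IDs and drop
    duplicates while keeping the original order."""
    seen = set()
    ids = []
    for raw in raw_lines:
        rs_id = raw.strip()
        if not rs_id:
            continue
        if RS_ID.fullmatch(rs_id) is None:
            raise ValueError(f"not an rs# ID: {rs_id!r}")
        if rs_id not in seen:
            seen.add(rs_id)
            ids.append(rs_id)
    return ids


def write_name_file(ids, path):
    """Write one ID per line (assumed layout of the -nameFile input)."""
    with open(path, "w") as out:
        out.write("\n".join(ids) + "\n")


print(clean_rs_ids(["rs6657048\n", "", " rs6657048 ", "rs123"]))
```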

The columns in the bigDbSnp/bigBed files and dbSnp155Details.tab.gz file are described in bigDbSnp.as and dbSnpDetails.as respectively. For columns that contain lists of allele frequency data, the order of projects providing the data listed is as follows:

  1. 1000Genomes
  2. dbGaP_PopFreq
  3. TOPMED
  4. KOREAN
  5. SGDP_PRJ
  6. Qatari
  7. NorthernSweden
  8. Siberian
  9. TWINSUK
  10. TOMMO
  11. ALSPAC
  12. GENOME_DK
  13. GnomAD
  14. GoNL
  15. Estonian
  16. Vietnamese
  17. Korea1K
  18. HapMap
  19. PRJEB36033
  20. HGDP_Stanford
  21. Daghestan
  22. PAGE_STUDY
  23. Chileans
  24. MGP
  25. PRJEB37584
  26. GoESP
  27. ExAC
  28. GnomAD_exomes
  29. FINRISK
  30. PharmGKB
  31. PRJEB37766
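The project order above can be used to label the values in a frequency list column. The following is a rough sketch, assuming the column is a comma-separated list (possibly with a trailing comma), that the list may be shorter than 31 entries, and that a project without data is marked by a non-finite value such as -inf or NaN; these details are assumptions to verify against bigDbSnp.as and the actual data. The function name is hypothetical.

```python
from __future__ import annotations

import math

# Project order as listed above.
PROJECTS = [
    "1000Genomes", "dbGaP_PopFreq", "TOPMED", "KOREAN", "SGDP_PRJ", "Qatari",
    "NorthernSweden", "Siberian", "TWINSUK", "TOMMO", "ALSPAC", "GENOME_DK",
    "GnomAD", "GoNL", "Estonian", "Vietnamese", "Korea1K", "HapMap",
    "PRJEB36033", "HGDP_Stanford", "Daghestan", "PAGE_STUDY", "Chileans",
    "MGP", "PRJEB37584", "GoESP", "ExAC", "GnomAD_exomes", "FINRISK",
    "PharmGKB", "PRJEB37766",
]


def freqs_by_project(freq_field: str) -> dict[str, float]:
    """Pair each value in a comma-separated frequency list with its project,
    skipping entries that are not finite numbers (assumed to mean no data)."""
    text = freq_field.strip()
    if text.endswith(","):
        text = text[:-1]
    if not text:
        return {}
    result = {}
    for project, raw in zip(PROJECTS, text.split(",")):
        value = float(raw)
        if math.isfinite(value):
            result[project] = value
    return result


# Made-up values, for illustration only.
print(freqs_by_project("0.25,-inf,0.001,"))
```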
The functional effect (maxFuncImpact) for each variant contains the Sequence Ontology (SO) ID for the greatest functional impact on the gene. This field contains a 0 when no SO terms are annotated on the variant.
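As a small illustration, a numeric maxFuncImpact value can be turned into the usual Sequence Ontology accession form, "SO:" followed by seven zero-padded digits. The function name is hypothetical, and the sketch assumes the field holds the bare integer part of the SO ID.

```python
from __future__ import annotations


def so_accession(max_func_impact: int) -> str | None:
    """Format a numeric SO ID as an accession; 0 means no SO term annotated."""
    if max_func_impact == 0:
        return None
    return f"SO:{max_func_impact:07d}"


# 1583 corresponds to missense_variant (SO:0001583).
print(so_accession(1583))
```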

UCSC also has an API that can be used to retrieve values from a particular chromosome range.
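A request URL for such a range query might be assembled as below. This is a sketch only: the endpoint, the parameter names, the semicolon separators, the 0-based start, and the track name dbSnp155 reflect my understanding of the UCSC REST API and are assumptions to check against the API documentation. The function name is hypothetical, and no request is sent.

```python
from urllib.parse import quote

# Assumed endpoint of the UCSC REST API for track data.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def api_track_url(genome: str, track: str, chrom: str, start: int, end: int) -> str:
    """Build a range query URL; start is assumed 0-based, end exclusive."""
    params = [
        ("genome", genome),
        ("track", track),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    query = ";".join(f"{key}={quote(str(value))}" for key, value in params)
    return f"{API_BASE}?{query}"


print(api_track_url("hg38", "dbSnp155", "chr1", 200000, 200400))
```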

A list of rs# IDs can be pasted/uploaded in the Variant Annotation Integrator tool to find out which genes (if any) the variants are located in, as well as functional effect such as intron, coding-synonymous, missense, frameshift, etc.

Please refer to our searchable mailing list archives for more questions and example queries, or our Data Access FAQ for more information.

References

Holmes JB, Moyer E, Phan L, Maglott D, Kattman B. SPDI: Data Model for Variants and Applications at NCBI. Bioinformatics. 2019 Nov 18. PMID: 31738401

Sayers EW, Agarwala R, Bolton EE, Brister JR, Canese K, Clark K, Connor R, Fiorini N, Funk K, Hefferon T et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2019 Jan 8;47(D1):D23-D28. PMID: 30395293; PMC: PMC6323993

Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11. PMID: 11125122; PMC: PMC29783