09a80236f1b65c47bb2887a2463986f152d2b191
dschmelt
Tue Jul 6 13:54:28 2021 -0700
Code review changes to Data Access and longLabel refs #27802

diff --git src/hg/makeDb/trackDb/dbSnpArchive.html src/hg/makeDb/trackDb/dbSnpArchive.html
index c8f4ef8..6394ef1 100755
--- src/hg/makeDb/trackDb/dbSnpArchive.html
+++ src/hg/makeDb/trackDb/dbSnpArchive.html
@@ -1,503 +1,509 @@

Description

This composite track contains information about single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) — collectively Simple Nucleotide Polymorphisms — from dbSNP, available from ftp.ncbi.nih.gov/snp. You can click into each track for a version/subset-specific description.

This collection includes numbered versions of the entire dbSNP datasets (All SNP) as well as three tracks with subsets of the items in that version. Here is information on each of the subsets:

The default maximum weight for this track is 1, so unless the setting is changed in the track controls, SNPs that map to multiple genomic locations will be omitted from display. When a SNP's flanking sequences map to multiple locations in the reference genome, it calls into question whether there is true variation at those sites, or whether the sequences at those sites are merely highly similar but not identical.

Interpreting and Configuring the Graphical Display

Variants are shown as single tick marks at most zoom levels. When viewing the track at or near base-level resolution, the displayed width of the SNP corresponds to the width of the variant in the reference sequence. Insertions are indicated by a single tick mark displayed between two nucleotides, single nucleotide polymorphisms are displayed as the width of a single base, and multiple nucleotide variants are represented by a block that spans two or more bases.

On the track controls page, SNPs can be colored and/or filtered from the display according to several attributes:

Several other properties do not have coloring options, but do have some filtering options:

You can configure this track such that the details page displays the function and coding differences relative to particular gene sets. Choose the gene sets from the list on the SNP configuration page displayed beneath this heading: On details page, show function and coding differences relative to. When one or more gene tracks are selected, the SNP details page lists all genes that the SNP hits (or is close to), with the same keywords used in the function category. The function usually agrees with NCBI's function, except when NCBI's functional annotation is relative to an XM_* predicted RefSeq (not included in the UCSC Genome Browser's RefSeq Genes track) and/or UCSC's functional annotation is relative to a transcript that is not in RefSeq.

Insertions/Deletions

dbSNP uses a class called 'in-del'. We compare the length of the reference allele to the length(s) of observed alleles; if the reference allele is shorter than all other observed alleles, we change 'in-del' to 'insertion'. Likewise, if the reference allele is longer than all other observed alleles, we change 'in-del' to 'deletion'.
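The length comparison above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not UCSC's actual pipeline code: the function name `refine_indel_class` is hypothetical, and treating "-" as an empty (zero-length) allele is an assumption.

```python
def refine_indel_class(ref_allele, observed_alleles):
    """Return 'insertion', 'deletion', or 'in-del' by comparing allele lengths.

    Assumption: '-' denotes an empty allele of length 0.
    """
    def length(allele):
        return 0 if allele == "-" else len(allele)

    ref_len = length(ref_allele)
    # Lengths of every observed allele other than the reference allele itself.
    others = [length(a) for a in observed_alleles if a != ref_allele]
    if not others:
        return "in-del"  # nothing to compare against; keep dbSNP's class
    if all(ref_len < n for n in others):
        return "insertion"  # reference is shorter than all other alleles
    if all(ref_len > n for n in others):
        return "deletion"  # reference is longer than all other alleles
    return "in-del"  # mixed case: leave the class unchanged
```

For example, a reference allele of "-" with observed alleles "-" and "AT" would be reclassified as an insertion, while a reference of "A" with observed alleles "-", "A" and "AT" stays 'in-del'.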

UCSC Re-alignment of flanking sequences

dbSNP determines the genomic locations of SNPs by aligning their flanking sequences to the genome. UCSC displays SNPs in the locations determined by dbSNP, but does not have access to the alignments on which dbSNP based its mappings. Instead, UCSC re-aligns the flanking sequences to the neighboring genomic sequence for display on SNP details pages. While the recomputed alignments may differ from dbSNP's alignments, they often are informative when UCSC has annotated an unusual condition.

Non-repetitive genomic sequence is shown in upper case like the flanking sequence, and a "|" indicates each match between genomic and flanking bases. Repetitive genomic sequence (annotated by RepeatMasker and/or the Tandem Repeats Finder with period <= 12) is shown in lower case, and matching bases are indicated by a "+".
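As a rough sketch of that display convention, the following Python function builds the row of match symbols for two already-aligned, equal-length strings. The function name `match_line` is hypothetical, and rendering a mismatch as a space is an assumption; the description above only specifies the symbols used for matches.

```python
def match_line(genomic, flanking):
    """Build the symbol row shown between genomic and flanking sequence.

    genomic:  aligned genomic bases; lower case marks repetitive sequence
    flanking: aligned flanking bases (upper case)
    """
    if len(genomic) != len(flanking):
        raise ValueError("sequences must be aligned to the same length")
    marks = []
    for g, f in zip(genomic, flanking):
        if g.upper() != f.upper():
            marks.append(" ")  # assumption: mismatches are left blank
        elif g.islower():
            marks.append("+")  # match inside repetitive sequence
        else:
            marks.append("|")  # match inside non-repetitive sequence
    return "".join(marks)
```

With genomic "ACgtT" and flanking "ACGAT", the symbol row has "|" under the first two bases, "+" under the repeat-masked match, a blank under the mismatch, and "|" under the final base.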

Data Sources and Methods

The data that comprise this track were extracted from database dump files and headers of fasta files downloaded from NCBI. The database dump files were downloaded from ftp://ftp.ncbi.nih.gov/snp/organisms/organism_tax_id/database/ (for human, organism_tax_id = human_9606; for mouse, organism_tax_id = mouse_10090). The fasta files were downloaded from ftp://ftp.ncbi.nih.gov/snp/organisms/organism_tax_id/rs_fasta/.

Data Access

-The raw data can be explored interactively with the Table Browser,
-Data Integrator, or Variant Annotation Integrator.
-For automated analysis, the genome annotation can be downloaded from the downloads server for hg38,
- hg19, mm10,
-susScr3,
-bosTau7, and galGal4
-(snp*.txt.gz) or the public MySQL server.
-You can also make queries using the UCSC Genome Browser JSON APIPlease refer to our mailing list archives
-for questions and example queries, or our Data Access FAQ for more information.
+The raw data can be explored interactively with the
+Table Browser,
+Data Integrator, or
+Variant Annotation Integrator.
+For automated analysis, the genome annotation files can be downloaded in their entirety for
+hg38,
+ hg19,
+and mm10 as
+(snp*.txt.gz).
+You can also make queries using the UCSC Genome Browser
+JSON API or
+public MySQL server. Please refer to our
+mailing list archives
+for questions and example queries, or our
+Data Access FAQ for more information.
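A JSON API query of the kind mentioned in the Data Access text might be assembled as in the Python sketch below. The helper name `track_query_url` is hypothetical. The `/getData/track` endpoint and its `genome`, `track`, `chrom`, `start` and `end` parameters are given as recalled from the UCSC REST API documentation and should be checked there. The track name `snp151` in the usage note is an assumption, since the page does not name a specific table.

```python
from urllib.parse import urlencode

# Base URL of the UCSC Genome Browser REST API.
API_BASE = "https://api.genome.ucsc.edu"


def track_query_url(genome, track, chrom, start, end):
    """Build a URL requesting items of one track in one genomic region."""
    params = {
        "genome": genome,  # assembly name, e.g. "hg38"
        "track": track,    # table/track name (assumed, not taken from the page)
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}/getData/track?{urlencode(params)}"
```

Calling `track_query_url("hg38", "snp151", "chr1", 10000, 20000)` yields a URL that can be fetched with any HTTP client; the response is expected to be JSON.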

Orthologous Alleles (human assemblies only)

For the human assembly, we provide a related table that contains orthologous alleles in the chimpanzee, orangutan and rhesus macaque reference genome assemblies. We use our liftOver utility to identify the orthologous alleles. The candidate human SNPs are a filtered list that meets the criteria:

In some cases the orthologous allele is unknown; these are set to 'N'. If a lift was not possible, we set the orthologous allele to '?' and the orthologous start and end position to 0 (zero).

Masked FASTA Files (human assemblies only)

FASTA files that have been modified to use IUPAC ambiguous nucleotide characters at each base covered by a single-base substitution are available for download in the genome's snp*Mask folder. Note that only single-base substitutions (no insertions or deletions) were used to mask the sequence, and these were filtered to exclude problematic SNPs.
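The masking idea can be illustrated with a short Python sketch that uses the standard IUPAC ambiguity codes. It is not the procedure used to build the snp*Mask files: the function name `mask_sequence`, the 0-based position dictionary, and the choice to include the reference base in the set of alleles are all assumptions made for illustration.

```python
# Standard IUPAC codes for sets of two or more unambiguous bases.
IUPAC_CODES = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def mask_sequence(seq, substitutions):
    """Replace bases covered by single-base substitutions with IUPAC codes.

    seq:           reference sequence as a string
    substitutions: dict mapping a 0-based position to a string of observed
                   alleles at that position, e.g. {1: "CT"}
    """
    masked = list(seq)
    for pos, alleles in substitutions.items():
        # Assumption: the reference base counts as one of the alleles.
        bases = frozenset(alleles.upper()) | {seq[pos].upper()}
        code = IUPAC_CODES.get(bases)
        if code is not None:  # skip single-allele or non-ACGT cases
            masked[pos] = code
    return "".join(masked)
```

For instance, masking "ACGT" with substitutions `{1: "CT", 2: "AG"}` gives "AYRT": position 1 becomes Y (C or T) and position 2 becomes R (A or G).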

References

Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 Jan 1;29(1):308-11. PMID: 11125122; PMC: PMC29783