src/hg/makeDb/trackDb/snp130.html 1.1
1.1 2009/05/19 04:56:13 angie
Initial description.
Index: src/hg/makeDb/trackDb/snp130.html
===================================================================
RCS file: src/hg/makeDb/trackDb/snp130.html
diff -N src/hg/makeDb/trackDb/snp130.html
--- /dev/null 1 Jan 1970 00:00:00 -0000
+++ src/hg/makeDb/trackDb/snp130.html 19 May 2009 04:56:13 -0000 1.1
@@ -0,0 +1,227 @@
+<H2>Description</H2>
+
+<P>
+This track contains
+<A HREF="http://www.ncbi.nlm.nih.gov/SNP/" target=_blank>dbSNP</A>
+build 130, available from
+<A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms" target=_blank>ftp.ncbi.nih.gov/snp</A>.
+</P>
+
+<H2>Interpreting and Configuring the Graphical Display</H2>
+<P>
+ Variants are shown as single tick marks at most zoom levels.
+ When viewing the track at or near base-level resolution, the displayed
+ width of the SNP corresponds to the width of the variant in the reference
+ sequence. Insertions are indicated by a single tick mark displayed between
+ two nucleotides, single nucleotide polymorphisms are displayed at the width
+ of a single base, and multiple nucleotide variants are represented by a
+ block that spans two or more bases.
+</P>
+
+<P>
+ The configuration categories reflect the following definitions (not all categories apply
+ to this assembly):
+</P>
+ <UL>
+
+ <LI>
+ <A name="Class"></A>
+ <B>Class</B>: Describes the observed alleles<BR>
+ <UL>
+ <LI><B>Single</B> - single nucleotide variation: all observed alleles are single nucleotides
+ (can have 2, 3 or 4 alleles)
+ <LI><B>In-del</B> - insertion/deletion
+ <LI><B>Heterozygous</B> - heterozygous (undetermined) variation: allele contains string '(heterozygous)'
+ <LI><B>Microsatellite</B> - the observed allele from dbSNP is variation in counts of short tandem repeats
+ <LI><B>Named</B> - the observed allele from dbSNP is given as a text name
+ <LI><B>No Variation</B> - no variation asserted for sequence
+ <LI><B>Mixed</B> - the cluster contains submissions from multiple classes
+ <LI><B>Multiple Nucleotide Polymorphism</B> - alleles of the same length, length &gt; 1, and drawn from the set {A,T,C,G}
+ <LI><B>Insertion</B> - the polymorphism is an insertion relative to the reference assembly
+ <LI><B>Deletion</B> - the polymorphism is a deletion relative to the reference assembly
+ <LI><B>Unknown</B> - no classification provided by data contributor
+ </UL>
+ </LI>
+
+
+ <LI>
+ <A name="Valid"></A>
+ <B><A HREF="http://www.ncbi.nlm.nih.gov/SNP/snp_legend.cgi?legend=validation"
+ target="_blank">Validation</A></B>: Method used to validate
+ the variant (<I>each variant may be validated by more than one method</I>)<BR>
+ <UL>
+ <LI><B>By Frequency</B> - at least one submitted SNP in cluster has frequency data submitted
+ <LI><B>By Cluster</B> - cluster has at least 2 submissions, with at least one submission assayed with a non-computational method
+ <LI><B>By Submitter</B> - at least one submitted SNP in cluster was validated by independent assay
+ <LI><B>By 2 Hit/2 Allele</B> - all alleles have been observed in at least 2 chromosomes
+ <LI><B>By HapMap</B> - validated by HapMap project
+ <LI><B>Unknown</B> - no validation has been reported for this variant
+ </UL>
+ </LI>
+ <LI>
+ <A name="Func"></A>
+ <B>Function</B>: Predicted functional role
+ (<I>each variant may have more than one functional role</I>)<BR>
+ <UL>
+ <LI><B>Locus Region</B> - variation within 2000 bases of gene, but not
+ in transcript (in build 127 and before, the keyword was
+ <TT>locus</TT>, but since build 128, the more specific terms
+ <TT>near-gene-3</TT> and <TT>near-gene-5</TT> are used)
+ <LI><B>Coding - Synonymous</B> - no change in peptide for allele with
+ respect to reference assembly (<TT>coding-synon</TT>)
+ <LI><B>Coding - Non-Synonymous</B> - change in peptide for allele with
+ respect to reference assembly (<TT>coding-nonsynon</TT> in build
+ 127; <TT>nonsense</TT>, <TT>missense</TT>, <TT>frameshift</TT>
+ since build 128)
+ <LI><B>Untranslated</B> - variation in transcript, but not in coding
+ region interval (<TT>untranslated</TT> in build 127;
+ <TT>untranslated-3</TT>, <TT>untranslated-5</TT> since build 128)
+ <LI><B>Intron</B> - variation in intron, but not in first two or last two bases of intron
+ <LI><B>Splice Site</B> - variation in first two or last two bases of
+ intron (<TT>splice-site</TT> in build 127; <TT>splice-3</TT>,
+ <TT>splice-5</TT> since build 128)
+ <LI><B>Reference (coding)</B> - allele observed in a coding region of
+ the reference sequence (<TT>cds-reference</TT>)
+ <LI><B>Unknown</B> - no known functional classification
+ </UL>
+ </LI>
+ <LI>
+ <A name="MolType"></A>
+ <B>Molecule Type</B>: Sample used to find this variant<BR>
+ <UL>
+ <LI><B>Genomic</B> - variant discovered using a genomic template
+ <LI><B>cDNA</B> - variant discovered using a cDNA template
+ <LI><B>Unknown</B> - sample type not known
+ </UL>
+ </LI>
+ <LI>
+ <A name="AvHet"></A>
+ <B>Average heterozygosity</B>: Calculated by dbSNP as described
+ <A HREF="http://www.ncbi.nlm.nih.gov/SNP/Hetfreq.html" target=_blank>here</A>
+ <UL>
+ <LI> Average heterozygosity should not exceed 0.5 for bi-allelic
+ single-base substitutions.
+ </UL>
+ </LI>
+ <LI>
+ <A name="Weight"></A>
+ <B>Weight</B>: Alignment quality assigned by dbSNP<BR>
+ <UL>
+ <LI>Weight can be 0, 1, 2, 3 or 10.
+ <LI>Alignments with weight = 1 are of the highest quality.
+ <LI>SNPs with weight = 0 or weight = 10 are excluded from the data set.
+ <LI>A filter on the maximum weight value is supported; it defaults to 3.
+ </UL>
+ </LI>
+ </UL>
+
+<H2>Insertions/Deletions</H2>
+<P>
+dbSNP uses a class called 'in-del'. We compare the length of the
+reference allele to the length(s) of observed alleles; if the
+reference allele is shorter than all other observed alleles, we change
+'in-del' to 'insertion'. Likewise, if the reference allele is longer
+than all other observed alleles, we change 'in-del' to 'deletion'.
+</P>
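The length comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not UCSC's actual code: the function name `refine_indel_class` is hypothetical, alleles are assumed to be passed as plain strings, and a `-` allele is assumed to denote a zero-length allele. When neither condition holds, the sketch leaves the class as 'in-del'.

```python
def refine_indel_class(ref_allele, observed):
    """Return 'insertion', 'deletion' or 'in-del' from allele lengths.

    Hypothetical helper: ref_allele is the reference allele and observed
    is a list of observed alleles, all given as strings.
    """
    def length(allele):
        # Assumption: '-' stands for an empty (zero-length) allele.
        return 0 if allele == "-" else len(allele)

    ref_len = length(ref_allele)
    # Lengths of the observed alleles other than the reference allele.
    others = [length(a) for a in observed if a != ref_allele]
    if not others:
        return "in-del"
    if all(ref_len < n for n in others):
        return "insertion"   # reference shorter than all other alleles
    if all(ref_len > n for n in others):
        return "deletion"    # reference longer than all other alleles
    return "in-del"          # mixed case: class left unchanged
```

For example, `refine_indel_class("-", ["-", "A"])` gives 'insertion', because the empty reference allele is shorter than the only other observed allele.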
+
+<H2>UCSC Annotations</H2>
+<P>
+UCSC checks for several unusual conditions that may indicate a problem
+with the mapping, and reports them in the Annotations section if found:
+</P>
+ <UL>
+ <LI>The dbSNP reference allele is not the same as the UCSC reference
+ allele, i.e. the bases in the mapped position range.</LI>
+ <LI>Class is single, in-del, mnp or mixed and the UCSC reference
+ allele does not match any observed allele.</LI>
+ <LI>In NCBI's alignment of flanking sequences to the genome, part
+ of the flanking sequence around the SNP does not align to
+ the genome.</LI>
+ <LI>Class is single, but the size of the mapped SNP is not one base.</LI>
+ <LI>Class is named and indicates an insertion or deletion, but the size
+ of the mapped SNP implies otherwise.</LI>
+ <LI>Class is single and the format of observed alleles is unexpected.</LI>
+ <LI>The length of the observed allele(s) is not available because it is
+ too long.</LI>
+ <LI>Multiple distinct insertion SNPs have been mapped to this location.</LI>
+ <LI>At least one observed allele contains an ambiguous IUPAC base
+ (e.g. R, Y, N).</LI>
+ </UL>
+
+<P>Another condition, which does not necessarily imply a problem, is also noted:</P>
+ <UL>
+ <LI>Class is single and SNP is tri-allelic or quad-allelic.</LI>
+ </UL>
+
+<H2>UCSC Re-alignment of flanking sequences</H2>
+<P>
+dbSNP determines the genomic locations of SNPs by aligning their flanking
+sequences to the genome.
+UCSC displays SNPs in the locations determined by dbSNP, but does not
+have access to the alignments on which dbSNP based its mappings.
+Instead, UCSC re-aligns the flanking sequences
+to the neighboring genomic sequence for display on SNP details pages.
+While the recomputed alignments may differ from dbSNP's alignments,
+they often are informative when UCSC has annotated an unusual condition.
+</P>
+
+<H2>Data Sources</H2>
+<P>
+The data that comprise this track were extracted from database dump files
+and headers of fasta files downloaded from NCBI.
+The database dump files were downloaded from
+<A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms/"
+TARGET=_BLANK>ftp://ftp.ncbi.nih.gov/snp/organisms/</A>
+<EM>organism</EM>_<EM>tax_id</EM>/database/
+(e.g. for Human, <EM>organism</EM>_<EM>tax_id</EM> = human_9606).
+The fasta files were downloaded from
+<A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms/"
+TARGET=_BLANK>ftp://ftp.ncbi.nih.gov/snp/organisms/</A>
+<EM>organism</EM>_<EM>tax_id</EM>/rs_fasta/.
+</P>
+ <UL>
+ <LI>Coordinates, orientation, location type and dbSNP reference allele data
+ were obtained from b130_SNPContigLoc_36_3.bcp.gz and
+ b130_SNPContigInfo_36_3.bcp.gz.
+ <LI>b130_SNPMapInfo_36_3.bcp.gz provided the alignment weights.
+ <LI>Functional classification was obtained from
+ b130_SNPContigLocusId_36_3.bcp.gz.
+ <LI>Validation status and heterozygosity were obtained from SNP.bcp.gz.
+ <LI>The header lines in the rs_fasta files were used for molecule type,
+ class and observed polymorphism.
+ </UL>
+
+<H2>Orthologous Alleles (human assemblies only)</H2>
+<P>
+Beginning with the March 2006 human assembly, we provide a related table that
+contains orthologous alleles in the chimpanzee and rhesus macaque assemblies.
+Beginning with dbSNP build 129, the orangutan assembly is also included.
+We use our liftOver utility to identify the orthologous alleles. The candidate human SNPs are
+filtered to those that meet the following criteria:
+<UL>
+<LI>have class = 'single'
+<LI>have chromEnd = chromStart + 1
+<LI>align to just one location
+<LI>are not aligned to a chrN_random chromosome
+<LI>are biallelic (not tri- or quad-allelic)
+</UL>
+
+In some cases the orthologous allele is unknown; these are set to 'N'.
+If a lift was not possible, we set the orthologous allele to '?' and the orthologous start and end
+position to 0 (zero).
+
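The candidate filter listed above could look roughly like the following Python sketch. The `Snp` record and the function `ortho_candidates` are hypothetical names introduced for illustration. Only `chromStart`, `chromEnd` and the class value 'single' come from the description. The sketch further assumes that "align to just one location" means the SNP name appears once in the input, and that observed alleles are given as a slash-separated string such as `A/G`.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Snp:
    # Hypothetical record; field names are illustrative only.
    name: str
    chrom: str
    chrom_start: int
    chrom_end: int
    snp_class: str
    observed: str  # assumed slash-separated, e.g. "A/G"


def ortho_candidates(snps):
    """Keep SNPs that satisfy all five criteria from the list above."""
    # Assumption: a SNP mapped to several locations appears once per mapping.
    mappings = Counter(s.name for s in snps)
    return [
        s for s in snps
        if s.snp_class == "single"
        and s.chrom_end == s.chrom_start + 1
        and mappings[s.name] == 1
        and not s.chrom.endswith("_random")
        and len(s.observed.split("/")) == 2
    ]
```

Counting names first keeps the multiple-mapping check independent of the order in which the records arrive.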
+<H2>Masked FASTA Files (human assemblies only)</H2>
+
+FASTA files that have been modified to use IUPAC ambiguous nucleotide characters at
+each base covered by a single-base substitution are available for download
+<A HREF="http://hgdownload.cse.ucsc.edu/goldenPath/hg18/snp130Mask/">here</A>.
+Note that only single-base substitutions (no insertions or deletions) were used
+to mask the sequence, and these were filtered to exclude problematic SNPs.
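As a rough illustration of this kind of masking, the Python sketch below replaces each base covered by a single-base substitution with the standard IUPAC ambiguity code for its observed alleles. It is not the procedure used to build the download files. The function name `mask_sequence`, the zero-based positions, and the slash-separated allele format are assumptions, and strand handling and the filtering of problematic SNPs are left out.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def mask_sequence(seq, snps):
    """Return seq with SNP positions replaced by IUPAC ambiguity codes.

    snps is an iterable of (position, observed) pairs, where position is
    zero-based and observed is a string such as "A/G" (both assumptions).
    """
    masked = list(seq)
    for pos, observed in snps:
        alleles = frozenset(observed.upper().split("/"))
        code = IUPAC.get(alleles)
        # Skip anything that is not a plain single-base substitution.
        if code is not None:
            masked[pos] = code
    return "".join(masked)
```

Here `mask_sequence("ACGT", [(1, "C/T")])` returns `"AYGT"`, since Y is the IUPAC code for C or T.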
+
+<H2>References</H2>
+<P>
+Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K.
+<A HREF="http://nar.oxfordjournals.org/cgi/content/abstract/29/1/308" TARGET=_blank>
+dbSNP: the NCBI database of genetic variation</A>.
+<em>Nucleic Acids Res. </em>2001 Jan 1;29(1):308-11.
+