src/hg/makeDb/trackDb/snp130.html 1.1

1.1 2009/05/19 04:56:13 angie
Initial description.
Index: src/hg/makeDb/trackDb/snp130.html
===================================================================
RCS file: src/hg/makeDb/trackDb/snp130.html
diff -N src/hg/makeDb/trackDb/snp130.html
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ src/hg/makeDb/trackDb/snp130.html	19 May 2009 04:56:13 -0000	1.1
@@ -0,0 +1,227 @@
+<H2>Description</H2>
+
+<P>
+This track contains
+<A HREF="http://www.ncbi.nlm.nih.gov/SNP/" target=_blank>dbSNP</A>
+build 130, available from
+<A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms" target=_blank>ftp.ncbi.nih.gov/snp</A>.
+</P>
+
+<H2>Interpreting and Configuring the Graphical Display</H2>
+<P>
+  Variants are shown as single tick marks at most zoom levels.
+  When viewing the track at or near base-level resolution, the displayed
+  width of the SNP corresponds to the width of the variant in the reference
+  sequence. Insertions are indicated by a single tick mark displayed between
+  two nucleotides, single nucleotide polymorphisms are displayed as the width 
+  of a single base, and multiple nucleotide variants are represented by a 
+  block that spans two or more bases.
+</P>
+
+<P>
+  The configuration categories reflect the following definitions (not all categories apply
+  to this assembly):
+</P>
+  <UL>
+
+    <LI>
+      <A name="Class"></A>
+      <B>Class</B>: Describes the observed alleles<BR>
+      <UL>
+        <LI><B>Single</B> - single nucleotide variation: all observed alleles are single nucleotides
+	    (can have 2, 3 or 4 alleles)
+        <LI><B>In-del</B> - insertion/deletion
+        <LI><B>Heterozygous</B> - heterozygous (undetermined) variation: allele contains string '(heterozygous)'
+        <LI><B>Microsatellite</B> - the observed allele from dbSNP is variation in counts of short tandem repeats
+        <LI><B>Named</B> - the observed allele from dbSNP is given as a text name
+        <LI><B>No Variation</B> - no variation asserted for sequence
+        <LI><B>Mixed</B> - the cluster contains submissions from multiple classes
+        <LI><B>Multiple Nucleotide Polymorphism</B> - all alleles have the same length (greater than 1) and consist only of bases from the set {A,T,C,G}
+        <LI><B>Insertion</B> - the polymorphism is an insertion relative to the reference assembly
+        <LI><B>Deletion</B> - the polymorphism is a deletion relative to the reference assembly
+        <LI><B>Unknown</B> - no classification provided by data contributor
+      </UL>
+    </LI>
+
+
+    <LI>
+      <A name="Valid"></A>
+      <B><A HREF="http://www.ncbi.nlm.nih.gov/SNP/snp_legend.cgi?legend=validation" 
+	target="_blank">Validation</A></B>: Method used to validate
+	the variant (<I>each variant may be validated by more than one method</I>)<BR>
+        <UL>
+        <LI><B>By Frequency</B> - at least one submitted SNP in cluster has frequency data submitted
+        <LI><B>By Cluster</B> - cluster has at least 2 submissions, with at least one submission assayed with a non-computational method
+        <LI><B>By Submitter</B> - at least one submitted SNP in cluster was validated by independent assay
+        <LI><B>By 2 Hit/2 Allele</B> - all alleles have been observed in at least 2 chromosomes
+        <LI><B>By HapMap</B> - validated by HapMap project
+        <LI><B>Unknown</B> - no validation has been reported for this variant
+      </UL>
+    </LI>
+    <LI>
+      <A name="Func"></A>
+      <B>Function</B>: Predicted functional role 
+	(<I>each variant may have more than one functional role</I>)<BR>
+      <UL>
+        <LI><B>Locus Region</B> - variation within 2000 bases of gene, but not 
+	    in transcript (in build 127 and before, the keyword was 
+	    <TT>locus</TT>, but since build 128, the more specific terms 
+	    <TT>near-gene-3</TT> and <TT>near-gene-5</TT> are used)
+        <LI><B>Coding - Synonymous</B> - no change in peptide for allele with 
+	    respect to reference assembly (<TT>coding-synon</TT>)
+        <LI><B>Coding - Non-Synonymous</B> - change in peptide for allele with 
+	    respect to reference assembly (<TT>coding-nonsynon</TT> in build 
+	    127; <TT>nonsense</TT>, <TT>missense</TT>, <TT>frameshift</TT> 
+	    since build 128)
+        <LI><B>Untranslated</B> - variation in transcript, but not in coding 
+	    region interval (<TT>untranslated</TT> in build 127; 
+	    <TT>untranslated-3</TT>, <TT>untranslated-5</TT> since build 128)
+        <LI><B>Intron</B> - variation in intron, but not in first two or last two bases of intron
+        <LI><B>Splice Site</B> - variation in first two or last two bases of 
+	    intron (<TT>splice-site</TT> in build 127; <TT>splice-3</TT>, 
+	    <TT>splice-5</TT> since build 128)
+        <LI><B>Reference (coding)</B> - allele observed in a coding region of 
+	    the reference sequence (<TT>cds-reference</TT>)
+        <LI><B>Unknown</B> - no known functional classification
+      </UL>
+    </LI>
+    <LI>
+      <A name="MolType"></A>
+      <B>Molecule Type</B>: Sample used to find this variant<BR>
+      <UL>
+        <LI><B>Genomic</B> - variant discovered using a genomic template
+        <LI><B>cDNA</B> - variant discovered using a cDNA template
+        <LI><B>Unknown</B> - sample type not known
+      </UL>
+    </LI>
+    <LI>
+      <A name="AvHet"></A>
+      <B>Average heterozygosity</B>: Calculated by dbSNP as described 
+      <A HREF="http://www.ncbi.nlm.nih.gov/SNP/Hetfreq.html" target=_blank>here</A>
+      <UL>
+      <LI> Average heterozygosity should not exceed 0.5 for bi-allelic 
+           single-base substitutions.
+      </UL>
+    </LI>
+    <LI>
+      <A name="Weight"></A>
+      <B>Weight</B>: Alignment quality assigned by dbSNP<BR>
+      <UL>
+      <LI>Weight can be 0, 1, 2, 3 or 10.
+      <LI>Alignments with weight = 1 are of the highest quality.
+      <LI>Variants with weight = 0 or weight = 10 are excluded from the data set.
+      <LI>A filter on maximum weight is supported; it defaults to 3.
+      </UL>
+    </LI>
+  </UL>
+
+<H2>Insertions/Deletions</H2>
+<P>
+dbSNP uses a class called 'in-del'.  We compare the length of the
+reference allele to the length(s) of observed alleles; if the
+reference allele is shorter than all other observed alleles, we change
+'in-del' to 'insertion'.  Likewise, if the reference allele is longer
+than all other observed alleles, we change 'in-del' to 'deletion'.
+</P>
+
+<H2>UCSC Annotations</H2>
+<P>
+UCSC checks for several unusual conditions that may indicate a problem 
+with the mapping, and reports them in the Annotations section if found:
+</P>
+  <UL>
+  <LI>The dbSNP reference allele is not the same as the UCSC reference
+      allele, i.e. the bases in the mapped position range.</LI>
+  <LI>Class is single, in-del, mnp or mixed and the UCSC reference
+      allele does not match any observed allele.</LI>
+  <LI>In NCBI's alignment of flanking sequences to the genome, part
+      of the flanking sequence around the SNP does not align to
+      the genome.</LI>
+  <LI>Class is single, but the size of the mapped SNP is not one base.</LI>
+  <LI>Class is named and indicates an insertion or deletion, but the size
+      of the mapped SNP implies otherwise.</LI>
+  <LI>Class is single and the format of observed alleles is unexpected.</LI>
+  <LI>The length of the observed allele(s) is not available because it is
+      too long.</LI>
+  <LI>Multiple distinct insertion SNPs have been mapped to this location.</LI>
+  <LI>At least one observed allele contains an ambiguous IUPAC base 
+      (e.g. R, Y, N).</LI>
+  </UL>
+
+<P>Another condition, which does not necessarily indicate a problem, is also noted:</P>
+  <UL>
+  <LI>Class is single and SNP is tri-allelic or quad-allelic.</LI>
+  </UL>
+
+<H2>UCSC Re-alignment of flanking sequences</H2>
+<P>
+dbSNP determines the genomic locations of SNPs by aligning their flanking 
+sequences to the genome.
+UCSC displays SNPs in the locations determined by dbSNP, but does not
+have access to the alignments on which dbSNP based its mappings.
+Instead, UCSC re-aligns the flanking sequences 
+to the neighboring genomic sequence for display on SNP details pages.  
+While the recomputed alignments may differ from dbSNP's alignments,
+they often are informative when UCSC has annotated an unusual condition.
+</P>
+
+<H2>Data Sources</H2>
+<P>
+The data that comprise this track were extracted from database dump files 
+and headers of fasta files downloaded from NCBI.  
+The database dump files were downloaded from 
+<A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms/"
+TARGET=_BLANK>ftp://ftp.ncbi.nih.gov/snp/organisms/</A>
+<EM>organism</EM>_<EM>tax_id</EM>/database/
+(e.g. for Human, <EM>organism</EM>_<EM>tax_id</EM> = human_9606).
+The fasta files were downloaded from 
+<A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms/"
+TARGET=_BLANK>ftp://ftp.ncbi.nih.gov/snp/organisms/</A>
+<EM>organism</EM>_<EM>tax_id</EM>/rs_fasta/
+</P>
+  <UL>
+  <LI>Coordinates, orientation, location type and dbSNP reference allele data
+      were obtained from b130_SNPContigLoc_36_3.bcp.gz and 
+      b130_SNPContigInfo_36_3.bcp.gz.  
+  <LI>b130_SNPMapInfo_36_3.bcp.gz provided the alignment weights.
+  <LI>Functional classification was obtained from 
+      b130_SNPContigLocusId_36_3.bcp.gz.
+  <LI>Validation status and heterozygosity were obtained from SNP.bcp.gz.
+  <LI>The header lines in the rs_fasta files were used for molecule type,
+      class and observed polymorphism.
+  </UL>
+
+<H2>Orthologous Alleles (human assemblies only)</H2>
+<P>
+Beginning with the March 2006 human assembly, we provide a related table that 
+contains orthologous alleles in the chimpanzee and rhesus macaque assemblies.
+Beginning with dbSNP build 129, the orangutan assembly is also included.
+We use our liftOver utility to identify the orthologous alleles.  The candidate human SNPs are 
+filtered to those that meet all of the following criteria:
+<UL>
+<LI>class = 'single'
+<LI>chromEnd = chromStart + 1
+<LI>aligned to just one location
+<LI>not aligned to a chrN_random chrom
+<LI>biallelic (not tri- or quad-allelic)
+</UL>
+
+<P>In some cases the orthologous allele is unknown; these are set to 'N'.
+If a lift was not possible, we set the orthologous allele to '?' and the orthologous start and end 
+position to 0 (zero).</P>
+
+<H2>Masked FASTA Files (human assemblies only)</H2>
+
+<P>FASTA files that have been modified to use IUPAC ambiguous nucleotide characters at
+each base covered by a single-base substitution are available for download
+<A HREF="http://hgdownload.cse.ucsc.edu/goldenPath/hg18/snp130Mask/">here</A>.
+Note that only single-base substitutions (no insertions or deletions) were used
+to mask the sequence, and these were filtered to exclude problematic SNPs.</P>
+
+<H2>References</H2>
+<P>
+Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. 
+<A HREF="http://nar.oxfordjournals.org/cgi/content/abstract/29/1/308" TARGET=_blank>
+dbSNP: the NCBI database of genetic variation</A>.
+<em>Nucleic Acids Res.</em> 2001 Jan 1;29(1):308-11.</P>
+
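
Note on the Insertions/Deletions section of the new page: the reclassification rule it describes can be sketched roughly as below. This is only an illustration placed after the diff so the hunk itself stays intact. The function name `reclassify_indel`, the use of `-` for an empty allele, and the choice to leave the class unchanged when neither condition holds are assumptions, not taken from the UCSC build code.

```python
def reclassify_indel(snp_class, ref_allele, observed_alleles):
    """Refine dbSNP class 'in-del' into 'insertion' or 'deletion'.

    Hypothetical helper: compares the reference allele length with the
    lengths of the other observed alleles, as the page text describes.
    """
    if snp_class != "in-del":
        return snp_class

    def allele_len(allele):
        # Assumption: '-' stands for an empty (zero-length) allele.
        return 0 if allele == "-" else len(allele)

    ref_len = allele_len(ref_allele)
    other_lens = [allele_len(a) for a in observed_alleles if a != ref_allele]
    if not other_lens:
        # Nothing to compare against; keep dbSNP's class (assumption).
        return snp_class

    if all(ref_len < n for n in other_lens):
        return "insertion"
    if all(ref_len > n for n in other_lens):
        return "deletion"
    # Mixed case: some alleles longer, some shorter, or equal (assumption).
    return snp_class
```

For example, a reference allele of `-` with observed alleles `-` and `AT` would come out as an insertion, while a reference of `AT` with the same observed alleles would come out as a deletion.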