src/hg/makeDb/trackDb/snp128.html 1.11

1.11 2009/06/22 17:57:22 kuhn
removed capitals on Single Nuc Polymorphism
Index: src/hg/makeDb/trackDb/snp128.html
===================================================================
RCS file: /projects/compbio/cvsroot/kent/src/hg/makeDb/trackDb/snp128.html,v
retrieving revision 1.10
retrieving revision 1.11
diff -b -B -U 1000000 -r1.10 -r1.11
--- src/hg/makeDb/trackDb/snp128.html	22 Jun 2009 17:55:14 -0000	1.10
+++ src/hg/makeDb/trackDb/snp128.html	22 Jun 2009 17:57:22 -0000	1.11
@@ -1,228 +1,228 @@
 <H2>Description</H2>
 
 <P>
-This track contains information about Single Nucleotide Polymorphisms
+This track contains information about single nucleotide polymorphisms
 and small insertions and deletions (indels) &mdash; collectively Simple
 Nucleotide Polymorphisms &mdash; from
 <A HREF="http://www.ncbi.nlm.nih.gov/SNP/" target=_blank>dbSNP</A>
 build 128, available from
 <A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms" target=_blank>ftp.ncbi.nih.gov/snp</A>.
 </P>
 
 <H2>Interpreting and Configuring the Graphical Display</H2>
 <P>
   Variants are shown as single tick marks at most zoom levels.
   When viewing the track at or near base-level resolution, the displayed
   width of the SNP corresponds to the width of the variant in the reference
   sequence. Insertions are indicated by a single tick mark displayed between
   two nucleotides, single nucleotide polymorphisms are displayed as the width 
   of a single base, and multiple nucleotide variants are represented by a 
   block that spans two or more bases.
 </P>
 
 <P>
   The configuration categories reflect the following definitions (not all categories apply
   to this assembly):
 </P>
   <UL>
 
     <LI>
       <A name="Class"></A>
       <B>Class</B>: Describes the observed alleles<BR>
       <UL>
         <LI><B>Single</B> - single nucleotide variation: all observed alleles are single nucleotides
 	    (can have 2, 3 or 4 alleles)
         <LI><B>In-del</B> - insertion/deletion
         <LI><B>Heterozygous</B> - heterozygous (undetermined) variation: allele contains string '(heterozygous)'
         <LI><B>Microsatellite</B> - the observed allele from dbSNP is variation in counts of short tandem repeats
         <LI><B>Named</B> - the observed allele from dbSNP is given as a text name
         <LI><B>No Variation</B> - no variation asserted for sequence
         <LI><B>Mixed</B> - the cluster contains submissions from multiple classes
         <LI><B>Multiple Nucleotide Polymorphism</B> - alleles of the same length, length &gt; 1, and drawn from the set {A,T,C,G}
         <LI><B>Insertion</B> - the polymorphism is an insertion relative to the reference assembly
         <LI><B>Deletion</B> - the polymorphism is a deletion relative to the reference assembly
         <LI><B>Unknown</B> - no classification provided by data contributor
       </UL>
     </LI>
 
 
     <LI>
       <A name="Valid"></A>
       <B><A HREF="http://www.ncbi.nlm.nih.gov/SNP/snp_legend.cgi?legend=validation" 
 	target="_blank">Validation</A></B>: Method used to validate
 	the variant (<I>each variant may be validated by more than one method</I>)<BR>
         <UL>
         <LI><B>By Frequency</B> - at least one submitted SNP in cluster has frequency data submitted
         <LI><B>By Cluster</B> - cluster has at least 2 submissions, with at least one submission assayed with a non-computational method
         <LI><B>By Submitter</B> - at least one submitter SNP in cluster was validated by independent assay
         <LI><B>By 2 Hit/2 Allele</B> - all alleles have been observed in at least 2 chromosomes
         <LI><B>By HapMap</B> - validated by HapMap project
         <LI><B>Unknown</B> - no validation has been reported for this variant
       </UL>
     </LI>
     <LI>
       <A name="Func"></A>
       <B>Function</B>: Predicted functional role 
 	(<I>each variant may have more than one functional role</I>)<BR>
       <UL>
         <LI><B>Locus Region</B> - variation within 2000 bases of gene, but not 
 	    in transcript (in build 127 and before, the keyword was 
 	    <TT>locus</TT>, but in build 128, the more specific terms 
 	    <TT>near-gene-3</TT> and <TT>near-gene-5</TT> are used)
         <LI><B>Coding - Synonymous</B> - no change in peptide for allele with 
 	    respect to reference assembly (<TT>coding-synon</TT>)
         <LI><B>Coding - Non-Synonymous</B> - change in peptide for allele with 
 	    respect to reference assembly (<TT>coding-nonsynon</TT> in build 
 	    127; <TT>nonsense</TT>, <TT>missense</TT>, <TT>frameshift</TT> 
 	    in build 128)
         <LI><B>Untranslated</B> - variation in transcript, but not in coding 
 	    region interval (<TT>untranslated</TT> in build 127; 
 	    <TT>untranslated-3</TT>, <TT>untranslated-5</TT> in build 128)
         <LI><B>Intron</B> - variation in intron, but not in first two or last two bases of intron
         <LI><B>Splice Site</B> - variation in first two or last two bases of 
 	    intron (<TT>splice-site</TT> in build 127; <TT>splice-3</TT>, 
 	    <TT>splice-5</TT> in build 128)
         <LI><B>Reference (coding)</B> - allele observed in a coding region of 
 	    the reference sequence (<TT>cds-reference</TT>)
         <LI><B>Unknown</B> - no known functional classification
       </UL>
     </LI>
     <LI>
       <A name="MolType"></A>
       <B>Molecule Type</B>: Sample used to find this variant<BR>
       <UL>
         <LI><B>Genomic</B> - variant discovered using a genomic template
         <LI><B>cDNA</B> - variant discovered using a cDNA template
         <LI><B>Unknown</B> - sample type not known
       </UL>
     </LI>
     <LI>
       <A name="AvHet"></A>
       <B>Average heterozygosity</B>: Calculated by dbSNP as described 
       <A HREF="http://www.ncbi.nlm.nih.gov/SNP/Hetfreq.html" target=_blank>here</A>
       <UL>
       <LI> Average heterozygosity should not exceed 0.5 for bi-allelic 
            single-base substitutions.
       </UL>
     </LI>
     <LI>
       <A name="Weight"></A>
       <B>Weight</B>: Alignment quality assigned by dbSNP<BR>
       <UL>
       <LI>Weight can be 0, 1, 2, 3 or 10.   
       <LI>Alignments with weight = 1 are the highest quality.
       <LI>Weight = 0 and weight = 10 are excluded from the data set.
       <LI>A filter on maximum weight value is supported, which defaults to 3.
       </UL>
     </LI>
   </UL>
 
 <H2>Insertions/Deletions</H2>
 <P>
 dbSNP uses a class called 'in-del'.  We compare the length of the
 reference allele to the length(s) of observed alleles; if the
 reference allele is shorter than all other observed alleles, we change
 'in-del' to 'insertion'.  Likewise, if the reference allele is longer
 than all other observed alleles, we change 'in-del' to 'deletion'.
 </P>
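 <P>
 The comparison described above can be sketched as follows. This is an
 illustration only, not UCSC's actual code: the function names are
 hypothetical, and it assumes that an empty allele is written as '-' and
 counts as length zero.
 </P>

```python
def allele_length(allele):
    """Length of an allele; '-' is assumed to denote an empty allele."""
    return 0 if allele == "-" else len(allele)


def refine_indel_class(ref_allele, observed_alleles):
    """Refine dbSNP's 'in-del' class to 'insertion' or 'deletion'.

    Hypothetical helper: returns 'in-del' unchanged when the reference
    allele is neither shorter nor longer than all other observed alleles.
    """
    others = [a for a in observed_alleles if a != ref_allele]
    if not others:
        return "in-del"
    ref_len = allele_length(ref_allele)
    other_lens = [allele_length(a) for a in others]
    if all(ref_len < n for n in other_lens):
        return "insertion"   # reference is shorter than every other allele
    if all(ref_len > n for n in other_lens):
        return "deletion"    # reference is longer than every other allele
    return "in-del"          # mixed case: keep dbSNP's class
```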
 
 <H2>UCSC Annotations</H2>
 <P>
 UCSC checks for several unusual conditions that may indicate a problem 
 with the mapping, and reports them in the Annotations section if found:
 </P>
   <UL>
   <LI>The dbSNP reference allele is not the same as the UCSC reference
       allele, i.e. the bases in the mapped position range.</LI>
   <LI>Class is single, in-del, mnp or mixed and the UCSC reference
       allele does not match any observed allele.</LI>
   <LI>In NCBI's alignment of flanking sequences to the genome, part
       of the flanking sequence around the SNP does not align to
       the genome.</LI>
   <LI>Class is single, but the size of the mapped SNP is not one base.</LI>
   <LI>Class is named and indicates an insertion or deletion, but the size
       of the mapped SNP implies otherwise.</LI>
   <LI>Class is single and the format of observed alleles is unexpected.</LI>
   <LI>The length of the observed allele(s) is not available because it is
       too long.</LI>
   <LI>Multiple distinct insertion SNPs have been mapped to this location.</LI>
   <LI>At least one observed allele contains an ambiguous IUPAC base 
       (e.g. R, Y, N).</LI>
   </UL>
 
 Another condition, which does not necessarily imply any problem, is noted:
   <UL>
   <LI>Class is single and SNP is tri-allelic or quad-allelic.</LI>
   </UL>
 
 <H2>UCSC Re-alignment of flanking sequences</H2>
 <P>
 dbSNP determines the genomic locations of SNPs by aligning their flanking 
 sequences to the genome.
 UCSC displays SNPs in the locations determined by dbSNP, but does not
 have access to the alignments on which dbSNP based its mappings.
 Instead, UCSC re-aligns the flanking sequences 
 to the neighboring genomic sequence for display on SNP details pages.  
 While the recomputed alignments may differ from dbSNP's alignments,
 they often are informative when UCSC has annotated an unusual condition.
 </P>
 
 <H2>Data Sources</H2>
 <P>
 The data that comprise this track were extracted from database dump files 
 and headers of fasta files downloaded from NCBI.  
 The database dump files were downloaded from 
 <A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms/"
 TARGET=_BLANK>ftp://ftp.ncbi.nih.gov/snp/organisms/</A>
 <EM>organism</EM>_<EM>tax_id</EM>/database/
 (e.g. for Human, <EM>organism</EM>_<EM>tax_id</EM> = human_9606).
 The fasta files were downloaded from 
 <A HREF="ftp://ftp.ncbi.nih.gov/snp/organisms/"
 TARGET=_BLANK>ftp://ftp.ncbi.nih.gov/snp/organisms/</A>
 <EM>organism</EM>_<EM>tax_id</EM>/rs_fasta/
 </P>
   <UL>
   <LI>Coordinates, orientation, location type and dbSNP reference allele data
       were obtained from b128_SNPContigLoc_36_2.bcp.gz and 
       b128_SNPContigInfo_36_2.bcp.gz.  
   <LI>b128_SNPMapInfo_36_2.bcp.gz provided the alignment weights.
   <LI>Functional classification was obtained from 
       b128_SNPContigLocusId_36_2.bcp.gz.
   <LI>Validation status and heterozygosity were obtained from SNP.bcp.gz.
   <LI>The header lines in the rs_fasta files were used for molecule type,
       class and observed polymorphism.
   </UL>
 
 <H2>Orthologous Alleles (human only)</H2>
 <P>
 Beginning with the March 2006 human assembly, we provide a related table that 
 contains orthologous alleles in the chimpanzee and rhesus macaque assemblies.
 We use our liftOver utility to identify the orthologous alleles.  The candidate human SNPs are 
 a filtered list that meets the following criteria:
 <UL>
 <LI>class = 'single'
 <LI>chromEnd = chromStart + 1
 <LI>align to just one location
 <LI>are not aligned to a chrN_random chromosome
 <LI>are biallelic (not tri-allelic or quad-allelic)
 </UL>
 
 In some cases the orthologous allele is unknown; these are set to 'N'.
 If a lift was not possible, we set the orthologous allele to '?' and the orthologous start and end 
 position to 0 (zero).
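 <P>
 As a rough illustration, the candidate filter listed above could be
 expressed as a single predicate. The record layout below is hypothetical:
 only chromStart and chromEnd are named in this description, and the sketch
 assumes observed alleles are stored as a slash-separated string such as
 'A/G' and that the number of mapped locations is supplied by the caller.
 </P>

```python
from dataclasses import dataclass


@dataclass
class Snp:
    """Hypothetical SNP record; field names are illustrative assumptions."""
    name: str
    chrom: str
    chrom_start: int
    chrom_end: int
    snp_class: str
    observed: str  # assumed format: alleles separated by '/', e.g. 'A/G'


def is_ortholog_candidate(snp, mapping_count):
    """Return True if the SNP meets all five criteria listed above."""
    return (
        snp.snp_class == "single"
        and snp.chrom_end == snp.chrom_start + 1
        and mapping_count == 1                     # aligns to just one location
        and not snp.chrom.endswith("_random")      # not on a chrN_random sequence
        and len(snp.observed.split("/")) == 2      # biallelic
    )
```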
 
 <H2>Masked FASTA Files (human only)</H2>
 
 FASTA files that have been modified to use IUPAC ambiguous nucleotide characters at
 each base covered by a single-base substitution are available for download
 <A HREF="http://hgdownload.cse.ucsc.edu/goldenPath/hg18/snp128Mask/">here</A>.
 Note that only single-base substitutions (no insertions or deletions) were used
 to mask the sequence, and these were filtered to exclude problematic SNPs.
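 <P>
 The masking idea can be sketched as below. This is not the procedure used
 to build the downloadable files: the function name and input format are
 hypothetical, positions are assumed to be zero-based, strand is ignored,
 and each masked base is assumed to be the standard IUPAC code for the set
 of observed alleles at that position.
 </P>

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
    frozenset("ACGT"): "N",
}


def mask_sequence(seq, snps):
    """Replace each SNP position with the IUPAC code for its alleles.

    seq  -- the unmasked sequence as a string
    snps -- iterable of (zero_based_position, alleles) pairs, where alleles
            is an iterable of single bases (hypothetical input format)
    """
    masked = list(seq)
    for pos, alleles in snps:
        bases = frozenset(a.upper() for a in alleles)
        masked[pos] = IUPAC[bases]
    return "".join(masked)
```

 <P>
 For example, under these assumptions
 <TT>mask_sequence("ACGT", [(1, "CT"), (3, "AGT")])</TT> yields
 <TT>AYGD</TT>.
 </P>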
 
 <H2>References</H2>
 <P>
 Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. 
 <A HREF="http://nar.oxfordjournals.org/cgi/content/abstract/29/1/308" TARGET=_blank>
 dbSNP: the NCBI database of genetic variation</A>.
 <em>Nucleic Acids Res. </em>2001 Jan 1;29(1):308-11.