src/hg/makeDb/trackDb/cpgIslandSuper.html c572830f58d57d62ce6d87db55efeedaa5bf6f7a

c572830f58d57d62ce6d87db55efeedaa5bf6f7a
hiram
  Wed Nov 17 10:51:49 2021 -0800
add note about the calculation command and where to obtain source for the program refs #28401

diff --git src/hg/makeDb/trackDb/cpgIslandSuper.html src/hg/makeDb/trackDb/cpgIslandSuper.html
index 8106cfc..f139f5b 100644
--- src/hg/makeDb/trackDb/cpgIslandSuper.html
+++ src/hg/makeDb/trackDb/cpgIslandSuper.html
@@ -1,86 +1,102 @@
 <h2>Description</h2>
 
 <p>CpG islands are associated with genes, particularly housekeeping
 genes, in vertebrates.  CpG islands are typically common near
 transcription start sites and may be associated with promoter
 regions.  Normally a C (cytosine) base followed immediately by a 
 G (guanine) base (a CpG) is rare in
 vertebrate DNA because the Cs in such an arrangement tend to be
 methylated.  This methylation helps distinguish the newly synthesized
 DNA strand from the parent strand, which aids in the final stages of
 DNA proofreading after duplication.  However, over evolutionary time,
 methylated Cs tend to turn into Ts because of spontaneous
 deamination.  The result is that CpGs are relatively rare unless
 there is selective pressure to keep them or a region is not methylated
 for some other reason, perhaps having to do with the regulation of gene
 expression.  CpG islands are regions where CpGs are present at
 significantly higher levels than is typical for the genome as a whole.</p>
 
 <p>
 The unmasked version of the track displays potential CpG islands
 that exist in repeat regions and would otherwise not be visible
 in the repeat masked version.
 </p>
 
 <p>
 By default, only the masked version of the track is displayed.  To view the
 unmasked version, change the visibility settings in the track controls at
 the top of this page.
 </p>
 
 <h2>Methods</h2>
 
 <p>CpG islands were predicted by searching the sequence one base at a
 time, scoring each dinucleotide (+17 for CG and -1 for others) and
 identifying maximally scoring segments.  Each segment was then
 evaluated for the following criteria:
 
 <ul>
 	<li>GC content of 50% or greater</li>
 	<li>length greater than 200 bp</li>
 	<li>ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the 
 	basis of the number of Gs and Cs in the segment</li>
 </ul>
 </p>
 <p>
 The entire genome sequence, masking areas included, was
 used for the construction of the  track <em>Unmasked CpG</em>.
 The track <em>CpG Islands</em> is constructed on the sequence after
 all masked sequence is removed.
 </p>
 
 <p>The CpG count is the number of CG dinucleotides in the island.  
 The Percentage CpG is the ratio of CpG nucleotide bases
 (twice the CpG count) to the length.  The ratio of observed to expected 
 CpG is calculated according to the formula (cited in 
 Gardiner-Garden <em>et al</em>. (1987)):
 
 <pre>    Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)</pre>
 
 where N = length of sequence.</p>
+<p>
+The calculation of the track data is performed by the following command sequence:
+<pre>
+twoBitToFa <em>assembly.2bit</em> stdout | maskOutFa stdin hard stdout \
+  | cpg_lh /dev/stdin 2&gt; cpg_lh.err \
+    |  awk '{&dollar;2 = &dollar;2 - 1; width = &dollar;3 - &dollar;2;  printf("%s\t%d\t%s\t%s %s\t%s\t%s\t%0.0f\t%0.1f\t%s\t%s\n", &dollar;1, &dollar;2, &dollar;3, &dollar;5, &dollar;6, width, &dollar;6, width*&dollar;7*0.01, 100.0*2*&dollar;6/width, &dollar;7, &dollar;9);}' \
+     | sort -k1,1 -k2,2n &gt; cpgIsland.bed
+</pre>
+The <em>unmasked</em> track data is constructed from
+<em>twoBitToFa -noMask</em> output for the <em>twoBitToFa</em> command.
+</p>
 
 <h2>Data access</h2>
 <p>
 CpG islands and its associated tables can be explored interactively using the
 <a href="../goldenPath/help/api.html" target="_blank">REST API</a>, the
 <a href="/cgi-bin/hgTables" target="_blank">Table Browser</a> or the
 <a href="/cgi-bin/hgIntegrator" target="_blank">Data Integrator</a>.
 All the tables can also be queried directly from our public MySQL
 servers, with more information available on our
 <a target="_blank" href="/goldenPath/help/mysql.html">help page</a> as well as on
 <a target="_blank" href="http://genome.ucsc.edu/blog/tag/mysql/">our blog</a>.</p>
+<p>
+The source for the <em>cpg_lh</em> program can be obtained from
+<a href="https://genome-source.gi.ucsc.edu/gitlist/kent.git/tree/master/src/utils/cpgIslandExt/" target=_blank>src/utils/cpgIslandExt/</a>.
+The <em>cpg_lh</em> program binary can be obtained from: <a href="http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh" download="cpg_lh">http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/cpg_lh</a> (choose "save file")
+</p>
 
 <h2>Credits</h2>
 
 <p>This track was generated using a modification of a program developed by G. Miklem and L. Hillier 
 (unpublished).</p>
 
 <h2>References</h2>
 
 <p>
 Gardiner-Garden M, Frommer M.
 <a href="https://www.sciencedirect.com/science/article/pii/0022283687906899" target="_blank">
 CpG islands in vertebrate genomes</a>.
 <em>J Mol Biol</em>. 1987 Jul 20;196(2):261-82.
 PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/3656447" target="_blank">3656447</a>
 </p>