d43ec6011a8adc1ba82e3b2c2473e2043d300c03 hiram Thu Apr 19 14:10:07 2012 -0700
UCSC source from CVS source tree 2005 version

diff --git src/utils/cpgIslandExt/README src/utils/cpgIslandExt/README
new file mode 100644
index 0000000..0d98529
--- /dev/null
+++ src/utils/cpgIslandExt/README
@@ -0,0 +1,26 @@
+README (added at UCSC)
+
+"make" --> gcc readseq.c cpg_lh.c -o cpglh.exe
+
+We've been running this on hard-masked sequence (RepeatMasker and TRF
+with period <= 12 results are masked to 'N') in order to avoid CpG
+islands in Alu repeats in human.
+
+The original cpg.c was written by Gos Miklem from the Sanger Center.
+LaDeana Hillier added some modifications --> cpg_lh.c, and UCSC has
+added some further modifications to cpg_lh.c, so that its expected
+number of CpGs in an island is calculated as described in
+  Gardiner-Garden, M. and M. Frommer, 1987
+  CpG islands in vertebrate genomes. J. Mol. Biol. 196:261-282.
+
+  Expected = (Number of C's * Number of G's) / Length
+
+Instead of a sliding-window search for CpG islands, this cpg program
+uses a running-sum score where a 'C' followed by a 'G' increases the
+score by 17 and anything else decreases the score by 1. When the
+score transitions from positive to 0 (and at the end of the sequence),
+the sequence in the current span is evaluated to see if it qualifies
+as a CpG island (>200 bp length, >50% GC, >0.6 ratio of observed CpG
+to expected). Then the search recurses on the span from the position
+with the max running score up to the current position.
+
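The scoring rule and the qualification test described in the README can be sketched in C roughly as follows. This is an illustration only, not code taken from cpg_lh.c: the names SpanStats, span_stats, expected_cpg, qualifies_as_island and score_step are hypothetical. It also assumes upper-case input, that masked 'N' bases count toward the span length, and that the running score is clamped at 0 (the README only says the score "transitions from positive to 0"). The recursion on the span after the maximum score is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; none of this is copied from cpg_lh.c. */

#define CPG_BONUS 17  /* score gain for a 'C' followed by a 'G' (README) */
#define STEP_COST 1   /* score loss for any other position (README) */

typedef struct {
    size_t length;  /* bases in the span, including any masked 'N' */
    size_t c;       /* number of C's */
    size_t g;       /* number of G's */
    size_t cpg;     /* observed CpG dinucleotides */
} SpanStats;

/* Count C's, G's and CpG dinucleotides in seq[0..len). */
static SpanStats span_stats(const char *seq, size_t len)
{
    SpanStats s = { len, 0, 0, 0 };
    assert(seq != NULL);
    for (size_t i = 0; i < len; i++) {
        if (seq[i] == 'C') {
            s.c++;
            if (i + 1 < len && seq[i + 1] == 'G')
                s.cpg++;
        } else if (seq[i] == 'G') {
            s.g++;
        }
    }
    return s;
}

/* Expected = (Number of C's * Number of G's) / Length */
static double expected_cpg(const SpanStats *s)
{
    if (s->length == 0)
        return 0.0;
    return ((double)s->c * (double)s->g) / (double)s->length;
}

/* README thresholds: >200 bp, >50% GC, observed/expected CpG > 0.6. */
static int qualifies_as_island(const SpanStats *s)
{
    double expected = expected_cpg(s);
    if (s->length <= 200 || expected <= 0.0)
        return 0;
    double gc_fraction = (double)(s->c + s->g) / (double)s->length;
    double obs_over_exp = (double)s->cpg / expected;
    return gc_fraction > 0.5 && obs_over_exp > 0.6;
}

/* One step of the running-sum score at position i.
   Assumption: the score never drops below 0. */
static long score_step(long score, const char *seq, size_t len, size_t i)
{
    if (seq[i] == 'C' && i + 1 < len && seq[i + 1] == 'G')
        score += CPG_BONUS;
    else
        score -= STEP_COST;
    return score < 0 ? 0 : score;
}
```

In this sketch a caller would advance score_step along the sequence, and each time the score returns to 0 (or the sequence ends) pass the current span to span_stats and qualifies_as_island. The +17/-1 weights mean a single CpG keeps a span alive across the next 17 bases (the G and 16 more) without another CpG.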