75640104127592366faa7478485e5a88f8144a0b mspeir Thu May 14 08:44:10 2026 -0700 fixing Miklem -> Micklem, refs #37539 diff --git src/utils/cpgIslandExt/README src/utils/cpgIslandExt/README index 0d985299d11..3f87f116fca 100644 --- src/utils/cpgIslandExt/README +++ src/utils/cpgIslandExt/README @@ -1,26 +1,26 @@ README (added at UCSC) "make" --> gcc readseq.c cpg_lh.c -o cpglh.exe We've been running this on hard-masked sequence (RepeatMasker and TRF with period <= 12 results are masked to 'N') in order to avoid CpG islands in Alu repeats in human. -The original cpg.c was written by Gos Miklem from the Sanger Center. +The original cpg.c was written by Gos Micklem from the Sanger Center. LaDeana Hillier added some modifications --> cpg_lh.c, and UCSC has added some further modifications to cpg_lh.c, so that its expected number of CpGs in an island is calculated as described in Gardiner-Garden, M. and M. Frommer, 1987 CpG islands in vertebrate genomes. J. Mol. Biol. 196:261-282. Expected = (Number of C's * Number of G's) / Length Instead of a sliding-window search for CpG islands, this cpg program uses a running-sum score where a 'C' followed by a 'G' increases the score by 17 and anything else decreases the score by 1. When the score transitions from positive to 0 (and at the end of the sequence), the sequence in the current span is evaluated to see if it qualifies as a CpG island (>200 bp length, >50% GC, >0.6 ratio of observed CpG to expected). Then the search recurses on the span from the position with the max running score up to the current position.