b95ff3835509b242bd9007ba55c4f60c1022da47 markd Thu Dec 30 14:58:16 2010 -0800
moved to programs to hg/utils, fixed build of distributed utilities

diff --git src/hg/overlapSelect/todo.txt src/hg/overlapSelect/todo.txt
deleted file mode 100644
index 2eec8fb..0000000
--- src/hg/overlapSelect/todo.txt
+++ /dev/null
@@ -1,45 +0,0 @@
-
-- add correlation coefficient as a criteria:
-  Adam Siepel <acs@soe.ucsc.edu> 2005/03/02
-  On another front, it occurred to me that the correlation coefficient
-  sometimes used in gene prediction stats could be another useful thing
-  to report. For each inFile record, you could give a correlation
-  coefficient based on all overlapping selectFile records. This would
-  give you one number saying something about both directions of coverage
-  and about the degree of "consistency" we were talking about the other
-  day. For example, you could project the intronEsts into a bed of
-  nonoverlapping features using featureBits, then run overlapSelect -cc
-  (or similar), to get a cc number for each selectFile, which could then
-  go in a database like the one I'm building. I think, when computing
-  the cc, you might want to limit yourself to the range of the inFile
-  record. That would make sense for my application at least, where my
-  predictions are fragments. In other cases, you might want to compute
-  the cc for the smallest interval including both the inFile record and
-  all overlapping selectFile records.
-
-  It looks to me like the number you'd compute for a given interval would
-  be
-      cc = (cN - ab) / sqrt(ab(N-a)(N-b))
-
-  where a is the number of "bits" (e.g., bases in exons) in the inFile
-  record, b is the total number of bits in all overlapping selectFile
-  records (within the interval), c is the number of bits in both the
-  inFile and the selectFile records, and N is the length of the interval.
-
-  For example, if you had a 1000 base interval with 100 bases within
-  predicted exons, 150 bases of supporting EST evidence, and an overlap
-  of 90 bases, then N = 1000, a = 100, b = 150, c = 90, and cc = 0.70.
-
-  This number is defined as long as 0 < a,b < N. It will always be true
-  that a > 0 (otherwise you don't have an inFile record). If a > 0 and b
-  = 0, then you'd have 0/0 but you could just report 0. If a = N and b
-  <= N (also possible), then c = b and you'd also have 0/0. You could
-  report b/N in this case. The symmetric thing could be done if b = N
-  and a <= N.
-
-  I suppose an alternative would be to report a, b, c, and N for each
-  inFile record. Then the cc or some alternative could easily be
-  computed with an awk script.
-
-- add featureBits type of feature specifications (e.g. :intron)
-
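
The formula and the 0/0 fallbacks proposed in the deleted note can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from overlapSelect: the function name `correlation_coefficient` is hypothetical, the consistency check on c is an addition, and returning a/N when b = N is one reading of "the symmetric thing" in the note.

```python
import math


def correlation_coefficient(a, b, c, n):
    """Return cc = (cN - ab) / sqrt(ab(N-a)(N-b)) for one interval.

    a: bits (e.g. exon bases) in the inFile record
    b: bits in all overlapping selectFile records within the interval
    c: bits present in both
    n: length of the interval (N in the note)
    """
    # Assumed sanity check (not in the note): the overlap c cannot exceed
    # either set and cannot be smaller than inclusion-exclusion allows.
    if not (0 < a <= n and 0 <= b <= n):
        raise ValueError("need 0 < a <= N and 0 <= b <= N")
    if not (max(0, a + b - n) <= c <= min(a, b)):
        raise ValueError("c is inconsistent with a, b and N")

    # Degenerate 0/0 cases, using the fallbacks suggested in the note.
    if b == 0:
        return 0.0
    if a == n:
        return b / n
    if b == n:
        # Assumed interpretation of "the symmetric thing".
        return a / n

    return (c * n - a * b) / math.sqrt(a * b * (n - a) * (n - b))


if __name__ == "__main__":
    # Worked example from the note: N = 1000, a = 100, b = 150, c = 90.
    print(round(correlation_coefficient(100, 150, 90, 1000), 2))
```

For the worked example in the note (N = 1000, a = 100, b = 150, c = 90) this evaluates to roughly 0.70, matching the figure given there. Reporting a, b, c, and N per record, as the note also suggests, would let the same arithmetic be done downstream instead.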