6974f14d41a28ed1198110de47946370cfc2bb73
angie
Fri Nov 22 10:29:44 2019 -0800

dbSnp153: When different freq alleles can be expanded by different amounts, pad out the smaller range & alleles for consistency. refs #23283

One variant (rs782394990) was dropped because of that quirk, but now no variants are dropped. :) A few ucscNotes increased by 1 or 2. Some of the counts of various ucscNotes were a little out of date due to recent work on adding freq notes instead of dropping; up to date now.

I was all proud of this but it's a dead end.

One b153 variant, rs782394990, errored out because its freq alleles couldn't be expanded by the same amount. The insertions of pure A's could be expanded to a larger range than an ins that included a T in addition to A's. So I seized upon the problem of independently trimmed and expanded SPDIs resulting in inconsistent dels, and set about trimming by list instead of trimming each SPDI separately. I would restrict the expansion to the minimum range by which any allele could be shifted. That would fix the problem of independent dels for rs782394990.

However, there was a problem. A variant that includes both an ins and a del that could be expanded to the same range, e.g. rs201454468 with alleles ref=T, alt="", alt=TT, cannot be trimmed as a list because it already has "". T/"" is already minimal. However, T/TT is not already minimal and needs to be minimized before expansion. So no expansion was performed, and we ended up with inconsistent alleles between different sources because one was expanded (it had only the del) and the other wasn't. No good.

I think that's why, although I was trimming listwise already, I did another trim just before trying to expand. Is there even a point in trimming listwise?

So... back to individual trimming of each SPDI. And individual expansion. But what do we do about the 1 in 680M case of rs782394990? I think we should detect inconsistent dels -- and then pad out all dels and inses to the maximal range.
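(Editorial illustration, not the code in this commit.) The padding idea can be sketched roughly as follows, assuming each frequency allele is represented by a 0-based start plus its deleted and inserted sequences. The names `Allele` and `pad_to_common_range` and the toy reference string are hypothetical, chosen only for this example.

```python
from typing import List, NamedTuple


class Allele(NamedTuple):
    """Hypothetical SPDI-like record: 0-based start, deleted seq, inserted seq."""
    start: int
    deleted: str
    inserted: str


def pad_to_common_range(ref: str, alleles: List[Allele]) -> List[Allele]:
    """Pad every allele with reference bases so that all share the maximal range."""
    # Maximal range covered by any allele's deleted sequence.
    lo = min(a.start for a in alleles)
    hi = max(a.start + len(a.deleted) for a in alleles)
    padded = []
    for a in alleles:
        end = a.start + len(a.deleted)
        left = ref[lo:a.start]   # reference bases missing on the left
        right = ref[end:hi]      # reference bases missing on the right
        padded.append(Allele(lo, left + a.deleted + right,
                             left + a.inserted + right))
    return padded


# Toy data (assumed, not taken from dbSNP): one allele spans the whole A run,
# the other covers a smaller range because its insertion contains a T.
ref = "CGAAAATC"
alleles = [Allele(2, "AAAA", "AAAAA"), Allele(4, "AA", "AATA")]
padded = pad_to_common_range(ref, alleles)
```

After padding, every allele in this toy case has the same start and the same deleted sequence, which is the consistency the message is after.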
So for the allele that could be shifted 2 bases less than the others because it included a T, just pad its del and ins with genomic bases.

OK, now it works.

Squash branch into master...

diff --git src/hg/makeDb/trackDb/human/dbSnp153Composite.html src/hg/makeDb/trackDb/human/dbSnp153Composite.html
index 5f622ea..8699190 100644
--- src/hg/makeDb/trackDb/human/dbSnp153Composite.html
+++ src/hg/makeDb/trackDb/human/dbSnp153Composite.html
@@ -188,225 +188,225 @@ <a name="ucscNotes">
 <h3>Interesting and anomalous conditions noted by UCSC</h3>
 <p>
 While processing the information downloaded from dbSNP, UCSC annotates some properties of interest. These are noted on the item details page, and may be useful to include or exclude affected variants.
 </p><p>
 Some are purely informational:
 </p>
 <table class="descTbl">
 <tr><th>keyword in data file (dbSnp153.bb)</th>
 <th># in hg19</th><th># in hg38</th><th>description</th></tr>
 <tr>
 <td>clinvar</td>
- <td class="number">454674</td>
- <td class="number">453990</td>
+ <td class="number">454678</td>
+ <td class="number">453996</td>
 <td>Variant is in ClinVar.</td>
 </tr>
 <tr>
 <td>clinvarBenign</td>
- <td class="number">143860</td>
- <td class="number">143730</td>
+ <td class="number">143864</td>
+ <td class="number">143736</td>
 <td>Variant is in ClinVar with clinical significance of benign and/or likely benign.</td>
 </tr>
 <tr>
 <td>clinvarConflicting</td>
 <td class="number">7932</td>
 <td class="number">7950</td>
 <td>Variant is in ClinVar with reports of both benign and pathogenic significance.</td>
 </tr>
 <tr>
 <td>clinvarPathogenic</td>
 <td class="number">96242</td>
 <td class="number">95262</td>
 <td>Variant is in ClinVar with clinical significance of pathogenic and/or likely pathogenic.</td>
 </tr>
 <tr>
 <td>commonAll</td>
- <td class="number">12184226</td>
- <td class="number">12438325</td>
+ <td class="number">12184521</td>
+ <td class="number">12438655</td>
 <td>Variant is "common", i.e.
 has a Minor Allele Frequency of at least 1% in all projects reporting frequencies.</td>
 </tr>
 <tr>
 <td>commonSome</td>
- <td class="number">20540882</td>
- <td class="number">20902602</td>
+ <td class="number">20541190</td>
+ <td class="number">20902944</td>
 <td>Variant is "common", i.e. has a Minor Allele Frequency of at least 1% in some, but not all, projects reporting frequencies.</td>
 </tr>
 <tr>
 <td>diffMajor</td>
- <td class="number">1377817</td>
- <td class="number">1399094</td>
+ <td class="number">1377831</td>
+ <td class="number">1399109</td>
 <td>Different frequency sources have different major alleles.</td>
 </tr>
 <tr>
 <td>overlapDiffClass</td>
- <td class="number">107003090</td>
- <td class="number">109991096</td>
+ <td class="number">107015341</td>
+ <td class="number">110007682</td>
 <td>This variant overlaps another variant with a different type/class.</td>
 </tr>
 <tr>
 <td>overlapSameClass</td>
- <td class="number">16910407</td>
- <td class="number">17281744</td>
+ <td class="number">16915239</td>
+ <td class="number">17291289</td>
 <td>This variant overlaps another with the same type/class but different start/end.</td>
 </tr>
 <tr>
 <td>rareAll</td>
- <td class="number">662595470</td>
- <td class="number">681685476</td>
+ <td class="number">662601770</td>
+ <td class="number">681696398</td>
 <td>Variant is "rare", i.e. has a Minor Allele Frequency of less than 1% in all projects reporting frequencies, or has no frequency data.</td>
 </tr>
 <tr>
 <td>rareSome</td>
- <td class="number">670952126</td>
- <td class="number">690149753</td>
+ <td class="number">670958439</td>
+ <td class="number">690160687</td>
 <td>Variant is "rare", i.e.
 has a Minor Allele Frequency of less than 1% in some, but not all, projects reporting frequencies, or has no frequency data.</td>
 </tr>
 <tr>
 <td>revStrand</td>
- <td class="number">3813467</td>
- <td class="number">4532270</td>
+ <td class="number">3813702</td>
+ <td class="number">4532511</td>
 <td>Alleles are displayed on the + strand at the current position. dbSNP's alleles are displayed on the + strand of a different assembly sequence, so dbSNP's variant page shows alleles that are reverse-complemented with respect to the alleles displayed above.</td>
 </tr>
 </table>
 <p>
 while others may indicate that the reference genome contains a rare variant or sequencing issue:
 </p>
 <table class="descTbl">
 <tr><th>keyword in data file (dbSnp153.bb)</th>
 <th># in hg19</th><th># in hg38</th><th>description</th></tr>
 <tr>
 <td>refIsAmbiguous</td>
 <td class="number">101</td>
 <td class="number">111</td>
 <td>The reference genome allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G', or 'N' for 'any base').</td>
 </tr>
 <tr>
 <td>refIsMinor</td>
- <td class="number">3271878</td>
- <td class="number">3360159</td>
+ <td class="number">3272116</td>
+ <td class="number">3360435</td>
 <td>The reference genome allele is not the major allele in at least one project.</td>
 </tr>
 <tr>
 <td>refIsRare</td>
- <td class="number">136452</td>
- <td class="number">160723</td>
+ <td class="number">136547</td>
+ <td class="number">160827</td>
 <td>The reference genome allele is rare (i.e. allele frequency < 1%).</td>
 </tr>
 <tr>
 <td>refIsSingleton</td>
- <td class="number">37783</td>
- <td class="number">50865</td>
+ <td class="number">37832</td>
+ <td class="number">50927</td>
 <td>The reference genome allele has never been observed in a population sequencing project reporting frequencies.</td>
 </tr>
 <tr>
 <td>refMismatch</td>
 <td class="number">4</td>
 <td class="number">33</td>
 <td>The reference genome allele reported by dbSNP differs from the GenBank assembly sequence.
 This is very rare and in all cases observed so far, the GenBank assembly has an 'N' while the RefSeq assembly used by dbSNP has a less ambiguous character such as 'R'.</td>
 </tr>
 </table>
 <p>
 and others may indicate an anomaly or problem with the variant data:
 </p>
 <table class="descTbl">
 <tr><th>keyword in data file (dbSnp153.bb)</th>
 <th># in hg19</th><th># in hg38</th><th>description</th></tr>
 <tr>
 <td>altIsAmbiguous</td>
- <td class="number">10754</td>
- <td class="number">10880</td>
+ <td class="number">10755</td>
+ <td class="number">10888</td>
 <td>At least one alternate allele contains an IUPAC ambiguous base (e.g. 'R' for 'A or G'). For alleles containing more than one ambiguous base, this may create a combinatoric explosion of possible alleles.</td>
 </tr>
 <tr>
 <td>classMismatch</td>
- <td class="number">5995</td>
- <td class="number">6206</td>
+ <td class="number">5998</td>
+ <td class="number">6216</td>
 <td>Variation class/type is inconsistent with alleles mapped to this genome assembly.</td>
 </tr>
 <tr>
 <td>clusterError</td>
- <td class="number">114685</td>
- <td class="number">128109</td>
+ <td class="number">114826</td>
+ <td class="number">128306</td>
 <td>This variant has the same start, end and class as another variant; they probably should have been merged into one variant.</td>
 </tr>
 <tr>
 <td>freqIncomplete</td>
 <td class="number">3922</td>
 <td class="number">4673</td>
 <td>At least one project reported counts for only one allele which implies that at least one allele is missing from the report; that project's frequency data are ignored.</td>
 </tr>
 <tr>
 <td>freqIsAmbiguous</td>
 <td class="number">7656</td>
 <td class="number">7756</td>
 <td>At least one allele reported by at least one project that reports frequencies contains an IUPAC ambiguous base.</td>
 </tr>
 <tr>
 <td>freqNotMapped</td>
 <td class="number">2685</td>
 <td class="number">6590</td>
 <td>At least one project reported allele frequencies relative to a different assembly; However, dbSNP does
 not include a mapping of this variant to that assembly, which implies a problem with mapping the variant across assemblies. The mapping on this assembly may have an issue; evaluate carefully vs. original submissions, which you can view by clicking through to dbSNP above.</td>
 </tr>
 <tr>
 <td>freqNotRefAlt</td>
- <td class="number">17684</td>
- <td class="number">32150</td>
+ <td class="number">17694</td>
+ <td class="number">32170</td>
 <td>At least one allele reported by at least one project that reports frequencies does not match any of the reference or alternate alleles listed by dbSNP.</td>
 </tr>
 <tr>
 <td>multiMap</td>
- <td class="number">562157</td>
- <td class="number">132051</td>
+ <td class="number">562180</td>
+ <td class="number">132123</td>
 <td>This variant has been mapped to more than one distinct genomic location.</td>
 </tr>
 <tr>
 <td>otherMapErr</td>
- <td class="number">113416</td>
- <td class="number">203580</td>
+ <td class="number">114095</td>
+ <td class="number">204219</td>
 <td>At least one other mapping of this variant has erroneous coordinates. The mapping(s) with erroneous coordinates are excluded from this track and are included in the Map Err subtrack. Sometimes despite this mapping having legal coordinates, there may still be an issue with this mapping's coordinates and alleles; you may want to click through to dbSNP to compare the initial submission's coordinates and alleles.
- In hg19, 55453 distinct rs IDs are affected; in hg38, 86636.
+ In hg19, 55454 distinct rs IDs are affected; in hg38, 86636.
 </tr>
 </table>
 <h2>Data Sources and Methods</h2>
 <p>
 dbSNP has collected genetic variant reports from researchers worldwide for
 <a href="https://ncbiinsights.ncbi.nlm.nih.gov/2019/10/07/dbsnp-celebrates-20-years/" target=_blank>over 20 years</a>.
Since the advent of next-generation sequencing methods and the population sequencing efforts that they enable, dbSNP has grown exponentially, requiring a new data schema, computational pipeline, web infrastructure, and download files. (Holmes <em>et al.</em>) The same challenges of exponential growth affected UCSC's presentation of dbSNP variants, so we have taken the opportunity to change our internal representation and import pipeline.