8c1ebdae695b1bb40e45f41aff4a4f9fb9e52491 max Thu May 28 05:43:13 2026 -0700
[Claude] add EVE missense variant effect heatmap track for hg38
#Preview2 week - bugs introduced now will need a build patch to fix

Heatmap bigBed track showing EVE scores for all possible missense
substitutions in 2,949 disease-associated proteins. One entry per protein;
columns = amino acid positions at genomic codon coordinates, rows = 20
standard amino acids. Colors blue (benign) to red (pathogenic). Includes
conversion script, autoSql, trackDb, HTML doc, and makedoc. Added to
predictionScoresSuper.

refs #31804

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

diff --git src/hg/makeDb/trackDb/human/eve.html src/hg/makeDb/trackDb/human/eve.html
new file mode 100644
index 00000000000..1ae02d9f474
--- /dev/null
+++ src/hg/makeDb/trackDb/human/eve.html
@@ -0,0 +1,122 @@
+<h2>Description</h2>
+<p>
+Missense mutations change a single amino acid in a protein and make up a large fraction
+of the variants observed in human populations. Most have no known clinical significance,
+both because clinical data are limited and because the relationship between sequence change
+and protein function is complex. Computational approaches that model evolutionary
+conservation across species can estimate how much a given position tolerates change, since
+residues critical to function tend to be highly conserved. EVE (Evolutionary model of
+Variant Effect) formalizes this reasoning with a deep generative model trained entirely on
+natural sequence variation, without any reliance on clinical labels. This track shows EVE
+scores for all possible missense substitutions in 2,949 disease-associated proteins.
+</p>
+
+<h2>Display Conventions</h2>
+<p>
+Each entry spans one protein at its genomic locus. The heatmap columns correspond to
+individual amino acid positions in the protein, placed at the codon's genomic coordinate.
+The rows correspond to the 20 standard amino acids (A–Y, alphabetical). Each cell
+shows the EVE score for substituting the wildtype amino acid at that position with the row
+amino acid. Empty cells indicate the wildtype amino acid at a given position (no
+substitution) or positions for which no score is available.
+</p>
+
+<p>
+Cells are colored on a gradient:
+</p>
+<p>
+<span style="display:inline-block; background-color:#2166ac; width:18px; height:12px; vertical-align:middle;"></span>
+<b>EVE ≈ 0 — benign:</b> the substitution is well-tolerated in evolutionary sequence variation<br>
+<span style="display:inline-block; background-color:#f7f7f7; width:18px; height:12px; vertical-align:middle; border:1px solid #ccc;"></span>
+<b>EVE ≈ 0.5 — uncertain:</b> the model cannot confidently classify the variant<br>
+<span style="display:inline-block; background-color:#d6604d; width:18px; height:12px; vertical-align:middle;"></span>
+<b>EVE ≈ 1 — pathogenic:</b> the substitution is rare or absent across evolutionary sequences
+</p>
+
+<p>
+Hovering over a cell shows the wildtype amino acid, the protein position, the variant amino
+acid, the EVE score, and the Class25 classification (benign, uncertain, or pathogenic using
+a 25% uncertainty threshold).
+</p>
+
+<p>
+For reverse-strand genes, protein positions are displayed left to right in genomic
+order (C-terminus to N-terminus on the screen), consistent with the standard genome
+browser orientation.
+</p>
+
+<h2>Methods</h2>
+<p>
+EVE trains a Bayesian variational autoencoder (VAE) separately for each of 3,219
+disease-associated human proteins. For each protein, a multiple sequence alignment is
+retrieved by searching roughly 250 million protein sequences from UniRef, and the VAE
+learns the distribution of amino acid sequences across species, capturing both
+per-position conservation and co-evolutionary dependencies between positions. An
+evolutionary index for each single amino acid variant is then computed as the
+approximate negative log-likelihood ratio of the variant versus the wildtype sequence,
+estimated by sampling from the VAE posterior (ensembled over five independently trained
+models). A global-local mixture of Gaussian mixture models, fit to the index distributions
+across all variants and all proteins, converts this continuous index to an EVE score
+between 0 (benign) and 1 (pathogenic) and assigns each variant to a benign, uncertain, or
+pathogenic class. The uncertainty of each classification reflects the predictive entropy of
+the mixture model, and a threshold on this entropy controls what fraction of variants is
+labeled uncertain. The Class25 field used in the track mouseovers classifies variants using
+a 25% uncertainty threshold, which the authors report yields approximately 90% accuracy on
+known ClinVar labels. See Frazer et al. 2021 for full details.
+</p>
+
+<p>
+The data were downloaded as a bulk archive from
+<a href="https://evemodel.org/download/bulk" target="_blank">https://evemodel.org/download/bulk</a>.
+The archive contains one VCF file per protein, each listing every possible missense
+substitution with its EVE score and per-threshold classifications. Multiple VCF records
+encoding different codon changes for the same amino acid substitution carry identical EVE
+scores and were deduplicated to one record per substitution. Records were converted to
+heatmap bigBed format using a custom Python script; full processing instructions are in the
+<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/eve.txt"
+target="_blank">makedoc file</a>, and the conversion script is available in
+<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/eve"
+target="_blank">our GitHub repository</a>.
+Two proteins (G6PT1, UniProt O43826; and MAFIP, Q8WZ33) were excluded because their
+VCF coordinates mapped to assembly scaffolds (chrCHR_HG2217_PATCH and chrGL000194.1)
+absent from the standard hg38 assembly. The remaining 2,949 proteins covering approximately
+1.7 million amino acid positions are included in this track.
+</p>
+
+<h2>Data Access</h2>
+<p>The data can be explored interactively in table format with the
+<a href="../cgi-bin/hgTables">Table Browser</a> or the
+<a href="../cgi-bin/hgIntegrator">Data Integrator</a> and exported from there to
+spreadsheet or tab-separated tables. From scripts, the data can be accessed through our
+<a href="https://api.genome.ucsc.edu">API</a>, track=<i>eve</i>.</p>
+<p>For automated download and analysis, the genome annotation is stored in a bigBed file
+that can be downloaded from
+<a href="http://hgdownload.soe.ucsc.edu/gbdb/hg38/eve" target="_blank">our download
+server</a>. The file for this track is called <tt>eve.bb</tt>. Individual regions or the
+whole genome annotation can be obtained using our tool <tt>bigBedToBed</tt>, which can be
+compiled from the source code or downloaded as a precompiled binary for your system.
+Instructions for downloading source code and binaries can be found
+<a href="http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads">here</a>.
+The tool can also be used to obtain features within a given range, e.g.
+<tt>bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/eve/eve.bb -chrom=chr17
+-start=43000000 -end=43200000 stdout</tt></p>
+<p>The original annotation source data can be downloaded from
+<a href="https://evemodel.org/download/bulk" target="_blank">https://evemodel.org/download/bulk</a>.</p>
+
+<h2>Credits</h2>
+<p>
+Thanks to Jonathan Frazer, Pascal Notin, Mafalda Dias, and Debora S. Marks at Harvard
+Medical School and Yarin Gal at the University of Oxford for making the EVE scores
+publicly available at
+<a href="https://evemodel.org/" target="_blank">evemodel.org</a>.
+</p>
+
+<h2>References</h2>
+<p>
+Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, Gal Y, Marks DS.
+<a href="https://doi.org/10.1038/s41586-021-04043-8" target="_blank">
+Disease variant prediction with deep generative models of evolutionary data</a>.
+<em>Nature</em>. 2021 Nov;599(7883):91-95.
+PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/34707284" target="_blank">34707284</a>
+</p>
+
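
Note on the color legend in eve.html: the description lists three anchor colors (#2166ac at EVE ≈ 0, #f7f7f7 at EVE ≈ 0.5, #d6604d at EVE ≈ 1) but does not say how intermediate scores are colored. The sketch below shows one plausible mapping, assuming piecewise-linear interpolation in RGB between those anchors. The function name eve_color and the interpolation scheme are illustrative assumptions, not taken from the conversion script in this commit.

```python
# Hypothetical sketch: the anchor colors come from the legend in eve.html;
# linear RGB interpolation between them is an assumption, not the confirmed
# behavior of the track or of the conversion script.
ANCHORS = [
    (0.0, (0x21, 0x66, 0xAC)),  # benign, #2166ac
    (0.5, (0xF7, 0xF7, 0xF7)),  # uncertain, #f7f7f7
    (1.0, (0xD6, 0x60, 0x4D)),  # pathogenic, #d6604d
]


def eve_color(score: float) -> str:
    """Return a hex color for an EVE score in [0, 1] (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("EVE scores are expected to lie between 0 and 1")
    # Walk the two segments (0 to 0.5, 0.5 to 1) and interpolate within
    # the first one whose upper bound covers the score.
    for (lo, lo_rgb), (hi, hi_rgb) in zip(ANCHORS, ANCHORS[1:]):
        if score <= hi:
            t = (score - lo) / (hi - lo)
            rgb = [round(a + t * (b - a)) for a, b in zip(lo_rgb, hi_rgb)]
            return "#{:02x}{:02x}{:02x}".format(*rgb)
    raise AssertionError("unreachable for scores in [0, 1]")
```

Under these assumptions, eve_color(0.0), eve_color(0.5), and eve_color(1.0) return the three legend colors exactly, and scores in between blend toward the nearest anchors. Empty cells (wildtype or unscored positions) would need separate handling, since they carry no score.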