8c1ebdae695b1bb40e45f41aff4a4f9fb9e52491 max Thu May 28 05:43:13 2026 -0700

[Claude] add EVE missense variant effect heatmap track for hg38
#Preview2 week - bugs introduced now will need a build patch to fix

Heatmap bigBed track showing EVE scores for all possible missense substitutions in 2,949 disease-associated proteins. One entry per protein; columns = amino acid positions at genomic codon coordinates, rows = 20 standard amino acids. Colors blue (benign) to red (pathogenic). Includes conversion script, autoSql, trackDb, HTML doc, and makedoc. Added to predictionScoresSuper.

refs #31804

Co-Authored-By: Claude Sonnet 4.6

diff --git src/hg/makeDb/trackDb/human/eve.html src/hg/makeDb/trackDb/human/eve.html
new file mode 100644
index 00000000000..1ae02d9f474
--- /dev/null
+++ src/hg/makeDb/trackDb/human/eve.html
@@ -0,0 +1,122 @@
+

Description

+Missense mutations change a single amino acid in a protein and make up a large fraction +of the variants observed in human populations. For most of them, the clinical significance +is unknown, both because clinical data are limited and because the relationship between +sequence change and protein function is complex. Computational approaches that model +evolutionary conservation across species can estimate how much a given position tolerates +change, since residues critical to function tend to be highly conserved. EVE (Evolutionary +model of Variant Effect) formalizes this reasoning with a deep generative model trained +entirely on natural sequence variation, without any reliance on clinical labels. This track +shows EVE scores for all possible missense substitutions in 2,949 disease-associated +proteins. +

+ +

Display Conventions

+Each entry spans one protein at its genomic locus. The heatmap columns correspond to +individual amino acid positions in the protein, each placed at the genomic coordinates of +its codon. The rows correspond to the 20 standard amino acids (A–Y, in alphabetical order +of their one-letter codes). Each cell shows the EVE score for substituting the wildtype +amino acid at that position with the row amino acid. Empty cells mark either the wildtype +amino acid at that position (no substitution) or a substitution for which no score is +available. +
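
The layout described above can be sketched in Python as a small grid builder. This is an illustration only, not the track's conversion script: the function name `build_grid`, the `scores` dictionary keyed by (position, variant amino acid), and the choice of putting A in the first row are all assumptions made for the example.

```python
# One-letter codes of the 20 standard amino acids, alphabetical (A to Y).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_grid(wildtype_seq, scores):
    """Return a 20 x len(wildtype_seq) grid of EVE scores.

    wildtype_seq: protein sequence as a string of one-letter codes.
    scores: dict mapping (1-based position, variant amino acid) to a score.
    A cell is None when the row amino acid is the wildtype residue at that
    position, or when no score is available for the substitution.
    """
    grid = []
    for row_aa in AMINO_ACIDS:
        row = []
        for position, wildtype_aa in enumerate(wildtype_seq, start=1):
            if row_aa == wildtype_aa:
                row.append(None)  # no substitution: empty cell
            else:
                row.append(scores.get((position, row_aa)))  # None if missing
        grid.append(row)
    return grid
```

For a hypothetical two-residue protein `MK`, `build_grid("MK", {(1, "A"): 0.1})` returns a grid whose M row is empty in column 1, whose K row is empty in column 2, and whose A row holds 0.1 in column 1.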

+ +

+Cells are colored on a gradient: +

+ +EVE ≈ 0 — benign: the substitution is well-tolerated in evolutionary sequence variation
+ +EVE ≈ 0.5 — uncertain: the model cannot confidently classify the variant
+ +EVE ≈ 1 — pathogenic: the substitution is rare or absent across evolutionary sequences +

+ +

+Hovering over a cell shows the wildtype amino acid, the protein position, the variant amino +acid, the EVE score, and the Class25 classification (benign, uncertain, or pathogenic using +a 25% uncertainty threshold). +

+ +

+For reverse-strand genes, protein positions are displayed left to right in genomic +order (C-terminus to N-terminus on the screen), consistent with the standard genome +browser orientation. +
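
The orientation rule can be illustrated with a short Python sketch. The coordinates below are invented, the helper name `genomic_display_order` is hypothetical, and the example ignores codons that are split across exons; it only shows that sorting codons by genomic start reverses the protein order for a reverse-strand gene.

```python
def genomic_display_order(codon_starts):
    """Return 1-based protein positions in left-to-right genomic order.

    codon_starts[i] is the genomic start of the codon for protein
    position i + 1.
    """
    indexed = [(start, pos) for pos, start in enumerate(codon_starts, start=1)]
    return [pos for _, pos in sorted(indexed)]


# Hypothetical three-codon gene on the reverse strand: protein position 1
# has the highest genomic coordinate, so it is drawn rightmost.
minus_strand_starts = [1006, 1003, 1000]
order = genomic_display_order(minus_strand_starts)
```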

+ +

Methods

+EVE trains a Bayesian variational autoencoder (VAE) separately for each of 3,219 +disease-associated human proteins. For each protein, a multiple sequence alignment is +retrieved by searching roughly 250 million protein sequences from UniRef, and the VAE +learns the distribution of amino acid sequences across species, capturing both +per-position conservation and co-evolutionary dependencies between positions. An +evolutionary index for each single amino acid variant is then computed as the +approximate negative log-likelihood ratio of the variant versus the wildtype sequence, +estimated by sampling from the VAE posterior (ensembled over five independently trained +models). A global-local mixture of Gaussian mixture models, fit to the index distributions +across all variants and all proteins, converts this continuous index to an EVE score +between 0 (benign) and 1 (pathogenic) and assigns each variant to a benign, uncertain, or +pathogenic class. The uncertainty of each classification reflects the predictive entropy of +the mixture model, and a threshold on this entropy controls what fraction of variants is +labeled uncertain. The Class25 field used in the track mouseovers classifies variants using +a 25% uncertainty threshold, which the authors report yields approximately 90% accuracy on +known ClinVar labels. See Frazer et al. 2021 for full details. +
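
As a rough illustration of the last two steps, the following Python sketch computes an evolutionary index from two log-likelihoods and assigns classes from EVE scores. It is not the authors' code. It assumes that the index is the wildtype log-likelihood minus the variant log-likelihood, that the entropy is the binary entropy of the EVE score, that a 25% threshold means the quarter of variants with the highest entropy is labeled uncertain, and that the remaining variants are split at a score of 0.5. The function names and label strings are hypothetical; see Frazer et al. 2021 for the actual procedure.

```python
import math


def evolutionary_index(log_p_variant, log_p_wildtype):
    """Negative log-likelihood ratio of variant versus wildtype."""
    return -(log_p_variant - log_p_wildtype)


def binary_entropy(p):
    """Entropy (in nats) of a two-class prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def classify(scores, uncertain_fraction=0.25):
    """Label each score, marking the highest-entropy fraction as uncertain."""
    entropies = [binary_entropy(s) for s in scores]
    n_uncertain = int(len(scores) * uncertain_fraction)
    # Indices of variants ordered from most to least uncertain.
    by_entropy = sorted(range(len(scores)), key=lambda i: entropies[i], reverse=True)
    uncertain = set(by_entropy[:n_uncertain])
    labels = []
    for i, score in enumerate(scores):
        if i in uncertain:
            labels.append("Uncertain")
        elif score > 0.5:  # assumed split point for the confident variants
            labels.append("Pathogenic")
        else:
            labels.append("Benign")
    return labels
```

With four hypothetical scores `[0.02, 0.45, 0.97, 0.10]`, the single most uncertain variant is the one scored 0.45, so it is the only one labeled uncertain.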

+ +

+The data were downloaded as a bulk archive from +https://evemodel.org/download/bulk. +The archive contains one VCF file per protein, each listing every possible missense +substitution with its EVE score and per-threshold classifications. Where several VCF +records encode different codon changes for the same amino acid substitution, they carry +identical EVE scores and were collapsed to one record per substitution. Records were +converted to heatmap bigBed format using a custom Python script; full processing +instructions are in the makedoc file, and the conversion script is available in +our GitHub repository. +Two proteins (G6PT1, UniProt O43826; and MAFIP, UniProt Q8WZ33) were excluded because their +VCF coordinates mapped to assembly scaffolds (chrCHR_HG2217_PATCH and chrGL000194.1) +absent from the standard hg38 assembly. The remaining 2,949 proteins, covering approximately +1.7 million amino acid positions, are included in this track. +
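
The deduplication step can be sketched as follows. This is a minimal illustration rather than the conversion script itself: VCF parsing is omitted, and the record layout (wildtype amino acid, protein position, variant amino acid, score) and the function name `dedupe_substitutions` are assumptions made for the example.

```python
def dedupe_substitutions(records):
    """Collapse records to one per amino acid substitution.

    records: iterable of (wildtype_aa, position, variant_aa, eve_score).
    Records that differ only in the underlying codon change share a key and
    are expected to carry the same score; a mismatch raises ValueError.
    """
    seen = {}
    for wildtype_aa, position, variant_aa, score in records:
        key = (wildtype_aa, position, variant_aa)
        if key in seen:
            if seen[key] != score:
                raise ValueError(f"conflicting scores for {key}")
            continue  # duplicate substitution from another codon change
        seen[key] = score
    # Dicts preserve insertion order, so first occurrences keep their order.
    return [key + (score,) for key, score in seen.items()]
```

Checking that duplicates agree before dropping them is a design choice of this sketch; it guards against silently discarding a record whose score differs.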

+ +

Data Access

The data can be explored interactively in table format with the +Table Browser or the +Data Integrator and exported from there as +spreadsheets or tab-separated tables. From scripts, the data can be accessed through our +API using track=eve.
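
A script might query the API along these lines. This sketch assumes the track is served by the public REST API at api.genome.ucsc.edu through its `/getData/track` endpoint under the track name given above; the helper names are hypothetical, and the region is the same chr17 range used in the bigBedToBed example on this page.

```python
import json
import urllib.request

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(genome, track, chrom, start, end):
    """Build a getData/track request URL with semicolon-separated parameters."""
    params = [
        ("genome", genome),
        ("track", track),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    return API_BASE + "?" + ";".join(f"{key}={value}" for key, value in params)


def fetch_track(genome, track, chrom, start, end):
    """Fetch and decode the JSON reply (requires network access)."""
    url = build_track_url(genome, track, chrom, start, end)
    with urllib.request.urlopen(url) as reply:
        return json.load(reply)


# Only the URL is built here; call fetch_track(...) to actually download data.
url = build_track_url("hg38", "eve", "chr17", 43000000, 43200000)
```

Calling `fetch_track("hg38", "eve", "chr17", 43000000, 43200000)` should return the decoded JSON reply, provided the track is available on the API server.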

For automated download and analysis, the genome annotation is stored in a bigBed file +that can be downloaded from +our download +server. The file for this track is called eve.bb. Individual regions or the +whole genome annotation can be obtained using our tool bigBedToBed, which can be +compiled from the source code or downloaded as a precompiled binary for your system. +Instructions for downloading source code and binaries can be found +here. +The tool can also be used to obtain features within a given range, e.g. +bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/eve/eve.bb -chrom=chr17 +-start=43000000 -end=43200000 stdout

The original annotation source data can be downloaded from +https://evemodel.org/download/bulk.

+ +

Credits

+Thanks to Jonathan Frazer, Pascal Notin, Mafalda Dias, and Debora S. Marks at Harvard +Medical School and Yarin Gal at the University of Oxford for making the EVE scores +publicly available at +evemodel.org. +

+ +

References

+Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, Gal Y, Marks DS. + +Disease variant prediction with deep generative models of evolutionary data. +Nature. 2021 Nov;599(7883):91-95. +PMID: 34707284 +