b7b279e5f1240419fb3a408fff5c82a998a36e76
max
Wed Jun 3 08:17:09 2026 -0700
EVE track QA fixes: fix .as field descriptions, clarify protein count in HTML
- thickStart/thickEnd .as descriptions corrected (whole protein span, not start/stop codon)
- reserved .as description corrected (always 0, not itemRgb)
- HTML Methods section: explain that 2,951 of 3,219 proteins have released VCF scores
Co-Authored-By: Claude Sonnet 4.6
Missense mutations change a single amino acid in a protein and make up a large fraction
of the variants observed in human populations. Most have no known clinical significance,
both because clinical data are limited and because the relationship between sequence change
and protein function is complex. Computational approaches that model evolutionary
conservation across species can estimate how much a given position tolerates change, since
residues critical to function tend to be highly conserved. EVE (Evolutionary model of
Variant Effect) formalizes this reasoning with a deep generative model trained entirely on
natural sequence variation, without any reliance on clinical labels. This track shows EVE
scores for all possible missense substitutions in 2,949 disease-associated proteins.
Each entry spans one protein at its genomic locus. The heatmap columns correspond to
individual amino acid positions in the protein, placed at the codon's genomic coordinate.
The rows correspond to the 20 standard amino acids (A–Y, alphabetical). Each cell
shows the EVE score for substituting the wildtype amino acid at that position with the row
amino acid. Empty cells indicate the wildtype amino acid at a given position (no
substitution) or positions for which no score is available.
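As a rough illustration of the layout described above, the following sketch arranges per-substitution scores into a 20-row grid with one column per protein position. The function name `build_heatmap` and the shape of the `scores` dictionary are assumptions made for this example; this is not the track's actual conversion code.

```python
# Rows: the 20 standard amino acids, alphabetical by one-letter code (A..Y).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_heatmap(wildtype_seq, scores):
    """Return a 20 x len(wildtype_seq) grid of EVE scores (None = empty cell).

    `scores` is a hypothetical mapping (position, variant_aa) -> EVE score,
    with 1-based protein positions.
    """
    grid = [[None] * len(wildtype_seq) for _ in AMINO_ACIDS]
    for row, aa in enumerate(AMINO_ACIDS):
        for col, wt in enumerate(wildtype_seq):
            if aa == wt:
                # Wildtype residue: no substitution, so the cell stays empty.
                continue
            # Missing scores also leave the cell empty (None).
            grid[row][col] = scores.get((col + 1, aa))
    return grid
```

Cells stay empty in two cases, matching the text: the row amino acid equals the wildtype residue, or no score exists for that substitution.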
Cells are colored on a gradient:
EVE ≈ 0 — benign: the substitution is well-tolerated in evolutionary sequence variation
Hovering over a cell shows the wildtype amino acid, the protein position, the variant amino
acid, the EVE score, and the Class25 classification (benign, uncertain, or pathogenic using
a 25% uncertainty threshold).
For reverse-strand genes, protein positions are displayed left to right in genomic
order (C-terminus to N-terminus on the screen), consistent with the standard genome
browser orientation.
EVE trains a Bayesian variational autoencoder (VAE) separately for each of 3,219
disease-associated human proteins; of these, 2,951 have missense VCF scores in the
public bulk download (the remainder have sequence alignments but no released scores).
For each protein, a multiple sequence alignment is
retrieved by searching roughly 250 million protein sequences from UniRef, and the VAE
learns the distribution of amino acid sequences across species, capturing both
per-position conservation and co-evolutionary dependencies between positions. An
evolutionary index for each single amino acid variant is then computed as the
approximate negative log-likelihood ratio of the variant versus the wildtype sequence,
estimated by sampling from the VAE posterior (ensembled over five independently trained
models). A global-local mixture of Gaussian mixture models, fit to the index distributions
across all variants and all proteins, converts this continuous index to an EVE score
between 0 (benign) and 1 (pathogenic) and assigns each variant to a benign, uncertain, or
pathogenic class. The uncertainty of each classification reflects the predictive entropy of
the mixture model, and a threshold on this entropy controls what fraction of variants is
labeled uncertain. The Class25 field used in the track mouseovers classifies variants using
a 25% uncertainty threshold, which the authors report yields approximately 90% accuracy on
known ClinVar labels. See Frazer et al. 2021 for full details.
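The entropy-based thresholding described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that the EVE score can be read as the probability of the pathogenic class, that uncertainty is the binary entropy of that probability, and that the 25% cutoff is taken over whatever set of scores is passed in. The published Class25 labels were computed by the authors over their own variant set and may differ from what this sketch produces.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a two-class prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def classify(scores, uncertain_fraction=0.25):
    """Label each score benign, uncertain, or pathogenic.

    Assumption: the `uncertain_fraction` of variants with the highest
    entropy is labeled uncertain; the rest are split at a score of 0.5.
    """
    n_uncertain = int(uncertain_fraction * len(scores))
    by_entropy = sorted(
        range(len(scores)),
        key=lambda i: binary_entropy(scores[i]),
        reverse=True,
    )
    uncertain = set(by_entropy[:n_uncertain])
    labels = []
    for i, score in enumerate(scores):
        if i in uncertain:
            labels.append("uncertain")
        elif score >= 0.5:
            labels.append("pathogenic")
        else:
            labels.append("benign")
    return labels
```

Raising `uncertain_fraction` labels more variants as uncertain, leaving only the more confident calls in the benign and pathogenic classes.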
The data were downloaded as a bulk archive from
https://evemodel.org/download/bulk.
The archive contains one VCF file per protein, each listing every possible missense
substitution with its EVE score and per-threshold classifications. Multiple VCF records
encoding different codon changes for the same amino acid substitution carry identical EVE
scores and were deduplicated to one record per substitution. Records were converted to
heatmap bigBed format using a custom Python script; full processing instructions are in the
makedoc file, and the conversion script is available in
our GitHub repository.
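The deduplication step can be sketched as below. The record layout (dictionaries with `protein`, `pos`, `wt`, `var`, and `eve` keys) is hypothetical and chosen only for illustration; the actual conversion script in the repository may parse and store records differently.

```python
def deduplicate(records):
    """Keep one record per (protein, position, wildtype, variant) substitution.

    `records` is an iterable of dicts with the hypothetical keys
    protein, pos, wt, var, and eve.
    """
    unique = {}
    for rec in records:
        key = (rec["protein"], rec["pos"], rec["wt"], rec["var"])
        if key in unique:
            # Different codon changes for the same substitution are expected
            # to carry identical scores; fail loudly if they do not.
            if unique[key]["eve"] != rec["eve"]:
                raise ValueError(f"conflicting EVE scores for {key}")
            continue
        unique[key] = rec
    return list(unique.values())
```

Raising an error on conflicting scores is a design choice for this sketch: it turns the assumption that duplicates are identical into an explicit check.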
Two proteins (G6PT1, UniProt O43826; and MAFIP, Q8WZ33) were excluded because their
VCF coordinates mapped to assembly scaffolds (chrCHR_HG2217_PATCH and chrGL000194.1)
absent from the standard hg38 assembly. The remaining 2,949 proteins, covering approximately
1.7 million amino acid positions, are included in this track.
The data can be explored interactively in table format with the
Table Browser or the
Data Integrator and exported from there to
spreadsheet or tab-separated tables. From scripts, the data can be accessed through our
API, track=eve.
Description
Display Conventions
EVE ≈ 0.5 — uncertain: the model cannot confidently classify the variant
EVE ≈ 1 — pathogenic: the substitution is rare or absent across evolutionary sequences
Methods
Data Access
For automated download and analysis, the genome annotation is stored in a bigBed file that can be downloaded from our download server. The file for this track is called eve.bb. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain features within a given range, e.g.
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/eve/eve.bb -chrom=chr17 -start=43000000 -end=43200000 stdout
The original annotation source data can be downloaded from https://evemodel.org/download/bulk.
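For script-based access through the API mentioned above, a request for the same chr17 region might look like the following sketch. It assumes the UCSC REST API endpoint `https://api.genome.ucsc.edu/getData/track` with semicolon-separated parameters; the helper names `build_track_url` and `fetch_track` are made up for this example, and the structure of the returned JSON is not described here.

```python
import json
import urllib.request

# Assumed endpoint of the UCSC REST API for track data.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(genome, track, chrom, start, end):
    """Build a getData/track URL for one genomic region."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return API_BASE + "?" + ";".join(f"{k}={v}" for k, v in params.items())


def fetch_track(genome, track, chrom, start, end):
    """Download and decode the JSON response (requires network access)."""
    url = build_track_url(genome, track, chrom, start, end)
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

A call such as `fetch_track("hg38", "eve", "chr17", 43000000, 43200000)` would request the region used in the bigBedToBed example.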
Thanks to Jonathan Frazer, Pascal Notin, Mafalda Dias, and Debora S. Marks at Harvard Medical School and Yarin Gal at the University of Oxford for making the EVE scores publicly available at evemodel.org.
Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, Gal Y, Marks DS. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021 Nov;599(7883):91-95. PMID: 34707284