9bfd58221b1539193cb7f0a317b4e959c1c7e49a max Thu May 21 01:00:45 2026 -0700
varFreqs: AI generated text sounds bad, hard to read, so remove typical AI language.

"humanizer" pass on all 31 varFreqs description pages: cut em dashes, copula avoidance ("serves as", "stands as"), "-ing" puffery, and boilerplate filler ("We provide documentation that indicates how..."). Title-case headings and meaningful emphasis preserved. No facts/URLs/counts/versions changed.

tpmi.html added as a new file (was previously uncommitted).

refs #36642

Co-Authored-By: Claude Sonnet 4.6

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 2f364dab65d..d6f2839a8dd 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,54 +1,54 @@
Description

This supertrack collects variant allele frequencies from population-scale sequencing and genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays.
-The data was not reprocessed in a harmonized way but the variant VCFs were collected from the projects.
-The goal is to provide a single place to compare how common
+The data was not reprocessed in a harmonized way; the variant VCFs were collected from the projects as-is.
+The goal is a single place to compare how common
a variant is across different populations, ancestries, and cohorts, for projects that cannot be recomputed by gnomAD soon. The main
-combined track merges all databases into one single summary track,
+combined track merges all databases into one summary track,
with filters, summed population frequencies and recalculated protein-effect annotations.
-In addition, there is one subtrack per project with the original VCF data and all the annotations that the project provides.
-The different projects use different pipelines and sequencing technologies, click any of the projects
+There is also one subtrack per project with the original VCF data and all the annotations that the project provides.
+The different projects use different pipelines and sequencing technologies. Click any of the projects
above or below for a summary of their sample selection, sequencing assay and software pipeline.
-Many projects do not allow us to distribute the data but we document how the
-data can be requested and provide all converters.
+Many projects do not allow us to distribute the data, but we document how to request it
+and provide all converters.

Data from projects that provide haplotype-phased genotypes can also be found elsewhere: 1000 Genomes is also a separate track, and the phased genotypes for HGDP, SGDP, HGDP+1000 Genomes and Mexico Biobank are also in the "Phased Variants" track. Their VCF versions below show only the isolate frequency per variant.

Please contact us (genome@soe.ucsc.edu) if you know of a project that we should add. So far, we have requested Regeneron's Million Exomes and Mexico City Studies (request rejected) and Taiwan Biobank (pending).

Combined Track (All Databases)

The "All Databases Combined" track merges variants from all individual databases into a single bigBed file with consequence annotations, totaling 1.17 billion variants from ~1.7 million individuals. The track supports filtering by variant type (SNV, insertion, deletion, MNV), predicted consequence (missense, synonymous, stop gained, frameshift, splice, intron, intergenic), source database, allele frequency (overall maximum
-and per-database), and allele count (total or per-database). This track is either useful in dense mode
-for getting a quick overview of variant density across all projects, or with filters to find
-variants present in specific databases or within certain frequency ranges. Note that with the "clone track"
-feature you can clone this track and have multiple versions, each with different filters activated.
-You can also use our "Density mode" checkbox on the track configuration page to show a plot with the
+and per-database), and allele count (total or per-database). The track is useful in dense mode
+to get a quick overview of variant density across all projects, or with filters to find
+variants present in specific databases or within certain frequency ranges. With the "clone track"
+feature you can clone this track and keep multiple versions, each with different filters activated.
+The "Density mode" checkbox on the track configuration page shows a plot of the
density of variants passing a filter, one per track clone.
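The variant-type and allele-frequency filters described above can be pictured with a small sketch. It assumes normalized single-ALT records and classifies them by allele length; the function names, the `af_by_db` dictionary layout and the database labels `dbA`/`dbB` are hypothetical, and the track's own definitions of each type may differ.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a normalized, single-ALT variant by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def passes_filter(variant: dict, types: set, min_max_af: float) -> bool:
    """Keep a variant if its type is selected and its highest per-database AF reaches the threshold."""
    max_af = max(variant["af_by_db"].values())
    return classify_variant(variant["ref"], variant["alt"]) in types and max_af >= min_max_af


# Hypothetical records: one SNV, one deletion, one insertion.
variants = [
    {"ref": "A", "alt": "G", "af_by_db": {"dbA": 0.02, "dbB": 0.0004}},
    {"ref": "AT", "alt": "A", "af_by_db": {"dbA": 0.0001}},
    {"ref": "C", "alt": "CTT", "af_by_db": {"dbB": 0.3}},
]
kept = [v for v in variants if passes_filter(v, {"SNV", "insertion"}, 0.01)]
print([(v["ref"], v["alt"]) for v in kept])
```

Filtering on the maximum across databases keeps a variant that is common in any one project, even when it is rare or absent in the others.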

Available Datasets
@@ -338,63 +338,63 @@
Notes on Specific Sub-tracks
AllOfUs — local-ancestry-stratified frequencies
The AllOfUs subtrack ships local-ancestry-stratified allele frequencies, not the global ancestry categories used in the All of Us Research Program 2024 Nature paper (see References). Each variant's per-ancestry AF/AC counts only the haplotypes whose inferred local ancestry at that exact genomic position belongs to the named group (strict-both-haps mode). The six ancestry classes (African, Indigenous American, East Asian, European, Oceanian, South Asian) match HGDP-derived local-ancestry reference panels and so include Oceanian, which is not one of the paper's six global Rye categories (those are AFR, AMR, EAS, EUR, Middle Eastern, SAS). For an admixed individual, the local-ancestry AF at a position can therefore differ substantially from the AF among self-reported members of the same ancestry group.
-The pipeline that produced this VCF was developed by the Ioannidis lab (Phoenix, UCSC)
-and applied to the AllOfUs v7 release; only variants with cohort allele count ≥ 20
+The Ioannidis lab (Phoenix, UCSC) developed the pipeline that produced this VCF
+and applied it to the AllOfUs v7 release; only variants with cohort allele count ≥ 20
were retained.
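To make the local-ancestry counting concrete, here is a toy sketch, not the Ioannidis lab pipeline. It assumes each individual is a pair of haplotypes, each carrying an allele (0 = reference, 1 = alternate) and a local-ancestry label at the variant position. The reading of "strict-both-haps" used here, that an individual contributes only when both haplotypes carry the named label, is an assumption, as are the function name and the sample data.

```python
def local_ancestry_af(samples, group, strict_both_haps=True):
    """Count AC/AN over haplotypes whose local ancestry at this site equals `group`.

    samples: list of individuals, each a pair of (allele, ancestry_label) haplotypes.
    strict_both_haps: ASSUMED meaning -- skip individuals unless both haplotypes match `group`.
    """
    ac = an = 0
    for haplotypes in samples:
        if strict_both_haps and any(label != group for _, label in haplotypes):
            continue
        for allele, label in haplotypes:
            if label == group:
                an += 1
                ac += allele
    af = ac / an if an else None
    return ac, an, af


# Hypothetical individuals at one site: (allele, local ancestry) per haplotype.
samples = [
    ((1, "AFR"), (0, "AFR")),  # both haplotypes AFR
    ((1, "AFR"), (0, "EUR")),  # admixed at this position
    ((0, "EUR"), (0, "EUR")),  # both haplotypes EUR
]
print(local_ancestry_af(samples, "AFR"))
print(local_ancestry_af(samples, "AFR", strict_both_haps=False))
```

The two calls give different counts for the same group, which illustrates why a local-ancestry AF need not match an AF computed over whole individuals assigned to one ancestry.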
gnomAD HGDP+1kG — cohort vs full-release frequencies
This subtrack derives from the gnomAD v3.1.2 release, which embeds the 4,094-genome jointly-called HGDP+1kG cohort (Koenig et al. 2024) inside the larger
-gnomAD aggregation. To save space, only INFO fields useful for clinical and
-population-genetic interpretation were retained. Two distinct allele-frequency
+gnomAD aggregation. To save space, we kept only INFO fields useful for clinical and
+population-genetic interpretation. Two allele-frequency
sets are exposed:

The cohort-level AC/AF/AN fields (no prefix) are computed across the ~3,400 unrelated HGDP+1kG individuals (allele number ≈ 6,800).

The per-population filter fields (gnomAD v3.1.2 African AF, gnomAD v3.1.2 Latino AF, etc.) are values from the full gnomAD v3.1.2 release (~76,000 genomes), not just the 4,094-genome HGDP+1kG cohort. The corresponding allele numbers are typically tens of thousands per population.

The trackUI labels and bigBed field descriptions reflect this distinction. Per-population
-HGDP+1kG-cohort frequencies are not exposed because the cohort is too small to give
-stable per-population estimates for many populations.
+HGDP+1kG-cohort frequencies are not exposed because the cohort is too small for
+stable per-population estimates in many populations.
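The allele numbers quoted above follow from diploid counting on autosomes: each individual contributes two alleles, so about 3,400 individuals give an allele number near 6,800. A minimal sketch of that arithmetic follows; the function names and the 25-person population are hypothetical and chosen only to show why small groups give coarse frequency estimates.

```python
def allele_number(n_individuals: int, ploidy: int = 2) -> int:
    """Alleles genotyped at a fully called site (autosomal, diploid by default)."""
    return n_individuals * ploidy


def allele_frequency(ac: int, an: int) -> float:
    """AF = AC / AN; undefined when no alleles were genotyped."""
    if an == 0:
        raise ValueError("allele number is zero")
    return ac / an


cohort_an = allele_number(3400)          # the ~3,400 unrelated individuals from the text
small_pop_an = allele_number(25)         # hypothetical small population

# Smallest nonzero frequency each group can report (a single observed allele).
print(cohort_an, allele_frequency(1, cohort_an))
print(small_pop_an, allele_frequency(1, small_pop_an))
```

With only 50 alleles, a single observed copy already corresponds to a frequency of 2%, so rare variants cannot be resolved in such a group.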
Display Conventions
Most tracks only show the variant and allele frequencies on mouseover or clicks. When zoomed in, tracks display alleles with base-specific coloring. Homozygote
-data are shown as one letter, while heterozygotes will be displayed with both
-letters. All VCF files are normalized, with one single allele per annotation (no multi-allele
+data are shown as one letter; heterozygotes are shown with both
+letters. All VCF files are normalized, with one allele per annotation (no multi-allele
lines).
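"One allele per annotation" means that a multi-allelic VCF record is split into one record per alternate allele, each with its own count. The sketch below shows only that splitting step on plain Python values; the function name and record layout are hypothetical, it is not the code used for these tracks, and full normalization (as done by tools such as bcftools norm) also left-aligns and trims alleles, which is omitted here.

```python
def split_multiallelic(chrom, pos, ref, alts, ac, an):
    """Split one multi-allelic site into one record per ALT allele.

    alts: list of alternate alleles; ac: per-ALT allele counts in the same order;
    an: total allele number shared by all ALT alleles at the site.
    """
    if len(alts) != len(ac):
        raise ValueError("need one allele count per ALT allele")
    records = []
    for alt, alt_count in zip(alts, ac):
        records.append({
            "chrom": chrom, "pos": pos, "ref": ref, "alt": alt,
            "AC": alt_count, "AN": an, "AF": alt_count / an,
        })
    return records


# Hypothetical site with two alternate alleles observed among 8 genotyped alleles.
for record in split_multiallelic("chr1", 100, "A", ["G", "T"], [3, 1], 8):
    print(record)
```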
Methods
Each subtrack ships the upstream project's VCF largely as-released; per-subtrack pipelines (coordinate liftover, format conversion, header normalization) are documented on each subtrack's own description page and recorded in the build documentation. The conversion scripts (e.g. finngen_to_vcf.py, kovaToVcf.py, schema_addAcAnAf.py, svatalogFreqToVcf.py) live alongside the makedoc in the scripts directory.
The combined "All Databases" subtrack is built by a separate pipeline:
@@ -402,31 +402,31 @@
multi-sample callset, consequence annotations are recomputed against Ensembl with bcftools csq, and the result is converted to bigBed via vcfToBigBed.py + bedToBigBed. The mapping from upstream INFO fields to bigBed columns is driven by two configuration files in the scripts directory: databases.tsv (one row per source dataset) and populations.tsv (per-population AC/AF columns within each source). Editing those two files and rerunning mergeAndAnnotate.sh followed by vcfToBigBed.py rebuilds the combined track.
Data Access
All the data is publicly available. The table above indicates if we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but it can always be downloaded from the original websites in some form. Click the database link in the table above and see the "Data Access" section of the respective track for a description of where to download the data. When the data is freely available from our website, the Data Access section will also indicate the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but can be recreated using the conversion scripts in our GitHub repository and the accompanying documentation file.
Credits
-This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click on any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.
+This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.
References
All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature. 2024 Mar;627(8003):340-346. PMID: 38374255; PMC: PMC10937371
Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi KV, Maitra A, Nagabandi T et al.

Database Region N Data Type Cohort Sub-populations Downloadable from UCSC