f57e872b724de4bb82b14f07db837aeed4f5174a
gperez2
Wed Jun 17 03:55:08 2026 -0700
Fix commas and update wording in varFreqs description pages. refs #37733

diff --git src/hg/makeDb/trackDb/human/varFreqsArray.html src/hg/makeDb/trackDb/human/varFreqsArray.html
index b92b7fb5a94..0a8fed95a94 100644
--- src/hg/makeDb/trackDb/human/varFreqsArray.html
+++ src/hg/makeDb/trackDb/human/varFreqsArray.html
@@ -1,176 +1,176 @@

Description

This track merges variants from three genotyping-array cohorts into a single bigBed file with predicted protein consequences and cross-database filtering. It contains 14.7 million variants from the Taiwan Precision Medicine Initiative (TPMI Axiom TPM1 chip,
-~1 million Han Chinese), the Mexico Biobank (MexBB, 6,011 individuals), and UK Biobank
+~1 million Han Chinese), the Mexico Biobank (MexBB, 6,011 individuals), and the UK Biobank
(361k unrelated white British, imputed from the Neale Lab Round 2 release).

The array track is kept separate from the sequencing-based combined tracks (Disease cohorts and Population reference) so that sequencing-based and array-based frequencies can be inspected independently. For a summary of all available variant frequency databases, see the SNV Frequencies supertrack page.

Display Conventions

Color by Consequence

Variants are colored by their most severe predicted consequence:

Color	Consequence class	Examples
	Protein-truncating / loss-of-function	stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
	Missense / in-frame	missense, inframe_insertion, inframe_deletion, protein_altering
	Synonymous	synonymous, stop_retained
	Non-coding / intergenic	intron, non_coding, intergenic, UTR
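
As a rough illustration of "most severe predicted consequence", the sketch below assigns a variant to the first class in the table that matches any of its consequence tokens. The token spellings are the examples from the table only, and treating the table's row order as the severity ranking is an assumption; the browser's actual coloring code is not shown in this page.

```python
# Consequence classes taken from the table above. The ordering (most severe
# first) is an ASSUMPTION based on the table's row order.
CLASSES = [
    ("Protein-truncating / loss-of-function",
     {"stop_gained", "frameshift", "splice_donor", "splice_acceptor",
      "stop_lost", "start_lost"}),
    ("Missense / in-frame",
     {"missense", "inframe_insertion", "inframe_deletion", "protein_altering"}),
    ("Synonymous", {"synonymous", "stop_retained"}),
    ("Non-coding / intergenic", {"intron", "non_coding", "intergenic", "UTR"}),
]


def most_severe_class(consequences):
    """Return the first (most severe) class matching any token, else None.

    `consequences` is a comma-separated string of consequence tokens.
    Tokens not listed in the table are ignored (hypothetical behavior).
    """
    tokens = {t.strip() for t in consequences.split(",") if t.strip()}
    for name, members in CLASSES:
        if tokens & members:
            return name
    return None
```

For example, a variant annotated with both intron and missense would fall into the "Missense / in-frame" class under this ordering.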

Amino Acid Change Notation

The "AA change" field uses bcftools csq notation: 23I>23V means position 23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a stop codon (e.g. 45R>45* is a stop_gained).
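
The notation can be read mechanically. The following is a minimal sketch that handles only the three cases described above (missense, synonymous, stop_gained); the function and class names are hypothetical, and other bcftools csq forms are simply labeled "other".

```python
import re
from typing import NamedTuple, Optional


class AAChange(NamedTuple):
    position: int          # amino acid position in the protein
    ref: str               # reference residue(s), one-letter code
    alt: Optional[str]     # altered residue(s), or None if unchanged


# <pos><residues> optionally followed by ><pos><residues>; "*" is a stop codon.
_AA_PATTERN = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def parse_aa_change(text):
    """Parse strings such as '23I>23V', '23I' or '45R>45*'."""
    match = _AA_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    pos, ref, _alt_pos, alt = match.groups()
    return AAChange(int(pos), ref, alt)


def describe(change):
    """Label the three cases covered in the text; everything else is 'other'."""
    if change.alt is None:
        return "synonymous"
    if change.alt == "*" and change.ref != "*":
        return "stop_gained"
    if len(change.ref) == 1 and len(change.alt) == 1 and "*" not in change.ref:
        return "missense"
    return "other"
```

Calling describe(parse_aa_change("23I>23V")) yields "missense", while "23I" alone is reported as "synonymous".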

Caveats

Allele frequencies from genotyping arrays are not directly comparable to those from whole-genome or whole-exome sequencing. Two limitations to keep in mind:

-Probe coverage is sparse and curated. Array variants are only those the
-manufacturer designed probes for. Absence from this track does not mean a
+Probe coverage is sparse and curated. Array variants are only those for which the
+manufacturer designed probes. Absence from this track does not mean a
variant is absent in that population, only that it was not on the chip.
Per-variant call confidence varies and is sometimes unreported. TPMI publishes a per-probe NGS_concordance value (chip-vs-sequencing concordance from its own validation) in the source VCF; high-AF claims with low concordance are
-common. MexBB ships only AN/AF/AC with no FILTER column and no per-site QC at all.
+common. MexBB provides only AN/AF/AC with no FILTER column and no per-site QC.
For both arrays, high-AF rare-disease candidates should be cross-checked against the sequencing-based Population reference track before drawing conclusions.

Filters

This track supports filtering via the track settings page. Click the track title or use the "Configure" button to access filters.

Variant Type and Consequence

Variant Type: SNV, Insertion, Deletion, or MNV.
Consequence: Missense, Synonymous, Stop Gained, Frameshift, Splice Donor, Splice Acceptor, Intron, 3' UTR, 5' UTR, Non-coding, Intergenic, or Other. The filter uses OR logic across the comma-separated consequence tokens on each variant. See the SNV Frequencies supertrack page for a complete description of the "Other" bucket.
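
The OR logic of the consequence filter can be pictured as a set intersection: a variant passes if at least one of its comma-separated tokens is among the selected values. This is only a sketch of the semantics; the token spellings and the function name are hypothetical, and the browser's own filter implementation is not shown here.

```python
def passes_consequence_filter(variant_consequences, selected):
    """Return True if ANY token on the variant is in the selected set.

    `variant_consequences` is a comma-separated string; `selected` is a set
    of consequence tokens chosen in the filter (spellings are illustrative).
    """
    tokens = {t.strip() for t in variant_consequences.split(",") if t.strip()}
    return bool(tokens & set(selected))
```

Under this reading, a variant carrying "missense,intron" is shown when either Missense or Intron is selected, and hidden only when neither is.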

Frequency and Count Filters

Max Allele Frequency: Filter by the maximum AF observed across the three array sources.
Total Allele Count: Filter by the sum of allele counts across all three databases.
Per-database AF and AC: Filter by allele frequency or count in any specific source (TPMI Taiwan, Mexico Biobank, UK Biobank imputed).
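
To make the two aggregate fields concrete, here is a small sketch of how a maximum AF and a total AC could be derived from per-source values. The dictionary keys and the treatment of sources where the variant is absent (skipped, represented as None) are assumptions, not a description of the track's build scripts.

```python
def max_allele_frequency(af_by_source):
    """Largest AF across sources, ignoring sources without the variant."""
    present = [af for af in af_by_source.values() if af is not None]
    return max(present) if present else None


def total_allele_count(ac_by_source):
    """Sum of allele counts across sources, ignoring missing sources."""
    return sum(ac for ac in ac_by_source.values() if ac is not None)
```

For a variant seen in two of the three sources, for instance with AF values of 0.012 and 0.03 and allele counts of 5 and 7, these helpers return a maximum AF of 0.03 and a total AC of 12.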

Source Database

The Source Database filter restricts the display to variants present in specific databases. It uses OR logic.

Length Filters

Reference/Alternate Length: Filter by the length of the reference or alternate allele.
Length Change: Filter by the size difference between alternate and reference (positive = insertion, negative = deletion, zero = SNV or MNV).
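
The length fields follow directly from the allele strings. The sketch below computes the length change as defined above and derives a variant type from it; splitting the zero case into SNV (single base) and MNV (several bases) by allele length is an assumption about how the two are told apart.

```python
def length_change(ref, alt):
    """Alternate length minus reference length."""
    return len(alt) - len(ref)


def variant_type(ref, alt):
    """Classify by length change: >0 insertion, <0 deletion, 0 SNV or MNV."""
    delta = length_change(ref, alt)
    if delta > 0:
        return "Insertion"
    if delta < 0:
        return "Deletion"
    # Zero length change: a single substituted base is treated as an SNV,
    # a longer same-length substitution as an MNV (assumption).
    return "SNV" if len(ref) == 1 else "MNV"
```

With ref "A" and alt "ATG" the length change is 2 (an insertion); with ref "AT" and alt "A" it is -1 (a deletion).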

Methods

The same merge-and-annotate pipeline used for the sequencing-based combined tracks (Disease cohorts and Population reference) was run on the array-cohort subset of source VCFs. Each VCF was stripped of its INFO fields, normalized with bcftools norm (splitting multi-allelic sites), and merged with bcftools merge. The merged VCF was then annotated with predicted protein consequences using bcftools csq with the Ensembl GRCh38 release 115 gene annotation (GFF3).
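
The steps described above might look roughly like the following command sequence, assembled here in Python without being executed. The file names, output options, and the choice of -m -any for splitting multi-allelic sites are assumptions made for illustration; the makeDoc file remains the authoritative record of what was actually run.

```python
def build_pipeline(source_vcfs, fasta, gff3, out_vcf="merged.csq.vcf.gz"):
    """Return the bcftools commands as argument lists (nothing is run).

    All paths are hypothetical placeholders supplied by the caller.
    """
    commands = []
    normalized = []
    for vcf in source_vcfs:
        stripped = vcf.replace(".vcf.gz", ".noinfo.vcf.gz")
        norm = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # 1. Strip all INFO fields.
        commands.append(["bcftools", "annotate", "-x", "INFO",
                         "-Oz", "-o", stripped, vcf])
        # 2. Normalize, splitting multi-allelic sites into separate records.
        commands.append(["bcftools", "norm", "-m", "-any",
                         "-Oz", "-o", norm, stripped])
        # Index so that the file can be merged.
        commands.append(["bcftools", "index", "-t", norm])
        normalized.append(norm)
    # 3. Merge the per-source files.
    commands.append(["bcftools", "merge", "-Oz", "-o", "merged.vcf.gz",
                     *normalized])
    # 4. Annotate predicted protein consequences from the Ensembl GFF3.
    commands.append(["bcftools", "csq", "-f", fasta, "-g", gff3,
                     "-Oz", "-o", out_vcf, "merged.vcf.gz"])
    return commands
```

Each returned list could be passed to subprocess.run(cmd, check=True) on a machine with bcftools installed, or printed with shlex.join for inspection.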

The track's makeDoc file documents how each source VCF was converted. Scripts are available from GitHub.

Data Access

The data can be explored interactively with the Table Browser or the Data Integrator. For programmatic access, our REST API can be used; the track name is varFreqsArray.
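
A request for this track through the REST API could be built along these lines. The sketch only constructs the URL; the genome name hg38 is an assumption inferred from the GRCh38 annotation mentioned in Methods, and the region in the usage note is an arbitrary example.

```python
from urllib.parse import urlencode

# Public UCSC REST API endpoint for track data.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_query_url(chrom, start, end, genome="hg38", track="varFreqsArray"):
    """Build a getData/track URL for one region (genome is an assumption)."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}?{urlencode(params)}"
```

For example, track_query_url("chr1", 100000, 200000) gives a URL that can be fetched with any HTTP client, with the response expected as JSON.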

Because the merged callset includes data from multiple sources whose redistribution licenses differ, the combined bigBed is not available for download from our download server. The combined track can be reconstructed from the individual source VCFs using the conversion scripts on GitHub together with the build documentation.

Credits

This track is only possible thanks to the participants in TPMI, the Mexico Biobank, and UK Biobank, who donated samples and provided health information. Click on the individual TPMI, MexBB, or UK Biobank subtracks in the SNV Frequencies supertrack for full project credits. Thanks to Alex Ioannidis, UCSC, for the motivation for this track family and to Andreas Lahner, MGZ, for feedback.

References

For primary citations of each source dataset, see the References section on the SNV Frequencies supertrack page. The merged-track build itself uses the following tools:

Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017 Jul 1;33(13):2037-2039. PMID: 28205675; PMC: PMC5870570

McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. PMID: 27268795; PMC: PMC4893825