a729e1ee610d17d3dcb36a45437f1709e5699558 max Fri Jun 5 02:18:09 2026 -0700
varFreqs: fix collection page leftovers from the combined-track redesign

Update the SNV Frequencies collection description page for the two-track (Affected / Background) layout: correct the combined-track count to three, rewrite the Methods paragraph that still described building the old single "All Databases" track (now describes the phenotype split via vcfToBigBed.py --split-affected), and set the Affected/Background sample-count cells in the dataset table to ~130k and ~1.5M.
refs #36642

diff --git src/hg/makeDb/trackDb/human/varFreqs.html src/hg/makeDb/trackDb/human/varFreqs.html
index 8a6261da7ed..39f9d48bdad 100644
--- src/hg/makeDb/trackDb/human/varFreqs.html
+++ src/hg/makeDb/trackDb/human/varFreqs.html
@@ -1,23 +1,23 @@
 <h2>Description</h2>
 <p>
 This track collection gathers variant allele frequencies from population-scale sequencing and genotyping projects worldwide, from a total of ~1.7 million genomes/exomes/arrays. The data was not reprocessed in a harmonized way; the variant VCFs were collected from the projects as-is. The goal is a single place to compare how common a variant is across different populations, ancestries, and cohorts, for projects that cannot be recomputed by
-gnomAD soon. Two combined tracks aggregate the source data along different lines, and
+gnomAD soon. Three combined tracks aggregate the source data along different lines, and
 there is also one subtrack per project with the original VCF data and all the annotations that the project provides. The different projects use different pipelines and sequencing technologies. Click any of the projects above or below for a summary of their sample selection, sequencing assay and software pipeline. Many projects do not allow us to distribute the data, but we document how to request it and provide all converters.
 </p>
 <p>
 Data from projects that provide haplotype-phased genotypes can also be found elsewhere: 1000 Genomes is also a separate track, and the phased genotypes HGDP, SGDP, HGDP+1000 Genomes and Mexico Biobank can also be found in the "Phased Variants" track. Their VCF versions below show only the isolate frequency per variant.
 </p>
 <p>Please contact us (<A HREF="mailto:genome@soe.ucsc.edu">genome@soe.ucsc.edu</A><!-- above address is genome at soe.ucsc.edu -->) if you know of a project that we should add. So far,
@@ -52,40 +52,40 @@
 <h3>Available Datasets</h3>
 <table class="stdTbl">
 <tr>
  <th>Database</th>
  <th>Region</th>
  <th>N</th>
  <th>Data Type</th>
  <th>Cohort</th>
  <th>Sub-populations</th>
  <th>Downloadable from UCSC</th>
 </tr>
 <tr>
  <td><a href="hgTrackUi?g=varFreqsAffected">Affected/Case Individuals</a></td>
  <td>Sequencing-based disease cohorts</td>
- <td>—</td>
+ <td>~130k</td>
  <td>WGS/WES/long-read</td>
  <td>Affected/case arms of SFARI SPARK WES/WGS, SCHEMA, GREGoR, GA4K</td>
  <td>Affected/case AF and AC; background AF for contrast</td>
  <td>No</td>
 </tr>
 <tr>
  <td><a href="hgTrackUi?g=varFreqsBackground">Population + Unaffected</a></td>
  <td>Sequencing-based, population + unaffected</td>
- <td>~1.7mil</td>
+ <td>~1.5mil</td>
  <td>WGS/WES/long-read</td>
  <td>Population cohorts + unaffected/control arms</td>
  <td>Background AF and AC; per-cohort and ancestry breakdowns</td>
  <td>No</td>
 </tr>
 <tr>
  <td><a href="hgTrackUi?g=varFreqsArray">Genotyping Array Databases Combined</a></td>
  <td>TPMI, MexBB, UKBB</td>
  <td>~530k</td>
  <td>Array / imputed</td>
  <td>14.7M variants</td>
  <td>—</td>
  <td>No</td>
 </tr>
 <tr>
@@ -422,39 +422,42 @@
 letters. All VCF files are normalized, with one allele per annotation (no multi-allele lines).
 </p>
 <h2>Methods</h2>
 <p>
 Each subtrack includes the upstream project's VCF largely as-released; per-subtrack pipelines (coordinate liftover, format conversion, header normalization) are documented on each subtrack's own description page and recorded in the <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">build documentation</a>. The conversion scripts (<em>e.g.</em> <code>finngen_to_vcf.py</code>, <code>kovaToVcf.py</code>, <code>schema_addAcAnAf.py</code>, <code>svatalogFreqToVcf.py</code>) live alongside the makedoc in the <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">scripts directory</a>.
 </p>
 <p>
-The combined "All Databases" subtrack is built by a separate pipeline:
+The combined Affected and Background tracks are built by a separate pipeline:
 each per-subtrack VCF is normalized (<code>bcftools norm</code>), all sites are merged into a single
-multi-sample callset, consequence annotations are recomputed against Ensembl with <code>bcftools csq</code>,
-and the result is converted to bigBed via <code>vcfToBigBed.py</code> + <code>bedToBigBed</code>.
-The mapping from upstream INFO fields to bigBed columns is driven by two configuration files in the
-scripts directory: <code>databases.tsv</code> (one row per source dataset) and
-<code>populations.tsv</code> (per-population AC/AF columns within each source).
-Editing those two files and rerunning <code>mergeAndAnnotate.sh</code> followed by
-<code>vcfToBigBed.py</code> rebuilds the combined track.
+callset, consequence annotations are recomputed against Ensembl with <code>bcftools csq</code>,
+and the merged callset is split by phenotype into the two bigBed files via
+<code>vcfToBigBed.py</code> + <code>bedToBigBed</code>. The mapping from upstream INFO fields to
+bigBed columns is driven by two configuration files in the scripts directory:
+<code>databases.tsv</code> (one row per source dataset, flagging which cohorts study a disease)
+and <code>populations.tsv</code> (per-population AC/AF columns within each source, including the
+affected and unaffected arm of each disease cohort). Editing those two files and rerunning
+<code>mergeAndAnnotate.sh</code> followed by <code>vcfToBigBed.py --split-affected</code> rebuilds
+the two tracks. The Genotyping Array Databases Combined track is built the same way from the
+array cohorts only.
 </p>
 <h2>Data Access</h2>
 <p>All the data is publicly available. The table above indicates if we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but it can always be downloaded from the original websites in some form. Click the database link in the table above and see the "Data Access" section of the respective track for a description of where to download the data. When the data is freely available from our website, the Data Access section will also indicate the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but can be recreated using the conversion scripts in our <a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs" target="_blank">GitHub repository</a> and the accompanying <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">documentation file</a>.
 </p>
 <h2>Credits</h2>
 <p>This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the inspiration for this track and to Andreas Lahner, MGZ, for feedback.</p>
 <h2>References</h2>
 <p>
 All of Us Research Program Genomics Investigators. <a href="https://doi.org/10.1038/s41586-023-06957-x" target="_blank">