2f44ab096235810d5d621b7356bc64fbebe82494 lrnassar Wed May 13 13:02:57 2026 -0700
varFreqs: fix numeric-claim discrepancies on indigenomes.html and sgdpFreq.html.
indigenomes.html: clarify that the deployed VCF is the public release subset (18,016,257 records) of the larger Jain 2021 callset (55.8M variants), and note that the public release is sites-only with a VRT variant-type INFO field and no AC/AF.
sgdpFreq.html: update Methods to reflect the deployed file (44,756,737 SNV records, 601,775 multiallelic decomposed); drop the "34.4M SNPs + 2.1M indels" claim; clarify that the Mallick 2016 FermiKit indel callset is not carried in this track.
refs #36642

diff --git src/hg/makeDb/trackDb/human/sgdpFreq.html src/hg/makeDb/trackDb/human/sgdpFreq.html
index 2493a74358b..c1ff40f6849 100644
--- src/hg/makeDb/trackDb/human/sgdpFreq.html
+++ src/hg/makeDb/trackDb/human/sgdpFreq.html
@@ -27,38 +27,41 @@ track name is sgdpFreq. For bulk download, the VCF file can be obtained from our download server.
The original source VCFs are available from https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/.
High-coverage whole-genome sequencing of 300 individuals (279 publicly available) from 142 diverse populations was performed on Illumina instruments using PCR-free library preparation at an average depth of 43x. Reads were aligned to the hs37d5 reference (GRCh37 with decoy
-sequences) using BWA-MEM 0.7.12. SNP and indel genotyping was performed using GATK
-HaplotypeCaller with joint genotyping across all samples. An independent indel callset was
-generated using FermiKit for improved sensitivity at complex variants. The final dataset
-contains 34.4 million SNPs and 2.1 million short indels.
+sequences) using BWA-MEM 0.7.12. SNP genotyping was performed using GATK
+HaplotypeCaller with joint genotyping across all samples. (The Mallick 2016 release also
+includes an independent indel callset generated with FermiKit; indels are not carried in
+this track.)
-The VCFs were merged with bcftools and lifted to hg38 with CrossMap. At UCSC, genotypes were
-stripped to produce a sites-only frequency VCF retaining the existing AC, AF, and AN INFO fields.
+The per-sample VCFs were merged with bcftools and lifted to hg38 with CrossMap. At UCSC,
+genotypes were stripped to produce a sites-only frequency VCF retaining the AC, AF, and AN
+INFO fields. The deployed file contains 44,756,737 SNV records (601,775 of which represent
+multiallelic sites split into separate biallelic records). Indels from the source callset
+are not included.
We provide documentation that indicates how all source files were converted in the makeDoc file of the track. Python scripts are also available from GitHub.
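The two UCSC-side steps named in this hunk (stripping genotypes to a sites-only VCF, and representing multiallelic sites as separate biallelic records) can be illustrated with a minimal Python sketch. It is not the track's actual conversion code, which is documented in the makeDoc and the GitHub scripts. The function names, the `PER_ALT_KEYS` set, and the example record are hypothetical, and the sketch assumes tab-separated VCF data lines in which AC and AF carry one value per ALT allele.

```python
from typing import List

# Assumption: these INFO keys hold one comma-separated value per ALT allele.
PER_ALT_KEYS = {"AC", "AF"}


def strip_genotypes(vcf_line: str) -> str:
    """Keep the eight fixed VCF columns (CHROM..INFO); drop FORMAT and samples."""
    fields = vcf_line.rstrip("\n").split("\t")
    return "\t".join(fields[:8])


def split_multiallelic(site_line: str) -> List[str]:
    """Split a sites-only record with several ALT alleles into one record per ALT.

    Per-allele INFO values (AC, AF) are distributed to the matching record;
    every other INFO item (for example AN) is copied unchanged.
    """
    fields = site_line.split("\t")
    alts = fields[4].split(",")
    if len(alts) == 1:
        return [site_line]

    info_items = fields[7].split(";")
    records = []
    for i, alt in enumerate(alts):
        new_info = []
        for item in info_items:
            key, _, value = item.partition("=")
            values = value.split(",")
            if key in PER_ALT_KEYS and len(values) == len(alts):
                new_info.append(f"{key}={values[i]}")
            else:
                new_info.append(item)
        record = fields[:4] + [alt] + fields[5:7] + [";".join(new_info)]
        records.append("\t".join(record))
    return records


# Made-up example record with two ALT alleles and two sample columns.
example = "chr1\t100\t.\tA\tC,G\t.\tPASS\tAC=3,1;AF=0.005,0.002;AN=558\tGT\t0/1\t0/2"
site_only = strip_genotypes(example)
biallelic = split_multiallelic(site_only)
```

Under these assumptions, one input record with two ALT alleles yields two output records, which is the kind of expansion the 601,775 figure in the added text refers to.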
This project was funded by the Simons Foundation. Thanks to David Reich and Swapan Mallick for help with importing the data.
Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A et al.