aa61ebc800429515f9ced7e28f669c6042219f43 max Wed Mar 18 09:09:13 2026 -0700
varFreqs supertrack: add GREGoR track, update all HTML docs, move scripts to varFreqs/, refs #36642

Add GREGoR R04 WGS track to varFreqs superTrack. Update Data Access and Methods sections for all 20+ subtrack HTML files with consistent formatting, sequencing methods from source papers, and links to makeDoc and Github scripts. Move all varFreqs conversion scripts into scripts/varFreqs/ subdirectory and update makeDoc paths accordingly.

Co-Authored-By: Claude Opus 4.6

diff --git src/hg/makeDb/trackDb/human/sgdpFreq.html src/hg/makeDb/trackDb/human/sgdpFreq.html
new file mode 100644
index 00000000000..adecc0755b3
--- /dev/null
+++ src/hg/makeDb/trackDb/human/sgdpFreq.html
@@ -0,0 +1,69 @@
+

Description

+

+The Simons Genome Diversity Project (SGDP), funded by the Simons Foundation, +sequenced high-coverage genomes from 300 individuals (279 in this track) representing 142 diverse +and often indigenous populations worldwide. Its goal was to capture the full range of human +genetic diversity to better understand population history, migration, and adaptation. It samples +populations in a way that represents as much anthropological, linguistic and cultural diversity +as possible, and thus includes many deeply divergent human populations that are not well +represented in other datasets. +

+ +

+This track shows allele frequencies only. The full phased genotype data with haplotype +clustering display is available in the +SGDP track under Phased Variants. +Not all SGDP data is public, so this track contains only 279 genomes. +The hg38 data was lifted from hg19. +

+ +

Data Access

+

+The data can be explored interactively with the +Table Browser or the +Data Integrator. +For programmatic access, our REST API can be used; the +track name is sgdpFreq. +For bulk download, the VCF file can be obtained from +our download server. +
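
As an illustration of the programmatic access described above, the following is a minimal Python sketch that builds a request against the UCSC REST API `getData/track` endpoint for the `sgdpFreq` track. The helper names `build_track_url` and `fetch_track`, the example region, and the choice of `hg38` as the genome are assumptions made for this example; they are not part of the track documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the public UCSC REST API.
API_BASE = "https://api.genome.ucsc.edu"


def build_track_url(genome, chrom, start, end, track="sgdpFreq"):
    """Return a getData/track URL for one region (hypothetical helper).

    start is 0-based and end is 1-based, as in other UCSC coordinates.
    """
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}/getData/track?{urlencode(params)}"


def fetch_track(genome, chrom, start, end, track="sgdpFreq"):
    """Download the region and return the decoded JSON (hypothetical helper)."""
    url = build_track_url(genome, chrom, start, end, track)
    with urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_track("hg38", "chr1", 1000000, 1010000)` would request the variants in that window. The layout of the returned JSON is not described here, so inspect the result before relying on specific keys.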

+ +

The original source VCFs are available from +https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/vcf_variants/. +

+ +

Methods

+

+High-coverage whole-genome sequencing of 300 individuals (279 publicly available) from 142 +diverse populations was performed on Illumina instruments using PCR-free library preparation at +an average depth of 43x. Reads were aligned to the hs37d5 reference (GRCh37 with decoy +sequences) using BWA-MEM 0.7.12. SNP and indel genotyping was performed using GATK +HaplotypeCaller with joint genotyping across all samples. An independent indel callset was +generated using FermiKit for improved sensitivity at complex variants. The final dataset +contains 34.4 million SNPs and 2.1 million short indels. +

+

+The VCFs were merged with bcftools and lifted to hg38 with CrossMap. At UCSC, genotypes were +stripped to produce a sites-only frequency VCF retaining the existing AC, AF, and AN INFO fields. +The makeDoc file of the track documents how all source files were converted. +Python scripts are also available from GitHub. +
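
To make the genotype-stripping step concrete, here is a minimal Python sketch of what producing a sites-only VCF means. It is not the UCSC conversion script (that one is documented in the makeDoc file); the function name `to_sites_only` is hypothetical, and the sketch assumes uncompressed, tab-separated VCF text. It keeps the eight fixed columns (CHROM through INFO), so the existing AC, AF, and AN values in INFO are carried over unchanged.

```python
def to_sites_only(lines):
    """Yield VCF lines with the FORMAT and per-sample columns removed.

    Hypothetical helper for illustration; assumes plain-text VCF input.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # FORMAT definitions describe genotype fields that no longer exist.
            if line.startswith("##FORMAT="):
                continue
            yield line
        else:
            # Both the #CHROM header line and data lines keep only
            # the 8 fixed columns: CHROM POS ID REF ALT QUAL FILTER INFO.
            yield "\t".join(line.split("\t")[:8])
```

In practice the same result is usually obtained with `bcftools view -G` (`--drop-genotypes`), which also handles compressed input.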

+ +

Credits

+

+This project was funded by the Simons Foundation. Thanks to David Reich and Swapan +Mallick for help with importing the data. +

+ +

References

+

+Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, +Tandon A et al. + +The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. +Nature. 2016 Oct 13;538(7624):201-206. +PMID: 27654912; PMC: PMC5161557 +