37a5b97198453bd06cb03d2092cae239f368e84c max Wed Apr 29 05:49:14 2026 -0700
varFreqs: add tishkoff180 subtrack (Fan et al. 2023, 180 indigenous African WGS, hg19 lift)

Sites-only SNP VCF with aggregate AC/AF/AN from 180 individuals (15 each
from 12 populations across Ethiopia, Tanzania, Cameroon, Botswana),
sequenced at >30x on HiSeq X Ten. hg19 calls supplied by the Tishkoff lab
(UPenn) and lifted to hg38 with CrossMap. Redistribution is not permitted,
so tableBrowser is disabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>, refs #36642

diff --git src/hg/makeDb/trackDb/human/tishkoff180.html src/hg/makeDb/trackDb/human/tishkoff180.html
new file mode 100644
index 00000000000..661aa78a1f4
--- /dev/null
+++ src/hg/makeDb/trackDb/human/tishkoff180.html
@@ -0,0 +1,93 @@
+<h2>Description</h2>
+<p>
+This track shows allele frequencies from high-coverage whole-genome sequencing of
+180 individuals (15 per population) from 12 indigenous African populations representing
+all four major African language phyla (Khoesan, Niger-Congo, Nilo-Saharan, Afroasiatic).
+The cohort, generated by the Tishkoff lab and collaborators (Fan et al., <em>Cell</em> 2023),
+spans the Amhara, Dizi, Chabu and Mursi from Ethiopia; the Hadza and Sandawe from Tanzania;
+the Central African rainforest hunter-gatherers (Baka and Bagyeli, merged), Fulani and Tikari
+from Cameroon; and the Herero, Ju|'hoansi and !Xoo (the latter two collectively the "San")
+from Botswana. The dataset was generated to capture demographic history and signatures of
+local adaptation in African populations that are poorly represented in other reference panels.
+</p>
+
+<p>
+Only aggregate allele frequencies (AC, AF, AN summed over all 180 individuals) are
+shown for each variant; per-population frequencies are not provided in the released
+sites VCF. The original variant calls were on the GRCh37/hs37d5 reference and were
+lifted to hg38 at UCSC.
+</p>
+
+<h2>Display Conventions</h2>
+<p>
+Variants display as standard VCF allele frequency tracks. On mouseover and click,
+the allele count (AC), total allele number (AN) and allele frequency (AF) are shown.
+When zoomed in, alleles are colored by base. Multi-allelic records were split into
+biallelic rows during normalization upstream.
+</p>
+
+<h2>Methods</h2>
+<p>
+Whole genome sequencing of 180 individuals (15 unrelated samples per population)
+was performed at >30× average coverage on the Illumina HiSeq X Ten platform
+using PCR-free library preparation with paired-end 150 bp reads and a 350 bp
+insert size. Adapters were trimmed with trimadap, optical duplicates were marked with
+SAMBLASTER (v0.1.22), and reads were aligned to the hs37d5 decoy version of GRCh37
+with BWA-MEM (v0.7.10). Reads with mapping quality &lt; 20 were filtered out. Per-sample
+short variants were called with GATK HaplotypeCaller (nightly-2016-09-26-gfade77f) in
+gVCF mode using a custom genotype prior (0.4995, 0.001, 0.4995) to reduce reference
+bias, following the SGDP recommendation. Joint genotyping was performed with GATK
+GenotypeGVCFs. Variants were filtered with GATK VQSR using 1000 Genomes Phase 3,
+Illumina Omni 5M and HapMap as SNP truth sets and Mills indels as the indel truth set.
+Variants overlapping potential duplications detected by Delly (v0.7.6) and low-complexity
+regions were excluded. After QC the cohort yielded 32.4 M SNPs and 2.8 M small
+indels. The publicly released SNP-only sites VCF used here contains 33.6 M
+biallelic SNPs with aggregate AC/AF/AN summaries. See Fan et al. (2023) for full
+methods.
+</p>
+
+<p>
+The hg19 SNP sites VCF was provided directly by Matthew Hansen at the Tishkoff lab
+(University of Pennsylvania) via a Box link
+(<tt>180wgs.SNPs.sites.AF.vcf.gz</tt>). Bare chromosome names (1-22) were converted
+to UCSC-style names with <tt>bcftools annotate --rename-chrs</tt>, the VCF was lifted
+from hg19 to hg38 with <tt>CrossMap.py vcf</tt> using the UCSC
+<tt>hg19ToHg38.over.chain.gz</tt> chain, then sorted, bgzip-compressed and tabix-indexed
+with <tt>bcftools sort</tt> and <tt>tabix</tt>. Step-by-step processing instructions are in
+the
+<a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
+target="_blank">makeDoc file</a>; the supporting scripts live under
+<a href="https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/scripts/varFreqs"
+target="_blank">kent/src/hg/makeDb/scripts/varFreqs</a>.
+</p>
+
+<h2>Data Access</h2>
+<p>
+The original (hg19) variant calls and supplementary data accompany the publication;
+see the "Data and code availability" section of Fan et al. (2023). The dataset is
+not available for redistribution from our website, so the Table Browser, Data
+Integrator and download server are disabled for this track. The hg19 sites VCF can
+be requested from the Tishkoff lab at the University of Pennsylvania.
+</p>
+
+<h2>Credits</h2>
+<p>
+Thanks to Matthew Hansen and Sarah Tishkoff (University of Pennsylvania) for sharing
+the sites-only allele-frequency VCF, and to all participating individuals and field
+collaborators in Ethiopia, Tanzania, Cameroon and Botswana whose contributions made
+this dataset possible.
+</p>
+
+<h2>References</h2>
+
+<p>
+Fan S, Spence JP, Feng Y, Hansen MEB, Terhorst J, Beltrame MH, Ranciaro A, Hirbo J, Beggs W, Thomas
+N <em>et al</em>.
+<a href="https://linkinghub.elsevier.com/retrieve/pii/S0092-8674(23)00101-0" target="_blank">
+Whole-genome sequencing reveals a complex African population demographic history and signatures of
+local adaptation</a>.
+<em>Cell</em>. 2023 Mar 2;186(5):923-939.e14.
+PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/36868214" target="_blank">36868214</a>; PMC: <a
+href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10568978/" target="_blank">PMC10568978</a>
+</p>
+