0a5389e8b81efa99e932f5e9e2424b27ce1af095 max Sun May 17 14:53:57 2026 -0700
varFreqs: add GenomeIndia subtrack (9,768 WGS, 83 endogamous Indian populations, Bhattacharyya 2025) and wire into databases.tsv for the next varFreqsAll rebuild. TSV->VCF conversion synthesizes AC=round(AF*AN) with AN=19536 since the upstream release only ships AF, refs #36642
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git src/hg/makeDb/trackDb/human/genomeindia.html src/hg/makeDb/trackDb/human/genomeindia.html
new file mode 100644
index 00000000000..ad13b6c2855
--- /dev/null
+++ src/hg/makeDb/trackDb/human/genomeindia.html
@@ -0,0 +1,107 @@
+<h2>Description</h2>
+<p>
+The <a href="https://ibdc.dbtindia.gov.in/genomeindia/" target="_blank">GenomeIndia
+project</a> is a national initiative coordinating academic and medical institutions
+across India to characterize the genetic diversity of the Indian subcontinent. The
+release used by this track comprises whole-genome sequencing of 9,768 healthy adults
+sampled from 83 anthropologically defined endogamous populations spanning India's
+ethnolinguistic and biogeographic spectrum (Indo-European, Dravidian, Austroasiatic,
+and Tibeto-Burman language families, plus a continentally admixed outgroup). After
+joint genotyping and quality filtering, 129,938,889 high-confidence biallelic
+variants (~121M SNVs and ~8M indels) were reported, of which roughly one third are
+absent from gnomAD, 1000 Genomes, and GenomeAsia. This track shows the alternate
+allele frequency in that 9,768-sample autosomal call set.
+</p>
+<p>
+Because Indian populations are profoundly underrepresented in global variant
+databases, many globally rare alleles reach much higher frequencies in specific
+endogamous groups. The release ships only the cohort-wide alternate allele
+frequency (no per-population breakdown), so this track shows the overall
+GenomeIndia AF; AC is derived from AF (see Methods).
+</p>
+
+<h2>Display Conventions</h2>
+<p>
+Variants are shown as a VCF dense track. Each row reports the genomic position,
+ref/alt alleles, the GenomeIndia alternate allele frequency, and a synthesized
+allele count. The track only includes autosomal variants (chr1&ndash;chr22); chrX,
+chrY, and chrM are not in the current release.
+</p>
+
+<h2>Data Access</h2>
+<p>
+The data can be explored interactively with the
+<a href="../cgi-bin/hgTables">Table Browser</a> or the
+<a href="../cgi-bin/hgIntegrator">Data Integrator</a>.
+For programmatic access, our <a href="https://api.genome.ucsc.edu" target="_blank">REST API</a>
+can be used; the track name is <em>genomeindia</em>.
+For bulk download, the VCF file can be obtained from
+<a href="http://hgdownload.soe.ucsc.edu/gbdb/hg38/varFreqs/_genomeindia/"
+target="_blank">our download server</a>.
+</p>
+<p>
+The original per-chromosome TSV summary statistics can be downloaded directly from
+the GenomeIndia Data Centre at <a
+href="https://ibdc.dbtindia.gov.in/genomeindia/" target="_blank">ibdc.dbtindia.gov.in</a>
+(the <tt>9768GI_SummaryStats.tar.gz</tt> bundle). Use of the data is subject to
+the GenomeIndia data-access policy listed on that page.
+</p>
+
+<h2>Methods</h2>
+<p>
+PCR-free whole-genome sequencing libraries were prepared from blood-derived DNA and
+sequenced on Illumina NovaSeq 6000 to a per-sample average depth of at least 23&times;.
+Reads were processed with the Illumina DRAGEN v4.0.3 germline pipeline against
+GRCh38, producing per-sample gVCFs that were then joint-genotyped with the Illumina
+gVCF genotyper. Site-level filters retained only PASS variants with
+QUAL&nbsp;&ge;&nbsp;30, posterior genotype probability &ge;&nbsp;99.9%, GQ&nbsp;&gt;&nbsp;20
+at every site (GQ&nbsp;&gt;&nbsp;40 for singletons and doubletons), heterozygous allele
+balance &ge;&nbsp;0.2, call rate &ge;&nbsp;98%, and Hardy&ndash;Weinberg equilibrium
+p&nbsp;&gt;&nbsp;1&times;10<sup>-11</sup>; sites with an inbreeding coefficient of 1
+were also excluded as technical artefacts. Variants were annotated for protein
+impact with Ensembl VEP v113 plus LOFTEE; details are in the published methods
+(Bhattacharyya et al. 2025, see References).
+</p>
+<p>
+The release was downloaded from
+<a href="https://ibdc.dbtindia.gov.in/genomeindia/downloadfile?path=9768GI_SummaryStats.tar.gz"
+target="_blank">ibdc.dbtindia.gov.in</a> as <tt>9768GI_SummaryStats.tar.gz</tt>, which
+contains 22 per-chromosome TSV files of CHROM, POS, ID, REF, ALT, AF (no header).
+The TSV files were converted to a single sorted, bgzipped, tabix-indexed VCF by the
+script <a
+href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/scripts/varFreqs/genomeindiaToVcf.py"
+target="_blank">genomeindiaToVcf.py</a>. The release ships only AF; AC and AN are
+synthesized as AN&nbsp;=&nbsp;2&nbsp;&times;&nbsp;9768&nbsp;=&nbsp;19536 and
+AC&nbsp;=&nbsp;round(AF&nbsp;&times;&nbsp;AN). Because variants were retained only
+when called in &ge;98% of samples, AN slightly overstates the true called allele
+count for some sites (worst case ~2%); the AC field should therefore be read as a
+close approximation rather than the exact observed count. The exact processing
+steps are documented in the <a
+href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
+target="_blank">makeDoc file</a>.
+</p>
+
+<h2>Credits</h2>
+<p>
+We thank the GenomeIndia consortium for making the 9,768-sample summary statistics
+publicly available. The track was built at UCSC by Max Haeussler.
+</p>
+
+<h2>References</h2>
+<p>
+Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi
+KV, Maitra A, Nagabandi T <em>et al</em>.
+<a href="https://doi.org/10.1038/s41588-025-02153-x" target="_blank">
+Mapping genetic diversity with the GenomeIndia project</a>.
+<em>Nat Genet</em>. 2025 Apr;57(4):767-773.
+PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/40200122" target="_blank">40200122</a>
+</p>
+
+<p>
+Subramanian K, Bhattacharyya C, Machha P, Mukherjee A, Tripathi D, Chakraborty S, Majumdar SS,
+Sengupta S, Singh P, More V <em>et al</em>; GenomeIndia Consortium.
+<a href="https://doi.org/10.64898/2026.03.20.26348801" target="_blank">
+An Atlas of Indian Genetic Diversity</a>.
+<em>medRxiv</em>. 2026 Mar 20;2026.03.20.26348801 (preprint).
+</p>
+
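Reviewer note: the AC/AN synthesis described in the commit message and the Methods text can be illustrated with a minimal Python sketch. This is not the contents of genomeindiaToVcf.py. The function names, the `.` QUAL value, the `PASS` filter value, and the INFO field order are assumptions made for illustration. Only the six-column headerless TSV layout (CHROM, POS, ID, REF, ALT, AF), AN = 2 × 9768, and AC = round(AF × AN) come from the text above.

```python
# Illustrative sketch only; names and output layout are hypothetical,
# not taken from genomeindiaToVcf.py.

N_SAMPLES = 9768
AN = 2 * N_SAMPLES  # 19536 alleles if every diploid sample were called


def synthesize_ac(af, an=AN):
    """Approximate the allele count from the allele frequency.

    Python's round() rounds exact halves to the nearest even integer;
    this only matters if AF * AN lands exactly on x.5.
    """
    return round(af * an)


def tsv_row_to_vcf_line(row):
    """Convert one headerless TSV row (CHROM, POS, ID, REF, ALT, AF)
    into a tab-separated VCF data line with synthesized AC and AN."""
    chrom, pos, var_id, ref, alt, af = row.rstrip("\n").split("\t")
    ac = synthesize_ac(float(af))
    # QUAL "." and FILTER "PASS" are assumptions for this sketch.
    info = f"AC={ac};AN={AN};AF={af}"
    return "\t".join([chrom, pos, var_id, ref, alt, ".", "PASS", info])
```

Because sites were kept at call rates as low as 98%, the true allele number at a site may be as low as roughly 0.98 × 19536, so an AC computed this way can exceed the observed count by up to about 2%, as the Methods text already notes.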