0a5389e8b81efa99e932f5e9e2424b27ce1af095 max Sun May 17 14:53:57 2026 -0700
varFreqs: add GenomeIndia subtrack (9,768 WGS, 83 endogamous Indian populations, Bhattacharyya 2025) and wire into databases.tsv for the next varFreqsAll rebuild. TSV->VCF conversion synthesizes AC=round(AF*AN) with AN=19536 since the upstream release only ships AF, refs #36642

Co-Authored-By: Claude Opus 4.7 (1M context)

diff --git src/hg/makeDb/trackDb/human/genomeindia.html src/hg/makeDb/trackDb/human/genomeindia.html
new file mode 100644
index 00000000000..ad13b6c2855
--- /dev/null
+++ src/hg/makeDb/trackDb/human/genomeindia.html
@@ -0,0 +1,107 @@
+Description

+

+The GenomeIndia
+project is a national initiative coordinating academic and medical institutions
+across India to characterize the genetic diversity of the Indian subcontinent. The
+release used by this track comprises whole-genome sequencing of 9,768 healthy adults
+sampled from 83 anthropologically defined endogamous populations spanning India's
+ethnolinguistic and biogeographic spectrum (Indo-European, Dravidian, Austroasiatic,
+and Tibeto-Burman language families, plus a continentally admixed outgroup). After
+joint genotyping and quality filtering, 129,938,889 high-confidence biallelic
+variants (~121M SNVs and ~8M indels) were reported, of which roughly one third are
+absent from gnomAD, 1000 Genomes, and GenomeAsia. This track shows the alternate
+allele frequency in that 9,768-sample autosomal call set.

+

+Indian populations are profoundly underrepresented in global variant
+databases, and many globally rare alleles reach much higher frequencies in specific
+endogamous groups. The release ships only the cohort-wide alternate allele
+frequency (no per-population breakdown), so this track shows the overall
+GenomeIndia AF; AC is derived from AF (see Methods).

+

+Display Conventions

+

+Variants are shown as a VCF dense track. Each row reports the genomic position,
+ref/alt alleles, the GenomeIndia alternate allele frequency, and a synthesized
+allele count. The track only includes autosomal variants (chr1–chr22); chrX,
+chrY, and chrM are not in the current release.

+

+Data Access

+

+The data can be explored interactively with the
+Table Browser or the
+Data Integrator.
+For programmatic access, our REST API
+can be used; the track name is genomeindia.
+For bulk download, the VCF file can be obtained from
+our download server.
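As a rough illustration of the programmatic route, the sketch below only builds a request URL for the UCSC REST API's /getData/track endpoint. The track name genomeindia comes from the text above; the assembly name hg38 (chosen because the calls are on GRCh38), the example region, and the helper name track_url are assumptions made for this example.

```python
from urllib.parse import urlencode

# Public UCSC REST API host.
API_BASE = "https://api.genome.ucsc.edu"


def track_url(chrom, start, end, genome="hg38", track="genomeindia"):
    """Build a /getData/track request URL for one genomic region.

    genome="hg38" is an assumption of this sketch; the track name is the
    one given in the track description. start/end are passed through as-is.
    """
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}/getData/track?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical region, chosen only to show the URL shape.
    print(track_url("chr1", 1000000, 1010000))
```

The URL can then be fetched with any HTTP client. The sketch deliberately stops before the request, so it makes no claim about the shape of the response.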

+

+The original per-chromosome TSV summary statistics can be downloaded directly from
+the GenomeIndia Data Centre at ibdc.dbtindia.gov.in
+(the 9768GI_SummaryStats.tar.gz bundle). Use of the data is subject to
+the GenomeIndia data-access policy listed on that page.

+

+Methods

+

+PCR-free whole-genome sequencing libraries were prepared from blood-derived DNA and
+sequenced on Illumina NovaSeq 6000 to a per-sample average depth of at least 23×.
+Reads were processed with the Illumina DRAGEN v4.0.3 germline pipeline against
+GRCh38, producing per-sample gVCFs that were then joint-genotyped with the Illumina
+gVCF genotyper. Site-level filters retained only PASS variants with
+QUAL ≥ 30, posterior genotype probability ≥ 99.9%, GQ > 20
+at every site (GQ > 40 for singletons and doubletons), heterozygous allele
+balance ≥ 0.2, call rate ≥ 98%, and Hardy–Weinberg equilibrium
+p > 1×10⁻¹¹; sites with an inbreeding coefficient of 1
+were also excluded as technical artefacts. Variants were annotated for protein
+impact with Ensembl VEP v113 plus LOFTEE; details are in the published methods
+(Bhattacharyya et al. 2025, see References).
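The site-level thresholds listed above can be read as a single predicate. The sketch below is only an illustration of that reading, not the consortium's pipeline: the SiteStats record and all of its field names are hypothetical, and it assumes each criterion has already been summarized to one value per site.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SiteStats:
    """Hypothetical per-site summary; field names are illustrative only."""
    filter_pass: bool              # FILTER column is PASS
    qual: float                    # site QUAL
    min_genotype_posterior: float  # lowest posterior genotype probability
    min_gq: float                  # lowest genotype quality at the site
    alt_allele_count: int          # 1 = singleton, 2 = doubleton
    min_het_allele_balance: float  # lowest allele balance among heterozygotes
    call_rate: float               # fraction of samples with a call
    hwe_p: float                   # Hardy-Weinberg equilibrium p-value
    inbreeding_coeff: float        # site inbreeding coefficient


def passes_site_filters(s: SiteStats) -> bool:
    """Apply the thresholds quoted in the Methods text to one site."""
    # Singletons and doubletons need GQ > 40, all other sites GQ > 20.
    gq_floor = 40 if s.alt_allele_count <= 2 else 20
    return (
        s.filter_pass
        and s.qual >= 30
        and s.min_genotype_posterior >= 0.999
        and s.min_gq > gq_floor
        and s.min_het_allele_balance >= 0.2
        and s.call_rate >= 0.98
        and s.hwe_p > 1e-11
        and s.inbreeding_coeff != 1.0
    )


# Made-up values, chosen so that every threshold is met.
example = SiteStats(
    filter_pass=True, qual=55.0, min_genotype_posterior=0.9995, min_gq=35,
    alt_allele_count=120, min_het_allele_balance=0.31, call_rate=0.991,
    hwe_p=0.2, inbreeding_coeff=0.01,
)
```

With these made-up values, lowering call_rate to 0.97 or setting alt_allele_count to 1 (which raises the GQ floor to 40) makes the site fail.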

+

+The release was downloaded from
+ibdc.dbtindia.gov.in as 9768GI_SummaryStats.tar.gz, which
+contains 22 per-chromosome TSV files of CHROM, POS, ID, REF, ALT, AF (no header).
+The TSV files were converted to a single sorted, bgzipped, tabix-indexed VCF by the
+script genomeindiaToVcf.py. The release ships only AF; AC and AN are
+synthesized as AN = 2 × 9768 = 19536 and
+AC = round(AF × AN). Because variants were retained only
+when called in ≥98% of samples, AN slightly overstates the true called allele
+count for some sites (worst case ~2%); the AC field should therefore be read as a
+close approximation rather than the exact observed count. The exact processing
+steps are documented in the makeDoc file.
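The AC/AN synthesis described above can be sketched as follows. This is not the genomeindiaToVcf.py script itself, only a minimal illustration under stated assumptions: the function name tsv_line_to_vcf is hypothetical, QUAL and FILTER are written as missing ("."), the input is assumed to be tab-separated in the column order given above, and the example variant is invented.

```python
N_SAMPLES = 9768
AN = 2 * N_SAMPLES  # 19536 alleles, assuming every sample is called (diploid autosomes)


def tsv_line_to_vcf(line: str) -> str:
    """Convert one 'CHROM POS ID REF ALT AF' TSV row into a VCF data line.

    AC is synthesized as round(AF * AN). Python's round() rounds exact
    halves to the nearest even integer, which may differ from other tools.
    """
    chrom, pos, var_id, ref, alt, af = line.rstrip("\n").split("\t")
    ac = round(float(af) * AN)
    info = f"AC={ac};AN={AN};AF={af}"
    # QUAL and FILTER are left as '.' because the summary TSV carries neither.
    return "\t".join([chrom, pos, var_id, ref, alt, ".", ".", info])


if __name__ == "__main__":
    # Invented example row, not taken from the release.
    print(tsv_line_to_vcf("chr1\t10000\trs1\tA\tG\t0.0001"))
```

For the invented row, AF × AN = 0.0001 × 19536 = 1.9536, so AC is written as 2. At the 98% call-rate floor the true called allele number could be as low as about 0.98 × 19536 ≈ 19,145, which is where the ~2% worst case comes from.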

+

+Credits

+

+We thank the GenomeIndia consortium for making the 9,768-sample summary statistics
+publicly available. The track was built at UCSC by Max Haeussler.

+

+References

+

+Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi
+KV, Maitra A, Nagabandi T et al.
+Mapping genetic diversity with the GenomeIndia project.
+Nat Genet. 2025 Apr;57(4):767-773.
+PMID: 40200122

+ +

+Subramanian K, Bhattacharyya C, Machha P, Mukherjee A, Tripathi D, Chakraborty S, Majumdar SS,
+Sengupta S, Singh P, More V et al; GenomeIndia Consortium.
+An Atlas of Indian Genetic Diversity.
+medRxiv. 2026 Mar 20;2026.03.20.26348801 (preprint).

+