9bfd58221b1539193cb7f0a317b4e959c1c7e49a
max
Thu May 21 01:00:45 2026 -0700
varFreqs: AI-generated text reads poorly, so remove typical AI phrasing. "Humanizer" pass on all 31 varFreqs description pages: cut em dashes, copula avoidance ("serves as", "stands as"), "-ing" puffery, and boilerplate filler ("We provide documentation that indicates how..."). Title-case headings and meaningful <b> emphasis preserved. No facts/URLs/counts/versions changed. tpmi.html added as a new file (was previously uncommitted). refs #36642
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git src/hg/makeDb/trackDb/human/genomeindia.html src/hg/makeDb/trackDb/human/genomeindia.html
index ad13b6c2855..d54a061ff4c 100644
--- src/hg/makeDb/trackDb/human/genomeindia.html
+++ src/hg/makeDb/trackDb/human/genomeindia.html
@@ -1,107 +1,107 @@
<h2>Description</h2>
<p>
The <a href="https://ibdc.dbtindia.gov.in/genomeindia/" target="_blank">GenomeIndia
-project</a> is a national initiative coordinating academic and medical institutions
+project</a> is a national initiative that coordinates academic and medical institutions
across India to characterize the genetic diversity of the Indian subcontinent. The
-release used by this track comprises whole-genome sequencing of 9,768 healthy adults
-sampled from 83 anthropologically defined endogamous populations spanning India's
-ethnolinguistic and biogeographic spectrum (Indo-European, Dravidian, Austroasiatic,
+release used by this track contains whole-genome sequencing data from 9,768 healthy adults
+sampled from 83 anthropologically defined endogamous populations across India's
+ethnolinguistic and biogeographic range (Indo-European, Dravidian, Austroasiatic,
and Tibeto-Burman language families, plus a continentally admixed outgroup). After
joint genotyping and quality filtering, 129,938,889 high-confidence biallelic
variants (~121M SNVs and ~8M indels) were reported, of which roughly one third are
absent from gnomAD, 1000 Genomes, and GenomeAsia. This track shows the alternate
allele frequency in that 9,768-sample autosomal call set.
</p>
<p>
-Because Indian populations are profoundly underrepresented in global variant
-databases, many globally rare alleles reach much higher frequencies in specific
+Indian populations are underrepresented in global variant
+databases, so many globally rare alleles reach much higher frequencies in specific
endogamous groups. The release ships only the cohort-wide alternate allele
frequency (no per-population breakdown), so this track shows the overall
GenomeIndia AF; AC is derived from AF (see Methods).
</p>
<h2>Display Conventions</h2>
<p>
Variants are shown as a VCF dense track. Each row reports the genomic position,
ref/alt alleles, the GenomeIndia alternate allele frequency, and a synthesized
allele count. The track only includes autosomal variants (chr1–chr22); chrX,
chrY, and chrM are not in the current release.
</p>
<h2>Data Access</h2>
<p>
The data can be explored interactively with the
<a href="../cgi-bin/hgTables">Table Browser</a> or the
<a href="../cgi-bin/hgIntegrator">Data Integrator</a>.
For programmatic access, our <a href="https://api.genome.ucsc.edu" target="_blank">REST API</a>
can be used; the track name is <em>genomeindia</em>.
For bulk download, the VCF file can be obtained from
<a href="http://hgdownload.soe.ucsc.edu/gbdb/hg38/varFreqs/_genomeindia/"
target="_blank">our download server</a>.
</p>
<p>
The original per-chromosome TSV summary statistics can be downloaded directly from
the GenomeIndia Data Centre at <a
href="https://ibdc.dbtindia.gov.in/genomeindia/" target="_blank">ibdc.dbtindia.gov.in</a>
(the <tt>9768GI_SummaryStats.tar.gz</tt> bundle). Use of the data is subject to
the GenomeIndia data-access policy listed on that page.
</p>
<h2>Methods</h2>
<p>
PCR-free whole-genome sequencing libraries were prepared from blood-derived DNA and
sequenced on Illumina NovaSeq 6000 to a per-sample average depth of at least 23×.
Reads were processed with the Illumina DRAGEN v4.0.3 germline pipeline against
-GRCh38, producing per-sample gVCFs that were then joint-genotyped with the Illumina
+GRCh38. The resulting per-sample gVCFs were then joint-genotyped with the Illumina
gVCF genotyper. Site-level filters retained only PASS variants with
QUAL ≥ 30, posterior genotype probability ≥ 99.9%, GQ > 20
at every site (GQ > 40 for singletons and doubletons), heterozygous allele
balance ≥ 0.2, call rate ≥ 98%, and Hardy–Weinberg equilibrium
p > 1×10<sup>-11</sup>; sites with an inbreeding coefficient of 1
were also excluded as technical artefacts. Variants were annotated for protein
impact with Ensembl VEP v113 plus LOFTEE; details are in the published methods
(Bhattacharyya et al. 2025, see References).
</p>
<p>
The release was downloaded from
<a href="https://ibdc.dbtindia.gov.in/genomeindia/downloadfile?path=9768GI_SummaryStats.tar.gz"
target="_blank">ibdc.dbtindia.gov.in</a> as <tt>9768GI_SummaryStats.tar.gz</tt>, which
contains 22 per-chromosome TSV files of CHROM, POS, ID, REF, ALT, AF (no header).
The TSV files were converted to a single sorted, bgzipped, tabix-indexed VCF by the
script <a
href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/scripts/varFreqs/genomeindiaToVcf.py"
target="_blank">genomeindiaToVcf.py</a>. The release ships only AF; AC and AN are
synthesized as AN = 2 × 9768 = 19536 and
-AC = round(AF × AN). Because variants were retained only
-when called in ≥98% of samples, AN slightly overstates the true called allele
-count for some sites (worst case ~2%); the AC field should therefore be read as a
-close approximation rather than the exact observed count. The exact processing
+AC = round(AF × AN). Variants were kept only
+when called in ≥98% of samples, so AN slightly overstates the true called allele
+count for some sites (worst case ~2%); the AC field is a
+close approximation, not the exact observed count. The processing
steps are documented in the <a
href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt"
target="_blank">makeDoc file</a>.
</p>
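<p>
The AC/AN synthesis described above can be sketched as follows. This is a minimal
illustration, not the actual <tt>genomeindiaToVcf.py</tt>: the function names, the
placeholder QUAL/FILTER values, and the INFO field order are assumptions made for
the example. Only the six TSV columns, the sample count, and the two formulas come
from the text above.
</p>

```python
# Hypothetical sketch of the AC/AN synthesis; the real script may differ.

N_SAMPLES = 9768          # cohort size stated in the release
AN = 2 * N_SAMPLES        # diploid autosomes: 2 alleles per sample = 19536


def synth_ac(af: float, an: int = AN) -> int:
    """Approximate allele count from allele frequency: AC = round(AF * AN).

    Python's built-in round() rounds exact halves to the nearest even
    integer; whether the real script uses the same tie rule is an assumption.
    """
    return round(af * an)


def tsv_row_to_vcf_line(row: str) -> str:
    """Convert one headerless TSV row (CHROM, POS, ID, REF, ALT, AF)
    into a tab-separated VCF data line.

    QUAL and FILTER are written as '.' here; that choice is an assumption.
    """
    chrom, pos, var_id, ref, alt, af = row.rstrip("\n").split("\t")
    ac = synth_ac(float(af))
    # Keep the AF string exactly as shipped so no precision is lost.
    info = f"AF={af};AC={ac};AN={AN}"
    return "\t".join([chrom, pos, var_id, ref, alt, ".", ".", info])
```

<p>
For example, a row with AF = 0.25 yields AC = 4884 under these assumptions. Because
AN is fixed at 19536 while up to 2% of samples may be uncalled at a site, the true
AN can be as low as roughly 19145, which is why the synthesized AC is only
approximate.
</p>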
<h2>Credits</h2>
<p>
We thank the GenomeIndia consortium for making the 9,768-sample summary statistics
publicly available. The track was built at UCSC by Max Haeussler.
</p>
<h2>References</h2>
<p>
Bhattacharyya C, Subramanian K, Uppili B, Biswas NK, Ramdas S, Tallapaka KB, Arvind P, Rupanagudi
KV, Maitra A, Nagabandi T <em>et al</em>.
<a href="https://doi.org/10.1038/s41588-025-02153-x" target="_blank">
Mapping genetic diversity with the GenomeIndia project</a>.
<em>Nat Genet</em>. 2025 Apr;57(4):767-773.
PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/40200122" target="_blank">40200122</a>
</p>
<p>
Subramanian K, Bhattacharyya C, Machha P, Mukherjee A, Tripathi D, Chakraborty S, Majumdar SS,
Sengupta S, Singh P, More V <em>et al</em>; GenomeIndia Consortium.
<a href="https://doi.org/10.64898/2026.03.20.26348801" target="_blank">
An Atlas of Indian Genetic Diversity</a>.
<em>medRxiv</em>. 2026 Mar 20;2026.03.20.26348801 (preprint).
</p>