6bb46ba4e8d91ab3670d354ef18d8bf5321ec9ee max Thu Mar 12 07:15:29 2026 -0700
Add WebSTR short tandem repeat track under new strVar supertrack

New track with 1.7M STR loci from WebSTR EnsembleTR panel (hg38), with allele
frequency data for 5 populations from 1000 Genomes (3,550 samples). Includes
conversion script, .as schema, trackDb, and full HTML documentation, refs #36652

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

diff --git src/hg/makeDb/trackDb/human/hg38/webstr.html src/hg/makeDb/trackDb/human/hg38/webstr.html
new file mode 100644
index 00000000000..35d8cae51b7
--- /dev/null
+++ src/hg/makeDb/trackDb/human/hg38/webstr.html
@@ -0,0 +1,115 @@
+<h2>Description</h2>
+<p>
+The <b>WebSTR</b> track displays 1,710,833 short tandem repeat (STR) loci across the
+human genome from the
+<a href="https://webstr.ucsd.edu" target="_blank">WebSTR</a> database. STRs (also known
+as microsatellites) are consecutive repetitions of 1–6 nucleotide motifs that are
+highly polymorphic due to repeat unit insertions and deletions caused primarily by
+polymerase slippage during replication. Genetic variation at STRs has been shown to
+influence gene expression, cancer risk, and neurodevelopmental traits.</p>
+
+<p>
+This track is based on the <b>EnsembleTR panel</b> for the GRCh38/hg38 assembly,
+which represents a combined set of tandem repeats genotyped by four separate methods
+(HipSTR, GangSTR, ExpansionHunter, and AdVNTR) on data from the
+<a href="https://www.internationalgenome.org/" target="_blank">1000 Genomes Project</a>
+and <a href="https://h3africa.org/" target="_blank">H3Africa</a>.
+<a href="https://github.com/gymrek-lab/EnsembleTR" target="_blank">EnsembleTR</a>
+was applied to jointly genotype all 3,550 samples, producing consensus calls at
+over 1.7 million autosomal tandem repeat loci.</p>
+
+<p>
+The track includes allele frequency distributions for five 1000 Genomes continental
+populations:</p>
+<ul>
+<li>AFR – African (893 samples)</li>
+<li>AMR – Admixed American (490 samples)</li>
+<li>EAS – East Asian (585 samples)</li>
+<li>EUR – European (633 samples)</li>
+<li>SAS – South Asian (601 samples)</li>
+</ul>
+
+<p>
+For each population, allele frequencies are defined as the number of copies of each allele
+divided by the total number of alleles in that population. Alleles are represented as
+the number of repeat unit copies.</p>
+
+<h2>Display Conventions</h2>
+<p>
+Items are colored by the length of the repeat motif (period):</p>
+<ul>
+<li><span style="color: #FF0000;">Red</span> – mononucleotide (period 1)</li>
+<li><span style="color: #0000FF;">Blue</span> – dinucleotide (period 2)</li>
+<li><span style="color: #008000;">Green</span> – trinucleotide (period 3)</li>
+<li><span style="color: #FFA500;">Orange</span> – tetranucleotide (period 4)</li>
+<li><span style="color: #800080;">Purple</span> – pentanucleotide (period 5)</li>
+<li><span style="color: #4682B4;">Steel blue</span> – hexanucleotide (period 6)</li>
+<li><span style="color: #808080;">Gray</span> – longer motifs (period >6)</li>
+</ul>
+
+<p>
+Each item is labeled by its WebSTR repeat ID. Hovering over an item shows the repeat
+motif, number of reference copies, and motif period. Clicking an item links to the
+corresponding
+<a href="https://webstr.ucsd.edu" target="_blank">WebSTR</a> locus page, which provides
+interactive allele frequency histograms and additional annotations.</p>
+
+<h2>Methods</h2>
+<p>
+The EnsembleTR reference panel was constructed as follows:</p>
+<ol>
+<li>Tandem repeat reference sets from four genotyping tools (HipSTR, GangSTR,
+ExpansionHunter, and AdVNTR) were merged.</li>
+<li>Each tool was run independently on 1000 Genomes and H3Africa whole-genome
+sequencing data.</li>
+<li><a href="https://github.com/gymrek-lab/EnsembleTR" target="_blank">EnsembleTR</a>
+was used to produce joint consensus genotype calls across all four methods.</li>
+<li>Loci called in fewer than 75% of samples were removed, yielding 1,710,833 loci.</li>
+<li>Allele frequencies were computed per population.</li>
+</ol>
+
+<p>
+For the UCSC Genome Browser track, the source data were converted from CSV to bigBed
+format. The 1-based start coordinates from the WebSTR database were converted to 0-based
+half-open coordinates for the BED format. Per-population allele frequency distributions
+are stored as extra bigBed fields.</p>
+
+<h2>Data Access</h2>
+<p>
+The raw data can be explored interactively with the
+<a href="../cgi-bin/hgTables" target="_blank">Table Browser</a> or the
+<a href="../cgi-bin/hgIntegrator" target="_blank">Data Integrator</a>. For automated
+analysis, the data may be queried from our
+<a href="/goldenPath/help/api.html" target="_blank">REST API</a>. The underlying bigBed
+file can be downloaded from our
+<a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/webstr/" target="_blank">download
+server</a>.</p>
+
+<p>
+The complete WebSTR dataset, including additional cohorts and data types not included in
+this track, is available from the
+<a href="https://webstr.ucsd.edu" target="_blank">WebSTR web portal</a>. Programmatic
+access to the full WebSTR database is available through the
+<a href="http://webstr-api.ucsd.edu/docs" target="_blank">WebSTR REST API</a>.</p>
+
+<h2>Credits</h2>
+<p>
+Thanks to Melissa Gymrek (UC San Diego), Oxana Sachenkova Lundström
+(Stockholm University / ZHAW), and the WebSTR team for providing the data for this track.</p>
+
+<h2>References</h2>
+<p>
+Sachenkova Lundström O, Adriaan Verbiest M, Xia F, Jam HZ, Zlobec I,
+Anisimova M, Gymrek M.
+<a href="https://doi.org/10.1016/j.jmb.2023.168260" target="_blank">
+WebSTR: A Population-wide Database of Short Tandem Repeat Variation in Humans</a>.
+<em>J Mol Biol</em>. 2023 Oct 15;435(20):168260.
+PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/37678708" target="_blank">37678708</a>
+</p>
+
+<p>
+Jam HZ, Revoir P, Gadgil R, Sun Y, Gymrek M.
+<a href="https://doi.org/10.1038/s41587-023-02057-3" target="_blank">
+EnsembleTR: a tool for combining tandem repeat genotyping results</a>.
+<em>Nat Biotechnol</em>. 2024.
+</p>
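
Reviewer note: the conversion script mentioned in the commit message is not part of this diff, so the three rules the new page documents (1-based to 0-based half-open coordinates, color by motif period, per-population allele frequency) are sketched below for reference. This is a minimal illustration, not the committed script. All function and constant names are hypothetical, and it assumes the WebSTR end coordinate is 1-based inclusive, which the page does not state.

```python
# Hypothetical sketch; names and the inclusive-end assumption are not from the commit.

# itemRgb strings matching the hex colors listed under "Display Conventions".
PERIOD_COLORS = {
    1: "255,0,0",     # #FF0000 red, mononucleotide
    2: "0,0,255",     # #0000FF blue, dinucleotide
    3: "0,128,0",     # #008000 green, trinucleotide
    4: "255,165,0",   # #FFA500 orange, tetranucleotide
    5: "128,0,128",   # #800080 purple, pentanucleotide
    6: "70,130,180",  # #4682B4 steel blue, hexanucleotide
}
LONG_MOTIF_COLOR = "128,128,128"  # #808080 gray, used for any other period


def to_bed_interval(start_1based, end_1based):
    """Convert a 1-based inclusive interval to BED's 0-based half-open form.

    Assumes the end is inclusive, so only the start shifts by one.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    return start_1based - 1, end_1based


def color_for_period(period):
    """Return the itemRgb string for a motif period, gray if not 1 to 6."""
    return PERIOD_COLORS.get(period, LONG_MOTIF_COLOR)


def allele_frequencies(allele_counts):
    """Map each allele (repeat copy number) to count / total alleles."""
    total = sum(allele_counts.values())
    if total == 0:
        return {}
    return {allele: count / total for allele, count in allele_counts.items()}
```

For example, under these assumptions a locus at 1-based positions 101 to 112 becomes the BED interval 100 to 112, and a population with three copies of the 10-repeat allele and one copy of the 11-repeat allele gets frequencies 0.75 and 0.25.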