--------------------------------------------------------------- hg38.trackDb.html : Differences exist between hgwbeta and hgw2 (RR fields taken from public MySql server, not individual machine) 97,161d96 < abraom | html < abraom |

Description

< abraom |

< abraom | The Arquivo Brasileiro Online de < abraom | Mutações (ABraOM) provides genomic variants obtained with whole-genome sequencing < abraom | from SABE, a census-based sample of elderly individuals from São Paulo, Brazil's largest < abraom | city. The Brazilian population is constituted by ~500 years of admixture between Africans, < abraom | Europeans, and Native Americans. Additionally, the cohort presents ~3% of individuals with < abraom | non-admixed Japanese ancestry (early 20th century migration). Coverage 38.6x. TEs, HLAs and < abraom | new sequence are also available. < abraom |

< abraom | < abraom |

Data Access

< abraom |

< abraom | The data can be explored interactively with the < abraom | Table Browser or the < abraom | Data Integrator. < abraom | For programmatic access, our REST API can be used; the < abraom | track name is abraom. < abraom | For bulk download, the VCF file can be obtained from < abraom | our download server. < abraom |
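As a minimal sketch of the programmatic access mentioned above, the request URL for the UCSC REST API `/getData/track` endpoint could be assembled as follows. The helper name `track_url` and the example region are illustrative assumptions, and whether a given track is served on a given machine is not verified here.

```python
from urllib.parse import urlencode

# Public UCSC REST API host; /getData/track returns track items for a region.
API_BASE = "https://api.genome.ucsc.edu"


def track_url(track, chrom, start, end, genome="hg38"):
    """Build a /getData/track request URL (hypothetical helper, 0-based start)."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return f"{API_BASE}/getData/track?{urlencode(params)}"


if __name__ == "__main__":
    # The region below is an arbitrary example, not taken from the track docs.
    print(track_url("abraom", "chr1", 1000, 2000))
```

The URL can then be fetched with any HTTP client; the response is JSON.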

< abraom |

< abraom | The original data can also be downloaded from the ABraOM website. < abraom |

< abraom | < abraom |

Methods

< abraom |

< abraom | For academic use only. Licensing for commercial use may be available upon request and agreement. < abraom | By using this resource you agree to cite the flagship paper (Naslavsky et al., Nat Commun 2022). < abraom |

< abraom |

< abraom | Whole-genome sequencing was performed at Human Longevity Inc. using TruSeq Nano DNA HT libraries < abraom | sequenced on Illumina HiSeqX instruments with 150 bp paired-end reads targeting 30x coverage, and < abraom | reads were mapped to GRCh38 using ISIS software. Sample sex was validated by comparing CPMs of X < abraom | chromosome and male-specific Y (MSY) reads relative to autosomes, yielding the expected female < abraom | (~55,000 X CPM, <200 MSY CPM) and male (~27,500 X CPM, >550 MSY CPM) patterns. Germline SNVs < abraom | and indels were called following GATK Best Practices (GATK v3.7) via per-sample GVCFs < abraom | (HaplotypeCaller), joint genotyping (CombineGVCFs, GenotypeGVCFs), and Variant Quality Score < abraom | Recalibration (VQSR-AS); multiallelic variants were split with an in-house script, left-aligned with < abraom | BCFtools, and annotated using Annovar and custom scripts against dbSNP, 1000 Genomes, and gnomAD, < abraom | with putative loss-of-function variants identified using LOFTEE v0.3-beta irrespective of confidence < abraom | labels. Variant and genotype quality was further assessed using the in-house CEGH-Filter two-step < abraom | algorithm based on depth and allele balance, and analyses retained only GATK VQSR-AS PASS variants < abraom | and higher-confidence CEGH-Filter calls. Relatedness was assessed using KING and PC-Relate < abraom | (GENESIS), retaining a single proband per related pair and excluding one contaminated sample < abraom | (>3% by verifyBAMID), resulting in a final dataset of 1,171 unrelated individuals. Final samples < abraom | achieved mean coverages ranging from 31.3x to 64.8x, with an average of 38.65x and a median of < abraom | 36.6x. < abraom | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < abraom | For some tracks, python scripts were necessary and are also available from GitHub. < abraom |
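The sex check described above can be sketched roughly as below. This is an illustration only, not the authors' code: the normalization to autosomal reads and the use of the MSY cutoffs alone (<200 for female, >550 for male) are assumptions derived from the reported patterns, and both function names are hypothetical.

```python
def cpm(region_reads, autosomal_reads):
    """Counts per million relative to autosomal reads (assumed normalization)."""
    return region_reads / autosomal_reads * 1_000_000


def infer_sex(msy_cpm):
    """Classify a sample from its male-specific Y (MSY) CPM.

    Cutoffs follow the patterns reported in the text: females show
    <200 MSY CPM and males >550 MSY CPM. Values in between are left
    unresolved rather than guessed.
    """
    if msy_cpm < 200:
        return "female"
    if msy_cpm > 550:
        return "male"
    return "ambiguous"


if __name__ == "__main__":
    # Made-up read counts for illustration.
    print(infer_sex(cpm(150, 1_000_000)))
    print(infer_sex(cpm(600, 1_000_000)))
```

A fuller check would also compare the X CPM against the expected ~55,000 (female) and ~27,500 (male) values.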

< abraom | < abraom |

References

< abraom |

< abraom | Naslavsky MS, Scliar MO, Yamamoto GL, Wang JYT, Zverinova S, Karp T, Nunes K, Ceroni JRM, de < abraom | Carvalho DL, da Silva Simões CE et al. < abraom | < abraom | Whole-genome sequencing of 1,171 elderly admixed individuals from São Paulo, Brazil. < abraom | Nat Commun. 2022 Mar 4;13(1):1004. < abraom | PMID: 35246524; PMC: PMC8897431 < abraom |

< abraom | 1089,1134d1023 < alfaVcf | html < alfaVcf |

Description

< alfaVcf |

< alfaVcf | The NCBI ALlele Frequency < alfaVcf | Aggregator (ALFA) pipeline computes allele frequencies from approved, unrestricted dbGaP studies < alfaVcf | and makes them publicly available through dbSNP. Its goal is to release frequency data from over < alfaVcf | one million dbGaP subjects to aid discoveries involving common and rare variants with biological < alfaVcf | or disease relevance. The R4 release includes 408,709 subjects and allele frequencies for < alfaVcf | 15.5 million rs sites, including nearly one million ClinVar variants. < alfaVcf |

< alfaVcf | < alfaVcf |

Data Access

< alfaVcf |

< alfaVcf | The data can be explored interactively with the < alfaVcf | Table Browser or the < alfaVcf | Data Integrator. < alfaVcf | For programmatic access, our REST API can be used; the < alfaVcf | track name is alfaVcf. < alfaVcf | For bulk download, the VCF file can be obtained from < alfaVcf | our download server. < alfaVcf |

< alfaVcf |

< alfaVcf | We converted the NCBI track hub to VCF format; the data is freely available. < alfaVcf | Genotype and associated individual-level data are accessible through the dbGaP < alfaVcf | authorized access request system. < alfaVcf |

< alfaVcf | < alfaVcf |

Methods

< alfaVcf |

< alfaVcf | The ALFA pipeline processes genotype data from approved, unrestricted dbGaP studies, including < alfaVcf | chip array, exome, and genomic sequencing data. Selected study data undergoes quality assurance < alfaVcf | and transformation to standard VCF format. Variants are converted to SPDI notation and normalized < alfaVcf | using VOCA, then aggregated, remapped, and clustered to existing dbSNP rs identifiers or assigned < alfaVcf | new ones. Sample ancestries are validated using GRAF-pop and assigned to 12 major populations. < alfaVcf | QC exclusions include variants and subjects with call rate <95%, datasets failing Ancestry < alfaVcf | Informative Markers consistency checks, and array datasets with conflicting or flipped allele < alfaVcf | orientation. < alfaVcf |

< alfaVcf |

< alfaVcf | The ALFA R4 bigBed files (904M variants) were converted to VCF using a custom script, retaining < alfaVcf | the 163M variants with non-zero allele frequency (146M SNPs, 17M indels). < alfaVcf | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < alfaVcf | For some tracks, python scripts were necessary and are also available from GitHub. < alfaVcf |
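The non-zero frequency filter described above could look roughly like this. It is a sketch under stated assumptions, not the custom conversion script itself: the record layout with `ac` (alternate allele count) and `an` (total allele number) keys is hypothetical.

```python
def nonzero_af(records):
    """Yield only records whose alternate allele was observed at least once.

    Each record is assumed to be a dict with 'ac' (alternate allele count)
    and 'an' (total allele number); this layout is hypothetical.
    """
    for rec in records:
        if rec["an"] > 0 and rec["ac"] > 0:
            yield rec


if __name__ == "__main__":
    # Made-up records for illustration.
    demo = [
        {"id": "rs1", "ac": 0, "an": 100},
        {"id": "rs2", "ac": 3, "an": 100},
        {"id": "rs3", "ac": 0, "an": 0},
    ]
    for rec in nonzero_af(demo):
        print(rec["id"], rec["ac"] / rec["an"])
```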

< alfaVcf | 1141,1205d1029 < allofus | html < allofus |

Description

< allofus |

< allofus | The All of Us Research Program is a < allofus | large-scale biomedical research initiative launched by the U.S. National Institutes of Health (NIH) < allofus | in 2018. Its goal is to build one of the most diverse health databases, enrolling over one < allofus | million participants who reflect the full diversity of the United States, including groups that < allofus | have been historically underrepresented in biomedical research. Participants contribute health < allofus | surveys, electronic health records (EHR), physical measurements, and biosamples for genomic < allofus | analysis. < allofus |

< allofus | < allofus |

< allofus | This track shows allele frequencies from the v7 short-read whole-genome sequencing (srWGS) < allofus | release of 245,388 participants. A minimum allele count filter of ≥20 was applied. < allofus | Frequencies are provided both overall and broken down by genetic ancestry using local ancestry < allofus | inference: European (EUR), East Asian (EAS), African (AFR), Indigenous American (AMR), < allofus | Oceanian (OCE), and South Asian (SAS). Some variants are flagged with an "NW" tag < allofus | (not in window) when the variant was not within a genomic window covered by the ancestry < allofus | reference files; in these cases the closest available position was used for ancestry assignment. < allofus |

< allofus | < allofus |

Data Access

< allofus |

< allofus | The data can be explored interactively with the < allofus | Table Browser or the < allofus | Data Integrator. < allofus | For programmatic access, our REST API can be used; the < allofus | track name is allofus. < allofus | For bulk download, the VCF file can be obtained from < allofus | our download server. < allofus |

< allofus |

< allofus | Variant data and individual-level data are accessible through the < allofus | All of Us Researcher Workbench, < allofus | which requires registration and completion of a training program. Aggregate allele frequency < allofus | data is freely available. < allofus |

< allofus | < allofus |

Methods

< allofus |

< allofus | Whole-genome sequencing was performed on the Illumina NovaSeq 6000 platform with PCR-free library < allofus | preparation targeting 30x coverage. Reads were aligned to GRCh38 and variants were called using < allofus | the Illumina DRAGEN (Dynamic Read Analysis for GENomics) pipeline, which performs mapping, < allofus | alignment, sorting, duplicate marking, and variant calling (SNVs and indels) in a single < allofus | hardware-accelerated workflow. Joint genotyping was performed across all samples. Quality control < allofus | included sample-level filtering for contamination, sex discordance, and relatedness, and < allofus | variant-level filtering using VQSR. < allofus | Population-specific allele frequencies were determined using local ancestry inference at UCSC by the Ioannidis group. < allofus | The ancestry breakdown into European, East Asian, African, Indigenous American, Oceanian, < allofus | and South Asian components is part of a pending publication. < allofus |

< allofus |

< allofus | At UCSC, we provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < allofus | For some tracks, python scripts were necessary and are also available from GitHub. < allofus |

< allofus | < allofus |

Credits

< allofus |

< allofus | The All of Us Research Program is supported by the National Institutes of Health. We thank the < allofus | participants and the program for making frequency data available. < allofus | The local ancestry inference was performed by Qudsi Aljabiri and Cole Shanks under < allofus | Prof. Alexander Ioannidis, UC Santa Cruz. < allofus |

< allofus | 32604,32661d32427 < finngen | html < finngen |

Description

< finngen |

< finngen | FinnGen is a public-private partnership < finngen | that combines genotype data from Finnish biobanks with digital health record data from Finnish < finngen | health registries. The R12 release contains imputed variants from 500,348 biobank samples obtained < finngen | using genotyping arrays. The imputation used phased variants obtained from 8,554 high-quality < finngen | whole genome sequences, also from Finland. This represents approximately 10% of the Finnish < finngen | population. Phenotype links can be viewed at the < finngen | FinnGen PheWeb. < finngen |

< finngen | < finngen |

Data Access

< finngen |

< finngen | Due to license restrictions, the data for this track cannot be downloaded from the UCSC < finngen | Genome Browser. The Table Browser, Data Integrator, and download server are not available < finngen | for this track. < finngen |

< finngen |

< finngen | TSV data can be requested via the form at < finngen | FinnGen, < finngen | which triggers an automated email containing the download link. < finngen | A script in our GitHub repo converts this file to VCF (see Methods below). < finngen |

< finngen | < finngen |

Methods

< finngen |

< finngen | FinnGen participants were genotyped using a custom Axiom FinnGen1 array, supplemented by legacy < finngen | collections genotyped with other arrays. Imputation used a population-specific reference panel of < finngen | high-coverage (25–30x) whole-genome sequences from Finnish individuals. Ancestry outliers were < finngen | removed via PCA against 1000 Genomes reference samples, and 5,780 duplicates and monozygotic twins < finngen | were excluded. Variant quality was assessed using VQSR. < finngen |

< finngen |

< finngen | R12 annotated variants were downloaded from the Google Cloud bucket link received through an email < finngen | and converted to VCF with a < finngen | custom Python script. < finngen | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < finngen | For some tracks, python scripts were necessary and are also available from GitHub. < finngen |

< finngen | < finngen |

Credits

< finngen |

< finngen | We want to acknowledge the participants and investigators of the FinnGen study. < finngen |

< finngen | < finngen |

References

< finngen |

< finngen | Kurki MI, Karjalainen J, Palta P, Sipilä TP, Kristiansson K, Donner KM, Reeve MP, Laivuori H, < finngen | Aavikko M, Kaunisto MA et al. < finngen | < finngen | FinnGen provides genetic insights from a well-phenotyped isolated population. < finngen | Nature. 2023 Jan;613(7944):508-518. < finngen | PMID: 36653562; PMC: PMC9849126 < finngen |

< finngen | 32945,33048d32710 < gasp | html < gasp |

Description

< gasp |

< gasp | The GenomeAsia 100K project aims < gasp | to sequence 100,000 Asian individuals. This pilot release (GAsP) contains whole-genome sequencing < gasp | data of 1,739 individuals from 219 population groups across Asia. Frequencies are broken down by < gasp | Northeast Asian, Southeast Asian, and South Asian ancestry groups. The data is split into two < gasp | subtracks: substitutions and indels. < gasp |

< gasp | < gasp |

Data Access

< gasp |

< gasp | The data can be explored interactively with the < gasp | Table Browser or the < gasp | Data Integrator. < gasp | For programmatic access, our REST API can be used; the < gasp | track name is gasp. < gasp | For bulk download, the VCF file can be obtained from < gasp | our download server. < gasp |

< gasp |

< gasp | The original VCFs are also available from the < gasp | GenomeAsia 100K < gasp | website. Neither a license nor a login is required. < gasp |

< gasp | < gasp |

Methods

< gasp |

< gasp | Samples were sequenced on Illumina HiSeq 2500, HiSeq 4000, and HiSeq X Ten instruments with < gasp | 2×100 bp or 2×150 bp paired-end reads at an average depth of 36x. Reads were aligned to < gasp | GRCh37 using BWA-MEM. Duplicate reads were marked with SAMBLASTER and sorted with Sambamba. < gasp | Per-sample variant calling was performed with GATK HaplotypeCaller in GVCF mode, followed by < gasp | joint genotyping with GenotypeGVCFs. Variant quality score recalibration (VQSR) was applied at < gasp | a 99% sensitivity tranche for both SNPs and indels. Sample-level QC included contamination < gasp | checks with verifyBamID and sex concordance verification. The final callset contains < gasp | ∼65 million variants across 1,739 individuals from 219 populations. < gasp |

< gasp |

< gasp | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < gasp | For some tracks, python scripts were necessary and are also available from GitHub. < gasp |

< gasp | < gasp |

References

< gasp |

< gasp | GenomeAsia100K Consortium. < gasp | < gasp | The GenomeAsia 100K Project enables genetic discoveries across Asia. < gasp | Nature. 2019 Dec;576(7785):106-111. < gasp | PMID: 31802016; PMC: PMC7054211 < gasp |

< gasp | < gaspIndel | html < gaspIndel |

Description

< gaspIndel |

< gaspIndel | The GenomeAsia 100K project aims < gaspIndel | to sequence 100,000 Asian individuals. This pilot release (GAsP) contains whole-genome sequencing < gaspIndel | data of 1,739 individuals from 219 population groups across Asia. Frequencies are broken down by < gaspIndel | Northeast Asian, Southeast Asian, and South Asian ancestry groups. The data is split into two < gaspIndel | subtracks: substitutions and indels. < gaspIndel |

< gaspIndel | < gaspIndel |

Data Access

< gaspIndel |

< gaspIndel | The data can be explored interactively with the < gaspIndel | Table Browser or the < gaspIndel | Data Integrator. < gaspIndel | For programmatic access, our REST API can be used; the < gaspIndel | track name is gasp. < gaspIndel | For bulk download, the VCF file can be obtained from < gaspIndel | our download server. < gaspIndel |

< gaspIndel |

< gaspIndel | The original VCFs are also available from the < gaspIndel | GenomeAsia 100K < gaspIndel | website. Neither a license nor a login is required. < gaspIndel |

< gaspIndel | < gaspIndel |

Methods

< gaspIndel |

< gaspIndel | Samples were sequenced on Illumina HiSeq 2500, HiSeq 4000, and HiSeq X Ten instruments with < gaspIndel | 2×100 bp or 2×150 bp paired-end reads at an average depth of 36x. Reads were aligned to < gaspIndel | GRCh37 using BWA-MEM. Duplicate reads were marked with SAMBLASTER and sorted with Sambamba. < gaspIndel | Per-sample variant calling was performed with GATK HaplotypeCaller in GVCF mode, followed by < gaspIndel | joint genotyping with GenotypeGVCFs. Variant quality score recalibration (VQSR) was applied at < gaspIndel | a 99% sensitivity tranche for both SNPs and indels. Sample-level QC included contamination < gaspIndel | checks with verifyBamID and sex concordance verification. The final callset contains < gaspIndel | ∼65 million variants across 1,739 individuals from 219 populations. < gaspIndel |

< gaspIndel |

< gaspIndel | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < gaspIndel | For some tracks, python scripts were necessary and are also available from GitHub. < gaspIndel |

< gaspIndel | < gaspIndel |

References

< gaspIndel |

< gaspIndel | GenomeAsia100K Consortium. < gaspIndel | < gaspIndel | The GenomeAsia 100K Project enables genetic discoveries across Asia. < gaspIndel | Nature. 2019 Dec;576(7785):106-111. < gaspIndel | PMID: 31802016; PMC: PMC7054211 < gaspIndel |

< gaspIndel | 36155,36281d35816 < gnomadStr | html < gnomadStr |

Description

< gnomadStr |

< gnomadStr | The gnomAD STR track displays short tandem repeat (STR) genotypes at 87 < gnomadStr | disease-associated loci from the < gnomadStr | Genome Aggregation < gnomadStr | Database (gnomAD) v3.1.3. The data include individual-level STR genotypes from < gnomadStr | 18,511 whole-genome sequenced samples across 10 populations, aggregated < gnomadStr | into per-locus allele frequency distributions.

< gnomadStr | < gnomadStr |

< gnomadStr | These loci were selected because tandem repeat expansions at these sites have been < gnomadStr | reported to cause human genetic diseases, including Huntington disease (HTT), < gnomadStr | fragile X syndrome (FMR1), Friedreich ataxia (FXN), various < gnomadStr | spinocerebellar ataxias, myotonic dystrophies, and other neurological and < gnomadStr | neuromuscular disorders. Most loci (56) have motifs between 3–6 bp, while < gnomadStr | additional loci have longer motifs of 10–24 bp.

< gnomadStr | < gnomadStr |

< gnomadStr | The genotypes were generated using < gnomadStr | ExpansionHunter < gnomadStr | v5 on gnomAD v3.1 whole-genome sequencing data (150 bp read lengths). Of the < gnomadStr | samples, 64% were PCR-free, 13% PCR-plus, and 23% had unknown PCR protocol. < gnomadStr | ExpansionHunter was selected because it had the best accuracy among existing tools < gnomadStr | for detecting expansions at disease-associated loci. Results were generated without < gnomadStr | off-target regions to minimize overestimation of repeat sizes. < gnomadStr | For each locus, the data show the distribution of repeat allele sizes observed < gnomadStr | across the gnomAD population, providing a reference for normal and expanded allele < gnomadStr | ranges. For more details on the methods, see the < gnomadStr | gnomAD blog post on STR calls.

< gnomadStr | < gnomadStr |

Display Conventions

< gnomadStr |

< gnomadStr | Items are colored by the length of the repeat motif:

< gnomadStr | < gnomadStr | < gnomadStr |

< gnomadStr | Each item is labeled by the gene name. Hovering shows the repeat motif, < gnomadStr | gene, total sample count, and number passing quality filters. Clicking an item < gnomadStr | links to the corresponding gnomAD STR locus page with interactive allele < gnomadStr | frequency histograms and detailed population breakdowns.

< gnomadStr | < gnomadStr |

< gnomadStr | The detail page for each locus shows:

< gnomadStr | < gnomadStr | < gnomadStr |

Methods

< gnomadStr |

< gnomadStr | The gnomAD STR genotype data file < gnomadStr | (gnomAD_STR_genotypes__2025_03_17.tsv.gz) was downloaded from the < gnomadStr | gnomAD downloads page. This file contains individual-level < gnomadStr | STR genotypes at 87 disease-associated loci generated using < gnomadStr | ExpansionHunter < gnomadStr | on gnomAD v3.1.3 whole-genome sequencing data.

< gnomadStr | < gnomadStr |

< gnomadStr | For the UCSC Genome Browser track, the individual genotype records (~1.4 million rows) < gnomadStr | were aggregated per locus to produce summary statistics: total sample count, < gnomadStr | PASS-filter count, allele size frequency distributions, and per-population sample counts. < gnomadStr | Coordinates were used as provided (0-based). Some loci include genotypes for multiple < gnomadStr | motif patterns (e.g., complex repeat structures) and for adjacent repeats; these are < gnomadStr | represented as separate records.
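The per-locus aggregation described above can be sketched as follows. The field names (`locus`, `pop`, `filter`, `alleles`) are hypothetical stand-ins and do not reflect the actual column names of the gnomAD TSV; this is not the script used to build the track.

```python
from collections import Counter, defaultdict


def aggregate(rows):
    """Collapse individual genotype rows into per-locus summary statistics."""
    summary = defaultdict(
        lambda: {"samples": 0, "pass": 0, "alleles": Counter(), "pops": Counter()}
    )
    for row in rows:
        stats = summary[row["locus"]]
        stats["samples"] += 1                 # total sample count
        if row["filter"] == "PASS":
            stats["pass"] += 1                # PASS-filter count
        stats["alleles"].update(row["alleles"])  # allele size distribution
        stats["pops"][row["pop"]] += 1        # per-population sample count
    return dict(summary)


if __name__ == "__main__":
    # Made-up rows; 'alleles' holds the two repeat counts of one genotype.
    rows = [
        {"locus": "HTT", "pop": "nfe", "filter": "PASS", "alleles": (17, 19)},
        {"locus": "HTT", "pop": "afr", "filter": "LowDepth", "alleles": (17, 17)},
        {"locus": "FXN", "pop": "nfe", "filter": "PASS", "alleles": (9, 9)},
    ]
    print(aggregate(rows)["HTT"])
```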

< gnomadStr | < gnomadStr |

< gnomadStr | The 10 populations represented are: African/African American (afr), < gnomadStr | Admixed American/Latino (amr), Amish (ami), Ashkenazi Jewish (asj), < gnomadStr | East Asian (eas), Finnish (fin), Middle Eastern (mid), Non-Finnish European (nfe), < gnomadStr | South Asian (sas), and Other (oth).

< gnomadStr | < gnomadStr |

Data Access

< gnomadStr |

< gnomadStr | The raw data can be explored interactively with the < gnomadStr | Table Browser or the < gnomadStr | Data Integrator. For automated < gnomadStr | analysis, the data may be queried from our < gnomadStr | REST API. The underlying bigBed < gnomadStr | file can be downloaded from our < gnomadStr | download < gnomadStr | server.

< gnomadStr | < gnomadStr |

< gnomadStr | The complete gnomAD STR dataset, including individual-level genotypes, is available < gnomadStr | from the gnomAD downloads page. Interactive locus-level views with < gnomadStr | allele frequency histograms are available at the < gnomadStr | gnomAD STR browser.

< gnomadStr | < gnomadStr |

Credits

< gnomadStr |

< gnomadStr | Thanks to the gnomAD < gnomadStr | production team at the Broad Institute for generating and distributing this data.

< gnomadStr | < gnomadStr |

References

< gnomadStr |

< gnomadStr | Chen S, Francioli LC, Goodrich JK, Collins RL, Kanai M, Wang Q, Alföldi J, < gnomadStr | Watts NA, Vittal C, Gauthier LD et al. < gnomadStr | < gnomadStr | A genome-wide mutational constraint map quantified from variation in 76,156 human < gnomadStr | genomes. < gnomadStr | Nature. 2024;625:92–100. < gnomadStr |

< gnomadStr | < gnomadStr |

< gnomadStr | Dolzhenko E, Deshpande V, Schlesinger F, Krusche P, Petrovski R, Chen S, < gnomadStr | Emez D, Menten B, Narzisi G, Mohiyuddin M et al. < gnomadStr | < gnomadStr | ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem < gnomadStr | repeat regions. < gnomadStr | Bioinformatics. 2019;35(22):4754–4756. < gnomadStr |

< gnomadStr | 43893,43951d43427 < indigenomes | html < indigenomes |

Description

< indigenomes |

< indigenomes | IndiGenomes provides < indigenomes | whole genome sequencing data of 1,029 healthy Indian individuals under the pilot phase of the < indigenomes | "IndiGen" program. Only the allele frequency is available from this project. The website < indigenomes | also provides SV call and Alu insertion VCFs. < indigenomes |

< indigenomes | < indigenomes |

Data Access

< indigenomes |

< indigenomes | The data can be explored interactively with the < indigenomes | Table Browser or the < indigenomes | Data Integrator. < indigenomes | For programmatic access, our REST API can be used; the < indigenomes | track name is indigenomes. < indigenomes | For bulk download, the VCF file can be obtained from < indigenomes | our download server. < indigenomes |

< indigenomes |

< indigenomes | The original data can also be downloaded from the IndiGen website. < indigenomes |

< indigenomes | < indigenomes |

Methods

< indigenomes |

< indigenomes | Genomic DNA was extracted from 5 ml of peripheral blood collected via venipuncture from < indigenomes | 1,029 self-declared healthy Indian individuals representing diverse geographic, ethnic, and < indigenomes | linguistic groups, using the salting-out method. Whole-genome libraries were prepared using < indigenomes | the TruSeq DNA PCR-free library preparation kit (Illumina). Sequencing was performed on the < indigenomes | Illumina NovaSeq 6000 platform with 150×2 bp paired-end reads targeting ≥30× < indigenomes | mean coverage. Alignment to the GRCh38 reference genome, post-processing, and < indigenomes | default quality-filtered variant calling were performed end-to-end on the Illumina DRAGEN < indigenomes | v3.4 Bio-IT platform, which uses field-programmable gate array (FPGA) logic for < indigenomes | high-throughput processing. This yielded a compendium of 55,898,122 single allelic < indigenomes | genetic variants (SNVs and indels), of which 32.23% were unique to the Indian samples < indigenomes | and absent from global reference databases. Variants were annotated using ANNOVAR with < indigenomes | RefGene, and allele frequencies were cross-referenced against gnomAD v3, 1000 Genomes, < indigenomes | ExAC, ESP6500, and the Greater Middle East Variome Project. The dataset is accessible via < indigenomes | the IndiGenomes database < indigenomes | (Jain, Bhoyar, Scaria, Sivasubbu & the IndiGen Consortium, < indigenomes | Nucleic Acids Research 2021). < indigenomes |

< indigenomes |

< indigenomes | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < indigenomes | For some tracks, python scripts were necessary and are also available from GitHub. < indigenomes |

< indigenomes | < indigenomes |

References

< indigenomes |

< indigenomes | Jain A, Bhoyar RC, Pandhare K, Mishra A, Sharma D, Imran M, Senthivel V, Divakar MK, Rophina M, < indigenomes | Jolly B et al. < indigenomes | < indigenomes | IndiGenomes: a comprehensive resource of genetic variants from over 1000 Indian genomes. < indigenomes | Nucleic Acids Res. 2021 Jan 8;49(D1):D1225-D1232. < indigenomes | PMID: 33095885; PMC: PMC7778947 < indigenomes |

< indigenomes | 48985,49045d48460 < kova | html < kova |

Description

< kova |

< kova | The Korean Variant Archive (KOVA) < kova | contains data from 1,896 whole-genome and 3,409 whole-exome sequences of healthy < kova | individuals of Korean ethnicity. The samples originated from normal tissue of cancer < kova | patients (40.16%), healthy parents of rare disease patients (28.4%), or healthy volunteers < kova | (31.44%). Korean ancestry is not broken down further in the INFO field. Coverage is 100x for WES and 30x for WGS. < kova | SVs called with Manta are also available. < kova |

< kova | < kova |

Data Access

< kova |

< kova | Due to license restrictions, the data for this track cannot be downloaded from the UCSC < kova | Genome Browser. The Table Browser, Data Integrator, and download server are not available < kova | for this track. < kova |

< kova |

< kova | TSV data can be requested on the KOVA Downloads website. Our GitHub repo contains a script that < kova | converts this format to VCF. < kova |

< kova | < kova |

Methods

< kova |

< kova | Raw reads were aligned to the GRCh38+decoy reference using BWA-MEM v0.7.17 with default < kova | parameters, followed by duplicate marking and coordinate sorting with MarkDuplicatesSpark, and base < kova | quality score recalibration using BQSRPipelineSpark in GATK v4.1.3.0; mapping quality control < kova | metrics were generated with Qualimap v2.2.1. Single-nucleotide variants and small < kova | insertions/deletions were called per sample using GATK HaplotypeCaller in GVCF mode (-ERC GVCF), and < kova | joint genotyping was performed by creating a GenomicsDB with GenomicsDBImport and following GATK < kova | Best Practices, including variant quality score recalibration (VQSR) retaining 99.7% of true SNVs < kova | and 99.0% of true indels based on training sets (workflow detailed in Supplementary Fig. 1). < kova | Downstream analyses followed a modified version of the gnomAD quality-control framework and were < kova | primarily conducted using Hail; after merging WES and WGS data in Hail, multiallelic variants and < kova | variants with genotype quality <20, read depth <10, allelic balance <0.2, or overlapping < kova | low-complexity regions were excluded. < kova |
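The exclusion criteria listed above can be expressed as a single predicate. This is a simplified sketch, not the KOVA Hail pipeline: the function name and parameters are hypothetical, and applying the allelic-balance cutoff to every call (rather than, for example, heterozygous genotypes only) is an assumption.

```python
def keep_variant(n_alt_alleles, gq, dp, ab, in_lcr):
    """Return True if a call survives the exclusion criteria in the text.

    Excluded: multiallelic sites, genotype quality <20, read depth <10,
    allelic balance <0.2, or overlap with a low-complexity region.
    """
    return (
        n_alt_alleles == 1   # drop multiallelic variants
        and gq >= 20         # genotype quality
        and dp >= 10         # read depth
        and ab >= 0.2        # allelic balance
        and not in_lcr       # low-complexity region overlap
    )


if __name__ == "__main__":
    # Made-up values for illustration.
    print(keep_variant(1, 35, 22, 0.45, False))
    print(keep_variant(1, 35, 8, 0.45, False))
```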

< kova |

< kova | At UCSC, V7 of the TSV.gz was obtained from the KOVA staff by email and converted to VCF. It is not < kova | available for download from our site but can be requested from the KOVA website. < kova | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < kova | For some tracks, python scripts were necessary and are also available from GitHub. < kova |

< kova | < kova |

Credits

< kova |

< kova | Thanks to Insu Jang and the KOVA director for providing variant frequencies in TSV format. < kova |

< kova | < kova |

References

< kova |

< kova | Lee J, Lee J, Jeon S, Lee J, Jang I, Yang JO, Park S, Lee B, Choi J, Choi BO et al. < kova | < kova | A database of 5305 healthy Korean individuals reveals genetic and clinical implications for an East < kova | Asian population. < kova | Exp Mol Med. 2022 Nov;54(11):1862-1871. < kova | PMID: 36323850; PMC: PMC9628380 < kova |

< kova | 59472,59536d58886 < npm | html < npm |

Description

< npm |

< npm | The National Precision Medicine (NPM) program < npm | in Singapore sequenced 9,770 whole genomes, mostly of Chinese, Indian and Malay ancestry. < npm | A minimum allele count cutoff of >5 was applied. CNV data is also available. < npm |

< npm | < npm |

Data Access

< npm |

< npm | Due to license restrictions, the data for this track cannot be downloaded from the UCSC < npm | Genome Browser. The Table Browser, Data Integrator, and download server are not available < npm | for this track. < npm |

< npm |

< npm | VCF download can be requested on the Chorus Browser website, which requires an < npm | account and data access request. < npm |

< npm | < npm |

Methods

< npm |

< npm | Whole Genome Sequencing (WGS) data processing followed GATK4 best practices. The GATK4 germline variant < npm | analysis workflow, written in WDL, was adapted to use Nextflow and deployed at the National < npm | Supercomputing Centre, Singapore (NSCC). WGS reads were aligned against GRCh38 using the BWA-MEM < npm | algorithm and used as input to GATK HaplotypeCaller to produce single-sample gVCFs. The gVCF files < npm | were joint-called and then loaded into Hail. Low-quality WGS libraries and low-quality variants were < npm | removed. QC-ed variants were functionally annotated using Ensembl Variant Effect Predictor (VEP) < npm | (version 95). Functional annotations for variants impacting protein-coding regions were also < npm | complemented with information on the potential alteration to their cognate protein's 3D structure < npm | and drug binding ability. < npm |

< npm |

< npm | Our data access request was approved by the NPM data access committee. It can be contacted at contact_npco@a-star.edu.sg. < npm | We downloaded the data from the NPM Chorus browser download section. < npm | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < npm | For some tracks, python scripts were necessary and are also available from GitHub. < npm |

< npm | < npm |

Credits

< npm |

< npm | Thanks to the NPM Data Access Committee and Eleanor for granting our data request. < npm | By browsing the data, you agree to use the data only for academic, non-commercial < npm | research to improve human health (biology/disease). We request all data users < npm | agree to protect the confidentiality of the data subjects in any research papers or publications < npm | that they may prepare, by taking all reasonable care to limit the possibility < npm | of identification. In particular, the data users shall not use, or attempt < npm | to use, the data to deliberately compromise or otherwise infringe the < npm | confidentiality of information on data subjects and their right to privacy. < npm | If you use any of the data obtained from the CHORUS variant browser, we request < npm | that you cite the NPM flagship paper (Wong et al, 2023). All data users of the < npm | data must take note that the data provider and relevant SG10K_Health cohort < npm | owners bear no responsibility for the further analysis or interpretation of the data. < npm |

< npm | < npm |

References

< npm |

< npm | Wong E, Bertin N, Hebrard M, Tirado-Magallanes R, Bellis C, Lim WK, Chua CY, Tong PML, Chua R, Mak K < npm | et al. < npm | < npm | The Singapore National Precision Medicine Strategy. < npm | Nat Genet. 2023 Feb;55(2):178-186. < npm | PMID: 36658435 < npm |

< npm | 68776,68828d68125 < saudi | html < saudi |

Description

< saudi |

< saudi | Variant frequencies from 302 whole genomes at 30x coverage from the < saudi | Saudi Genome Program. The genotyping data and imputations from 3,352 < saudi | individuals do not seem to be available publicly. < saudi |

< saudi | < saudi |

Data Access

< saudi |

< saudi | The data can be explored interactively with the < saudi | Table Browser or the < saudi | Data Integrator. < saudi | For programmatic access, our REST API can be used; the < saudi | track name is saudi. < saudi | For bulk download, the VCF file can be obtained from < saudi | our download server. < saudi |
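As an illustration of the programmatic access mentioned above, the following minimal Python sketch builds a query URL for one region of a track. The endpoint `https://api.genome.ucsc.edu/getData/track` and its `genome`, `track`, `chrom`, `start` and `end` parameters are assumptions taken from the public REST API help page and should be checked there; the helper name `track_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the UCSC REST API; verify against the API help page.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def track_query_url(track, chrom, start, end, genome="hg38"):
    """Build a getData/track URL for one region of one track."""
    params = {
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    }
    return API_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # The region is arbitrary and only serves as an example.
    print(track_query_url("saudi", "chr1", 1000000, 1010000))
```

The sketch only constructs the URL, so it can be tested without network access; the returned string can then be fetched with any HTTP client.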

< saudi |

< saudi | The original data were downloaded from < saudi | Figshare and converted to VCF. < saudi |

< saudi | < saudi |

Methods

< saudi |

< saudi | Whole-genome sequencing of 302 Saudi Arabian individuals was performed on the Illumina HiSeq < saudi | X Ten platform using TruSeq Nano DNA library preparation at 30x target coverage. Sequencing and < saudi | initial bioinformatics processing were carried out by deCODE Genetics (Reykjavík, Iceland). < saudi | Reads were aligned to the GRCh38 reference genome using BWA 0.7.10. Per-sample variant calling < saudi | was performed with GATK HaplotypeCaller, followed by joint genotyping using CombineGVCFs and < saudi | GenotypeGVCFs. Variant quality score recalibration (VQSR) was applied for both SNPs and indels. < saudi | The final autosomal callset contains 25.5 million variants across the 302 individuals. < saudi |

< saudi |

< saudi | The variant data were downloaded from < saudi | Figshare and converted to VCF format using a custom script. < saudi | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < saudi | For some tracks, python scripts were necessary and are also available from GitHub. < saudi |

< saudi | < saudi |

References

< saudi |

< saudi | Malomane DK, Williams MP, Huber CD, Mangul S, Abedalthagafi M, Chiang CWK. < saudi | < saudi | Patterns of population structure and genetic variation within the Saudi Arabian population. < saudi | bioRxiv. 2025 Jan 13;. < saudi | PMID: 39868174; PMC: PMC11761371 < saudi |

< saudi | 68840,68914d68136 < schema | html < schema |

Description

< schema |

< schema | The SCHEMA (Schizophrenia Exome < schema | Meta-Analysis) consortium is an international collaboration that aggregated and harmonized < schema | whole-exome sequencing data to study the role of rare coding variants in schizophrenia. < schema | The dataset includes 24,248 cases and 97,322 controls from diverse global cohorts. < schema | SCHEMA identified genes with exome-wide significant rare variant burden in schizophrenia, < schema | providing insights into the biological underpinnings of the disorder. < schema |

< schema | < schema |

Data Access

< schema |

< schema | Since the data can be downloaded from the SCHEMA website, and does not seem to be under a license, < schema | we assume that we are allowed to redistribute it in VCF format. < schema | The data can be explored on our website interactively with the < schema | Table Browser or the < schema | Data Integrator. < schema | For programmatic access, our REST API can be used; the < schema | track name is schema. < schema | For bulk download, the VCF file can be obtained from < schema | our download server. < schema |

< schema |

< schema | Summary statistics and variant-level results are also available from the < schema | SCHEMA Browser. < schema |

< schema | < schema |

Methods

< schema |

< schema | The SCHEMA (Schizophrenia Exome Meta-Analysis) consortium aggregated whole-exome sequencing < schema | data from 24,248 schizophrenia cases and 97,322 controls (including non-psychiatric, < schema | non-neurological samples from the gnomAD consortium) across multiple international cohorts. < schema | Exome sequencing was performed using various capture platforms and Illumina sequencing < schema | instruments across cohorts sequenced over approximately a decade. Sequence data were < schema | uniformly reprocessed through the BWA-Picard-GATK best practices pipeline as part of the < schema | gnomAD v2 infrastructure, including alignment to GRCh37/hg19, duplicate marking, base < schema | quality score recalibration, and per-sample variant calling with GATK HaplotypeCaller, < schema | followed by joint genotyping across all samples. A novel exon-by-exon coverage estimation < schema | pipeline was developed to account for differences in capture technology across sequencing < schema | batches, and both site-level and genotype-level quality filters were applied. Protein-truncating < schema | variants (PTVs) were annotated using LOFTEE (Loss-Of-Function Transcript Effect Estimator), < schema | and missense variant deleteriousness was scored using MPC (Missense badness, PolyPhen-2, < schema | and Constraint). Gene-level association testing combined: (1) a case-control rare variant < schema | burden test aggregating ultra-rare PTVs (Class I: PTV and MPC > 3; Class II: missense < schema | MPC 2–3) across 18,321 protein-coding genes; and (2) de novo variant enrichment < schema | from 3,402 schizophrenia proband-parent trios assessed via a Poisson rate test against < schema | gnomAD-derived baseline mutation rates; with the two components combined using a weighted < schema | Z-score meta-analysis. This identified 10 genes at exome-wide significance (P < 2.14 < schema | × 10^-6) with odds ratios for PTVs ranging from 3 to 50, and 32 genes at < schema | FDR < 5%. 
Full data are available at < schema | schema.broadinstitute.org < schema | (Singh, Neale, Daly & the SCHEMA Consortium, < schema | Nature 2022). < schema |
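The weighted Z-score combination mentioned above can be sketched as follows. This is a generic weighted Z-score (Stouffer-type) formula, not the consortium's code: the text does not state which weights SCHEMA used, so the weights and the function names `weighted_z` and `one_sided_p` are placeholders for illustration.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(z_scores, weights):
    """Combine Z-scores as sum(w_i * z_i) / sqrt(sum(w_i ** 2))."""
    if len(z_scores) != len(weights) or not z_scores:
        raise ValueError("need equally many z-scores and weights")
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / sqrt(sum(w * w for w in weights))


def one_sided_p(z):
    """Upper-tail P value of a standard normal Z-score."""
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    # Hypothetical Z-scores for the case-control and de novo components.
    z = weighted_z([3.0, 4.0], [1.0, 1.0])
    print(z, one_sided_p(z))
```

With equal weights the formula reduces to the sum of the Z-scores divided by the square root of their number.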

< schema |

< schema | We downloaded the TSV data from the SCHEMA website < schema | and converted it to VCF format using a custom Python script. The VCF was lifted to hg38 using our hg19ToHg38 chain < schema | file. < schema | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < schema | For some tracks, python scripts were necessary and are also available from GitHub. < schema |
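A minimal sketch of the kind of TSV-to-VCF conversion described above is shown below. It is not the script used for the track: the `chrom-pos-ref-alt` identifier layout, the single pair of allele counts and the function name `variant_to_vcf_line` are hypothetical, and the actual conversion is the one documented in the makeDoc file.

```python
def variant_to_vcf_line(variant_id, ac, an):
    """Turn a 'chrom-pos-ref-alt' identifier plus allele counts into one VCF data line."""
    # Hypothetical input layout: fields separated by '-'.
    chrom, pos, ref, alt = variant_id.split("-")
    af = ac / an if an else 0.0
    info = f"AC={ac};AN={an};AF={af:.6g}"
    # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
    return "\t".join([chrom, pos, ".", ref, alt, ".", ".", info])


if __name__ == "__main__":
    print(variant_to_vcf_line("1-12345-A-G", 3, 1000))
```

A complete file would also need the VCF header lines that declare the INFO fields, and the lift from hg19 to hg38 is a separate step.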

< schema | < schema |

References

< schema |

< schema | Singh T, Poterba T, Curtis D, Akil H, Al Eissa M, Barchas JD, Bass N, Bigdeli TB, Breen G, < schema | Bromet EJ et al. < schema | < schema | Exome sequencing identifies rare coding variants in 10 genes which confer substantial risk for < schema | schizophrenia. < schema | Nature. 2022 Apr;604(7906):509-516. < schema | PMID: 35396579; PMC: PMC9392855 < schema |

< schema | 69035,69286d68256 < sfariSparkExomes | html < sfariSparkExomes |

Description

< sfariSparkExomes |

< sfariSparkExomes | The Simons Foundation Autism Research < sfariSparkExomes | Initiative (SFARI) recruited a large cohort of families with autistic children who provided < sfariSparkExomes | DNA samples and phenotypes. Parents and children from 54,558 families were sequenced: < sfariSparkExomes | 142,357 individuals by whole-exome sequencing (WES) and 12,519 by whole-genome sequencing (WGS). < sfariSparkExomes | The data contain 32,559 trios, 8,895 quads (including one sibling without autism), and 824 twins. < sfariSparkExomes |

< sfariSparkExomes | < sfariSparkExomes |

< sfariSparkExomes | The same frequencies shown here are also available publicly on the < sfariSparkExomes | SFARI Genome Browser. < sfariSparkExomes | See (SPARK et al, Neuron 2018) for details. < sfariSparkExomes |

< sfariSparkExomes | < sfariSparkExomes |

Data Access

< sfariSparkExomes |

< sfariSparkExomes | The data can be explored interactively with the < sfariSparkExomes | Table Browser or the < sfariSparkExomes | Data Integrator. < sfariSparkExomes | For programmatic access, our REST API can be used; the < sfariSparkExomes | track name is sfariSparkExomes. < sfariSparkExomes | For bulk download, the VCF file can be obtained from < sfariSparkExomes | our download server. < sfariSparkExomes |

< sfariSparkExomes |

< sfariSparkExomes | Allele frequencies can also be displayed on the < sfariSparkExomes | SFARI Genome Browser. < sfariSparkExomes | Full CRAMs and VCFs with genotypes are available from < sfariSparkExomes | SFARI Base. < sfariSparkExomes | They require a data access request, which is usually reviewed quickly. More information is < sfariSparkExomes | available in the < sfariSparkExomes | SPARK Welcome Packet. < sfariSparkExomes |

< sfariSparkExomes | < sfariSparkExomes |

Methods

< sfariSparkExomes | < sfariSparkExomes |

The genome browser track project was approved by the Simons Foundation under request < sfariSparkExomes | number 14584.1. WES and WGS data were downloaded from < sfariSparkExomes | SFARI Base. < sfariSparkExomes | pVCFs were downloaded, anonymized with a script using bcftools and its "fill-tags" plugin and < sfariSparkExomes | normalized. There was no minimum allele frequency cutoff.
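The bcftools "fill-tags" plugin recomputes INFO tags such as AC, AN and AF from the genotypes, so that summary frequencies remain once the per-sample columns are dropped. The Python sketch below illustrates that computation for one biallelic site only; it is not the anonymization script, and the function name `allele_stats` is hypothetical.

```python
def allele_stats(genotypes):
    """Return (AC, AN, AF) for the first ALT allele from GT strings such as '0/1'."""
    ac = 0
    an = 0
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') genotypes the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing calls do not count towards AN
            an += 1
            if allele == "1":
                ac += 1
    af = ac / an if an else 0.0
    return ac, an, af


if __name__ == "__main__":
    print(allele_stats(["0/1", "1|1", "./.", "0/0"]))
```

Missing genotypes are skipped, so AN can be smaller than twice the number of samples.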

< sfariSparkExomes | < sfariSparkExomes |

The methods are documented as follows by SFARI:

< sfariSparkExomes | < sfariSparkExomes |

< sfariSparkExomes | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < sfariSparkExomes | For some tracks, python scripts were necessary and are also available from GitHub. < sfariSparkExomes |

< sfariSparkExomes | < sfariSparkExomes |

References

< sfariSparkExomes |

< sfariSparkExomes | SPARK Consortium. Electronic address: pfeliciano@simonsfoundation.org, SPARK Consortium. < sfariSparkExomes | < sfariSparkExomes | SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research. < sfariSparkExomes | Neuron. 2018 Feb 7;97(3):488-493. < sfariSparkExomes | PMID: 29420931; PMC: PMC7444276 < sfariSparkExomes |

< sfariSparkExomes | < sfariSparkWgs | html < sfariSparkWgs |

Description

< sfariSparkWgs |

< sfariSparkWgs | The Simons Foundation Autism Research < sfariSparkWgs | Initiative (SFARI) recruited a large cohort of families with autistic children who provided < sfariSparkWgs | DNA samples and phenotypes. Parents and children from 54,558 families were sequenced: < sfariSparkWgs | 142,357 individuals by whole-exome sequencing (WES) and 12,519 by whole-genome sequencing (WGS). < sfariSparkWgs | The data contain 32,559 trios, 8,895 quads (including one sibling without autism), and 824 twins. < sfariSparkWgs |

< sfariSparkWgs | < sfariSparkWgs |

< sfariSparkWgs | The same frequencies shown here are also available publicly on the < sfariSparkWgs | SFARI Genome Browser. < sfariSparkWgs | See (SPARK et al, Neuron 2018) for details. < sfariSparkWgs |

< sfariSparkWgs | < sfariSparkWgs |

Data Access

< sfariSparkWgs |

< sfariSparkWgs | The data can be explored interactively with the < sfariSparkWgs | Table Browser or the < sfariSparkWgs | Data Integrator. < sfariSparkWgs | For programmatic access, our REST API can be used; the < sfariSparkWgs | track name is sfariSparkWgs. < sfariSparkWgs | For bulk download, the VCF file can be obtained from < sfariSparkWgs | our download server. < sfariSparkWgs |

< sfariSparkWgs |

< sfariSparkWgs | Allele frequencies can also be displayed on the < sfariSparkWgs | SFARI Genome Browser. < sfariSparkWgs | Full CRAMs and VCFs with genotypes are available from < sfariSparkWgs | SFARI Base. < sfariSparkWgs | They require a data access request, which is usually reviewed quickly. More information is < sfariSparkWgs | available in the < sfariSparkWgs | SPARK Welcome Packet. < sfariSparkWgs |

< sfariSparkWgs | < sfariSparkWgs |

Methods

< sfariSparkWgs | < sfariSparkWgs |

The genome browser track project was approved by the Simons Foundation under request < sfariSparkWgs | number 14584.1. WES and WGS data were downloaded from < sfariSparkWgs | SFARI Base. < sfariSparkWgs | pVCFs were downloaded, anonymized with a script using bcftools and its "fill-tags" plugin and < sfariSparkWgs | normalized. There was no minimum allele frequency cutoff.

< sfariSparkWgs | < sfariSparkWgs |

The methods are documented as follows by SFARI:

< sfariSparkWgs | < sfariSparkWgs |

< sfariSparkWgs | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < sfariSparkWgs | For some tracks, python scripts were necessary and are also available from GitHub. < sfariSparkWgs |

< sfariSparkWgs | < sfariSparkWgs |

References

< sfariSparkWgs |

< sfariSparkWgs | SPARK Consortium. Electronic address: pfeliciano@simonsfoundation.org, SPARK Consortium. < sfariSparkWgs | < sfariSparkWgs | SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research. < sfariSparkWgs | Neuron. 2018 Feb 7;97(3):488-493. < sfariSparkWgs | PMID: 29420931; PMC: PMC7444276 < sfariSparkWgs |

< sfariSparkWgs | 88088,88227d87057 < strchive | html < strchive |

Description

< strchive |

< strchive | The STRchive track displays 75 disease-associated short tandem repeat (STR) loci < strchive | curated by the STRchive project. < strchive | STRchive is a dynamic, community-driven resource that compiles population-level and < strchive | locus-specific data for tandem repeat loci implicated in human genetic diseases.

< strchive | < strchive |

< strchive | Tandem repeat expansion disorders are caused by the expansion of short repetitive DNA < strchive | sequences beyond a pathogenic threshold. These expansions can cause a wide range of < strchive | neurological, neuromuscular, and developmental disorders, including Huntington disease, < strchive | fragile X syndrome, Friedreich ataxia, and many forms of spinocerebellar ataxia.

< strchive | < strchive |

< strchive | This track shows the genomic positions of disease-associated STR loci from the STRchive < strchive | catalog, along with the reference and pathogenic repeat motifs, minimum pathogenic repeat < strchive | count thresholds, mode of inheritance, and associated diseases. The data are based on < strchive | the GRCh38/hg38 reference assembly.

< strchive | < strchive |

Display Conventions

< strchive |

< strchive | Items are colored by mode of inheritance:

< strchive | < strchive | < strchive |

< strchive | Each item is labeled by its STRchive locus ID, which combines the disease abbreviation < strchive | and gene symbol (e.g., "HD_HTT" for Huntington disease at the HTT < strchive | gene). Hovering over an item shows the repeat motif, gene, pathogenic threshold, < strchive | and inheritance mode. Clicking an item links to the corresponding < strchive | STRchive locus page with detailed < strchive | clinical and population-level information.

< strchive | < strchive |

Methods

< strchive |

< strchive | The STRchive disease locus catalog was downloaded from the < strchive | STRchive GitHub < strchive | repository (file STRchive-disease-loci.hg38.general.bed). The catalog is < strchive | manually curated by the STRchive team from published literature and contains loci where < strchive | tandem repeat expansions have been reported to cause or be associated with human disease.

< strchive | < strchive |

< strchive | For each locus, the catalog provides:

< strchive | < strchive | < strchive |

< strchive | The BED file was converted to bigBed format for display in the Genome Browser. Coordinates < strchive | were used as provided (0-based half-open BED format).
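Because BED coordinates are 0-based and half-open, the start must be shifted by one when a locus is written as a 1-based position string. A minimal Python sketch of that conversion follows; the function names are hypothetical and the coordinates in the example are arbitrary, not a real STRchive locus.

```python
def bed_to_position(chrom, chrom_start, chrom_end):
    """Format a 0-based half-open BED interval as a 1-based inclusive position."""
    if chrom_end < chrom_start:
        raise ValueError("chromEnd must not be smaller than chromStart")
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


def bed_length(chrom_start, chrom_end):
    """Number of bases covered by a half-open interval."""
    return chrom_end - chrom_start


if __name__ == "__main__":
    print(bed_to_position("chr4", 100, 160), bed_length(100, 160))
```

The interval length is simply end minus start, with no correction of one, which is the main convenience of the half-open convention.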

< strchive | < strchive |

Data Access

< strchive |

< strchive | The raw data can be explored interactively with the < strchive | Table Browser or the < strchive | Data Integrator. For automated < strchive | analysis, the data may be queried from our < strchive | REST API. The underlying bigBed < strchive | file can be downloaded from our < strchive | download < strchive | server.

< strchive | < strchive |

< strchive | The complete STRchive dataset, including additional annotations not shown in this track, < strchive | is available from strchive.org and < strchive | the STRchive GitHub < strchive | repository. The data are released under a < strchive | CC BY 4.0 < strchive | license.

< strchive | < strchive |

Credits

< strchive |

< strchive | Thanks to Harriet Dashnow (University of Colorado), Laurel Hiatt (University of Utah), < strchive | Ben Weisburd (Broad Institute), and the STRchive team for creating and maintaining this < strchive | resource.

< strchive | < strchive |

References

< strchive |

< strchive | Hiatt L, Weisburd B, Dolzhenko E, Rubinetti V, Rehm HL, Gymrek M, Dashnow H. < strchive | < strchive | STRchive: a dynamic resource detailing population-level and locus-specific insights < strchive | at tandem repeat disease loci. < strchive | Genome Med. 2025;17(1):30. < strchive | PMID: 40140942 < strchive |

< strchive | < strVar | html < strVar |

Description

< strVar |

< strVar | Tandem repeats are among the most polymorphic loci in the genome due to high < strVar | rates of repeat unit insertions and deletions caused primarily by polymerase slippage < strVar | during DNA replication. < strVar | The Tandem Repeat Variation track contains a collection of tracks < strVar | displaying population-level genetic variation at tandem repeat loci across < strVar | the human genome. Short tandem repeats (STRs), also known as microsatellites, are consecutive repetitions of < strVar | 1-6 nucleotide motifs. Variable Number Tandem Repeats (VNTRs) are tandem repeats of typically 7-100bp. < strVar |

< strVar | < strVar |

< strVar | This super track provides genome-wide tandem repeat annotations, allele frequency data from < strVar | large-scale population cohorts, and curated disease-associated STR loci.

< strVar | < strVar |

Note that the gnomAD track container also includes an STR variation track, which is not part < strVar | of the container here.

< strVar | < strVar |

Tracks in this collection

< strVar | < strVar | < strVar |

Credits

< strVar |

< strVar | Thanks to the data providers of the individual tracks listed above. < strVar | See each track's documentation page for specific credits.

< strVar | 88333,88402d87162 < swefreq | html < swefreq |

Description

< swefreq |

< swefreq | SweGen provides < swefreq | whole-genome sequencing variant frequencies for 1,000 Swedish individuals. < swefreq | The 1,000 individuals represent a cross-section of the Swedish population and no disease < swefreq | information was used for the selection. The frequency data may therefore include genetic variants < swefreq | that are associated with, or causative of, disease. SweGen also provides SV calls, TEs, MELT < swefreq | results for TEs, HLAs and a FASTA file with new sequence not in hg38. There is < swefreq | also a version for the T2T CHM13 assembly. The full dataset can be browsed at < swefreq | the < swefreq | SweGen Browser. < swefreq |

< swefreq | < swefreq |

Data Access

< swefreq |

< swefreq | Due to license restrictions, the data for this track cannot be downloaded from the UCSC < swefreq | Genome Browser. The Table Browser, Data Integrator, and download server are not available < swefreq | for this track. < swefreq |

< swefreq |

< swefreq | VCF files can be requested at < swefreq | SweGen via a form. The request < swefreq | needs manual approval, which usually is quick. If there is no reply, email SweGen directly. < swefreq |

< swefreq | < swefreq |

Methods

< swefreq |

< swefreq | DNA was fragmented to a size of 350 bp on a Covaris E220. Paired-end sequencing with 150bp read length was performed < swefreq | on Illumina HiSeq X (HiSeq Control Software 3.3.39/RTA 2.7.1) with v2.5 sequencing chemistry. < swefreq | Raw whole-genome reads were aligned to the GRCh37 reference using BWA-MEM v0.7.12, then sorted and < swefreq | indexed with samtools v0.1.19 and assessed with qualimap v2.2.20; per-sample alignments from < swefreq | multiple lanes and flow cells were merged using Picard MergeSamFiles v1.120. Processing followed < swefreq | GATK best practices with GATK v3.3, including indel realignment (RealignerTargetCreator, < swefreq | IndelRealigner), duplicate marking (Picard MarkDuplicates v1.120), and base quality score < swefreq | recalibration (BaseRecalibrator), producing one finalized BAM per sample. Per-sample gVCFs were < swefreq | generated with GATK HaplotypeCaller v3.3 using reference files from the GATK v2.8 resource bundle, < swefreq | with all steps coordinated via Piper v1.4.0. Joint genotyping of 1,000 samples was performed by < swefreq | merging gVCFs in five batches of 200 using GATK CombineGVCFs, followed by cohort genotyping with < swefreq | GATK GenotypeGVCFs and variant quality score recalibration for SNVs and indels using < swefreq | VariantRecalibrator and ApplyRecalibration. < swefreq |

< swefreq |

< swefreq | At UCSC, the hg38 VCF was downloaded from < swefreq | SweFreq and loaded as-is. < swefreq | The file that we use is swegen_frequencies_fixploidy_GRCh38_20190204.vcf.gz. < swefreq | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < swefreq | For some tracks, python scripts were necessary and are also available from GitHub. < swefreq |

< swefreq | < swefreq |

Credits

< swefreq |

< swefreq | The SweGen allele frequency data was generated by Science for Life Laboratory. < swefreq | Any redistributed data derived from the SweGen data set must follow the SweGen terms and conditions. < swefreq | The data may not be used to attempt to identify any individual in this or other studies. < swefreq | Thanks to the SweGen patients and SciLifeLab for making the data available. < swefreq |

< swefreq | < swefreq |

References

< swefreq |

< swefreq | Ameur A, Dahlberg J, Olason P, Vezzi F, Karlsson R, Martin M, Viklund J, Kähäri AK, < swefreq | Lundin P, Che H et al. < swefreq | < swefreq | SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish < swefreq | population. < swefreq | Eur J Hum Genet. 2017 Nov;25(11):1253-1260. < swefreq | PMID: 28832569; PMC: PMC5765326 < swefreq |

< swefreq | 89816,89867d88575 < tommo60kjpn | html < tommo60kjpn |

Description

< tommo60kjpn |

< tommo60kjpn | An allele frequency panel based on short-read whole-genome sequencing analysis of 61,000 Japanese < tommo60kjpn | individuals, produced by the < tommo60kjpn | Tohoku Medical Megabank < tommo60kjpn | Organization (ToMMo) at Tohoku University. The project includes other datatypes such as STRs, < tommo60kjpn | long-read SVs and short-read CNVs. < tommo60kjpn |

< tommo60kjpn | < tommo60kjpn |

Data Access

< tommo60kjpn |

< tommo60kjpn | The data can be explored interactively with the < tommo60kjpn | Table Browser or the < tommo60kjpn | Data Integrator. < tommo60kjpn | For programmatic access, our REST API can be used; the < tommo60kjpn | track name is tommo60kjpn. < tommo60kjpn | For bulk download, the VCF file can be obtained from < tommo60kjpn | our download server. < tommo60kjpn |

< tommo60kjpn |

< tommo60kjpn | The original data can also be downloaded from the jMorp website, specifically the < tommo60kjpn | Downloads section. < tommo60kjpn |

< tommo60kjpn | < tommo60kjpn |

Methods

< tommo60kjpn |

< tommo60kjpn | Genomic DNA was obtained from peripheral blood, saliva, or cord blood samples. Sequencing was < tommo60kjpn | performed on Illumina HiSeq 2500, HiSeq X Five, NovaSeq 6000, and MGI DNBSeq G400/T7 instruments. < tommo60kjpn | Reads were aligned to the GRCh38 reference using BWA 0.7.15 or BWA-mem2 2.1. Alignments underwent < tommo60kjpn | base quality score recalibration (BQSR) with the GATK BaseRecalibrator tool. SNV/indel calling was < tommo60kjpn | performed using GATK HaplotypeCaller, followed by multisample joint genotyping with Sentieon < tommo60kjpn | Genomics tools and variant quality score recalibration (VQSR) filtering. Related samples were < tommo60kjpn | identified and removed using KING 2.3.1, resulting in the final allele frequency panel. < tommo60kjpn |

< tommo60kjpn |

< tommo60kjpn | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < tommo60kjpn | For some tracks, python scripts were necessary and are also available from GitHub. < tommo60kjpn |

< tommo60kjpn | < tommo60kjpn |

References

< tommo60kjpn |

< tommo60kjpn | Tadaka S, Kawashima J, Hishinuma E, Saito S, Okamura Y, Otsuki A, Kojima K, Komaki S, Aoki Y, Kanno < tommo60kjpn | T et al. < tommo60kjpn | < tommo60kjpn | jMorp: Japanese Multi-Omics Reference Panel update report 2023. < tommo60kjpn | Nucleic Acids Res. 2024 Jan 5;52(D1):D622-D632. < tommo60kjpn | PMID: 37930845; PMC: PMC10767895 < tommo60kjpn |

< tommo60kjpn | 89918,89972d88625 < topmed | html < topmed |

Description

< topmed |

< topmed | NHLBI TOPMed (Trans-Omics for Precision < topmed | Medicine) is a program launched by the U.S. National Heart, Lung, and Blood Institute that < topmed | integrates whole-genome sequencing with molecular, clinical, and environmental data from large, < topmed | well-phenotyped cohorts. Its goal is to uncover the biological mechanisms underlying heart, lung, < topmed | blood, and sleep disorders to advance precision medicine and improve population health. Freeze 10 < topmed | contains 868,581,653 variants from 150,899 whole genomes. < topmed |

< topmed | < topmed |

Data Access

< topmed |

< topmed | The data can be explored interactively with the < topmed | Table Browser or the < topmed | Data Integrator. < topmed | For programmatic access, our REST API can be used; the < topmed | track name is topmed. < topmed | For bulk download, the VCF file can be obtained from < topmed | our download server. < topmed |

< topmed |

< topmed | VCFs with summarized allele frequencies are also available from < topmed | the TOPMED BRAVO website. They require a < topmed | login. The VCFs were downloaded from < topmed | BRAVO. < topmed |

< topmed | < topmed |

Methods

< topmed |

< topmed | TOPMed whole genome sequencing was performed at multiple NHLBI-funded sequencing centers < topmed | using PCR-free library preparation with 150 bp paired-end reads on Illumina short-read < topmed | platforms, targeting ≥30x mean coverage. Reads were aligned to the GRCh38 reference genome < topmed | (hs38DH, including decoy sequences) using BWA-MEM, followed by duplicate marking with < topmed | Picard MarkDuplicates and base quality score recalibration (BQSR) with GATK. Variant calling < topmed | was performed using the TOPMed GotCloud pipeline (developed at the Center for Statistical < topmed | Genetics, University of Michigan), comprising: (1) per-sample candidate variant detection with < topmed | vt discover2 and normalization with vt normalize; (2) cross-sample variant site < topmed | consolidation using cramore vcf-merge-candidate-variants; (3) joint genotyping across all < topmed | samples; and (4) variant filtering using a Support Vector Machine (SVM) classifier < topmed | (libsvm) trained on positive labels derived from HapMap 3.3 and 1000 Genomes Omni2.5 < topmed | array sites, and negative labels derived from Mendelian-inconsistent variants identified < topmed | within the cohort's pedigree structure using vt milk-filter. Sample-level quality < topmed | control included estimation of DNA contamination, genetic ancestry, and biological sex < topmed | using cramore cram-verify-bam (verifyBamID2) and relative X/Y chromosomal depth. Full < topmed | methods for TOPMed freeze 10 are available on the < topmed | TOPMed WGS Methods page. < topmed |

< topmed | < topmed |

< topmed | We provide documentation that indicates how all source files of the varFreqs track were converted in the makeDoc file of the track. < topmed | For some tracks, python scripts were necessary and are also available from GitHub. < topmed |

< topmed | 90984,91104d89636 < trexplorer | html < trexplorer |

Description

< trexplorer |

< trexplorer | The TRExplorer track displays 5,599,658 tandem repeat (TR) loci from the < trexplorer | TRExplorer < trexplorer | catalog. Tandem repeats are adjacent copies of a short DNA sequence motif; they include < trexplorer | short tandem repeats (STRs, motifs of 1–6 bp) and variable number tandem repeats < trexplorer | (VNTRs, longer motifs). TRs are among the most polymorphic and mutationally active loci < trexplorer | in the human genome, contributing to gene expression variation, complex disease risk, < trexplorer | and over 60 known Mendelian disorders.

< trexplorer | < trexplorer |

< trexplorer | The catalog integrates loci from multiple sources, including perfect repeats in the < trexplorer | reference genome, polymorphic TRs discovered in T2T assemblies and the Illumina 174k < trexplorer | cohort, HipSTR catalog loci, and curated disease-associated repeat expansions. Each < trexplorer | locus is annotated with repeat purity, gene context, disease associations, and < trexplorer | population allele frequency data from up to three cohorts.

< trexplorer | < trexplorer |

Display Conventions

< trexplorer |

< trexplorer | Items are colored by the length of the repeat motif (period):

< trexplorer | < trexplorer | < trexplorer |

< trexplorer | Items are labeled by the repeat motif sequence (truncated with “..” for < trexplorer | motifs longer than 25 characters). The BED score reflects repeat purity (0–1000). < trexplorer | Hovering over an item shows the full motif, motif size, number of reference copies, < trexplorer | repeat purity, gene annotation, and data source.
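The labeling and scoring rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the conversion code: it assumes that the first 25 characters of a long motif are kept before the ".." marker and that purity is given as a fraction between 0 and 1; both function names are hypothetical.

```python
MAX_LABEL_LEN = 25  # threshold stated in the track description


def motif_label(motif):
    """Shorten motifs longer than 25 characters and mark the cut with '..'."""
    if len(motif) <= MAX_LABEL_LEN:
        return motif
    # Assumption: the first 25 characters are kept before the '..' marker.
    return motif[:MAX_LABEL_LEN] + ".."


def purity_to_score(purity):
    """Map a purity fraction in [0, 1] to an integer BED score in [0, 1000]."""
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    return round(purity * 1000)


if __name__ == "__main__":
    print(motif_label("CAG"), motif_label("A" * 30), purity_to_score(0.987))
```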

< trexplorer | < trexplorer |

< trexplorer | Clicking an item opens the details page, which includes a link to the corresponding < trexplorer | TRExplorer locus < trexplorer | page with interactive allele frequency visualizations.

< trexplorer | < trexplorer |

Population Frequency Data

< trexplorer |

< trexplorer | Allele frequency histograms are available for two cohorts where genotyping was < trexplorer | performed:

< trexplorer | < trexplorer |

< trexplorer | For each cohort, two parallel fields store allele sizes (in repeat copy numbers) and < trexplorer | their corresponding counts, preserving the original order for histogram visualization. < trexplorer | Summary allele counts are also available for the AoU1027 cohort (1,027 HiFi PacBio samples from the All of Us Research Program genotyped using TRGT-LPS).

< trexplorer | < trexplorer |

Data Sources

< trexplorer |

< trexplorer | Loci in this catalog were compiled from multiple sources:

< trexplorer | < trexplorer | < trexplorer |

Methods

< trexplorer |

< trexplorer | The TRExplorer catalog was built by merging tandem repeat annotations from multiple < trexplorer | reference-based and population-based discovery approaches. For each locus, the repeat < trexplorer | motif, copy number, and purity were determined from the GRCh38 reference sequence. < trexplorer | Gene annotations were derived from MANE Select transcripts (with fallback to Gencode). < trexplorer | Population allele frequencies were obtained by genotyping large cohorts using < trexplorer | ExpansionHunter and other TR genotyping tools.

< trexplorer | < trexplorer |

< trexplorer | For the UCSC Genome Browser track, the source catalog (TSV format) was converted to < trexplorer | bigBed format. Coordinates in the source data are already 0-based half-open (BED < trexplorer | convention). Allele frequency histograms were split into parallel size and count fields < trexplorer | to facilitate visualization. Items were colored by motif period using the same scheme as < trexplorer | the WebSTR track.

< trexplorer | < trexplorer |

Data Access

< trexplorer |

< trexplorer | The raw data can be explored interactively with the < trexplorer | Table Browser or the < trexplorer | Data Integrator. For automated < trexplorer | analysis, the data may be queried from our < trexplorer | REST API. The underlying bigBed < trexplorer | file can be downloaded from our < trexplorer | download < trexplorer | server.

< trexplorer | < trexplorer |

< trexplorer | The complete TRExplorer dataset and interactive tools are available from the < trexplorer | TRExplorer web < trexplorer | portal at the Broad Institute.

< trexplorer | < trexplorer |

Credits

< trexplorer |

Thanks to Ben Weisburd, Egor Dolzhenko, and the TRExplorer team for making these data available.

< trexplorer | < trexplorer |

References

< trexplorer |

< trexplorer | Ben Weisburd, Egor Dolzhenko, Mark F. Bennett, Matt C. Danzi, Isaac R. L. Xu, Hope Tanudisastro, Bida Gu, Adam English, Laurel Hiatt, Tom Mokveld, Guilherme De Sena Brandine, Readman Chiu, Nehir Edibe Kurtas, Helyaneh Ziaei Jam, Harrison Brand, Indhu Shree Rajan Babu, Melanie Bahlo, Mark JP Chaisson, Stephan Züchner, Melissa Gymrek, Harriet Dashnow, Michael A. Eberle, Heidi L. Rehm < trexplorer | < trexplorer | TRExplorer: A comprehensive catalog of tandem repeat variation in the human genome. < trexplorer | bioRxiv. 2024. < trexplorer | doi: 10.1101/2024.10.04.615514 < trexplorer |

< trexplorer | 93261,93690d91792 < varFreqs | html < varFreqs |

Description

< varFreqs |

< varFreqs | This supertrack collects variant allele frequencies from population-scale sequencing and < varFreqs | genotyping projects worldwide, from a total of ~1.7 million genomes, exomes, and arrays. < varFreqs | The data were not reprocessed in a harmonized way; instead, the variant VCFs were collected as released by each project. < varFreqs | The goal is to provide a single place to compare how common < varFreqs | a variant is across different populations, ancestries, and cohorts, for < varFreqs | projects that gnomAD cannot recompute soon. The main < varFreqs | combined track merges all databases into a single summary track, < varFreqs | with filters, summed population frequencies and recalculated protein-effect annotations. < varFreqs | In addition, there is one subtrack per project with the original VCF data and all the annotations that the project provides. < varFreqs | The projects use different pipelines and sequencing technologies; click any of the projects < varFreqs | above or below for a summary of their sample selection, sequencing assay and software pipeline. < varFreqs | Many projects do not allow us to distribute the data, but we document how the < varFreqs | data can be requested and provide all converters.

< varFreqs | < varFreqs |

< varFreqs | Data from projects that provide haplotype-phased genotypes can also be found < varFreqs | elsewhere: 1000 Genomes is also a separate track, and the phased genotypes of HGDP, SGDP, < varFreqs | HGDP+1000 Genomes and Mexico Biobank can also be found in the "Phased Variants" track. < varFreqs | Their VCF versions below show only the isolate frequency per variant. < varFreqs |

< varFreqs | < varFreqs |

Please contact us (genome@soe.ucsc.edu) if you know of a project that we should add. So far, < varFreqs | we have requested the following: UK Biobank (pending for a year), < varFreqs | Regeneron's Million Exomes and Mexico City Studies (request rejected), Taiwan Biobank (pending). < varFreqs |

< varFreqs | < varFreqs |

Combined Track (All Databases)

< varFreqs |

< varFreqs | The "All Databases Combined" track merges variants from all individual databases into a single < varFreqs | bigBed file with consequence annotations, a total of more than 1.2 billion variants from 1.7 million individuals. < varFreqs | The track supports filtering by variant type < varFreqs | (SNV, insertion, deletion, MNV), predicted consequence (missense, synonymous, stop gained, < varFreqs | frameshift, splice, intron, intergenic), source database, allele frequency (overall maximum < varFreqs | and per-database), and allele count (total or per-database). This track is useful either in dense mode, < varFreqs | for getting a quick overview of variant density across all projects, or with filters, to find < varFreqs | variants present in specific databases or within certain frequency ranges. Note that with the "clone track" < varFreqs | feature you can clone this track and have multiple versions, each with different filters activated. < varFreqs | You can also use our "Density mode" checkbox on the track configuration page to show a plot with the < varFreqs | density of variants passing a filter, one per track clone. < varFreqs |

< varFreqs | < varFreqs |

Available Datasets

Database | Region | N | Data Type | Cohort | Sub-populations | Downloadable from UCSC
All Databases combined | All below | 1.7mil | WGS/WES/imputed | - | - | No
AllOfUs v7 | USA | 245k | WGS | General population, diverse | European, East Asian, African, Indigenous American, Oceanian, South Asian | Yes
TOPMED Freeze 10 | USA | 151k | WGS | Heart, lung, blood, sleep disorder cohorts | - | Yes
SFARI SPARK WES | USA | 140k | WES | Autism families (parents + affected children) | - | No
SFARI SPARK WGS | USA | 12.5k | WGS | Autism families (parents + affected children) | - | No
NCBI ALFA R4 | USA | 408k | WGS/WES/array mix | Aggregated dbGaP studies, mixed phenotypes | - | Yes
FinnGen R12 | Finland | 500k | Imputed (8.5k WGS ref panel) | National biobank, ~10% of population | - | Yes
SweGen | Sweden | 1k | WGS | Cross-section of Swedish population | - | No
SCHEMA | Multi-national | 121k | WES | Schizophrenia: 24k cases, 97k controls | - | Yes
Japan ToMMO 61k | Japan | 61k | WGS | General population | - | Yes
Australia MGRB | Australia | 4k | WGS | Healthy elderly (age ≥70) | - | No
GenomeAsia Pilot | Asia (219 groups) | 1.7k | WGS | Diverse populations across Asia | Northeast Asian, Southeast Asian, South Asian | Yes
ABraOM Brazil | Brazil | 1.2k | WGS | Elderly admixed individuals (São Paulo) | - | Yes
IndiGenomes | India | 1k | WGS | Healthy individuals | - | Yes
KOVA Korea | Korea | 5.3k | 1.9k WGS + 3.4k WES | Normal tissue from cancer patients, healthy parents, volunteers | - | No
NPM Singapore | Singapore | 9.8k | WGS | Chinese, Indian, Malay ancestry | - | No
Saudi Genome | Saudi Arabia | 302 | WGS (30x) | Saudi population | - | Yes
HRC | Multi-national | ~30k | Low-coverage WGS (7x) | Imputation reference panel (excl. 1000 Genomes) | - | Yes
MXB Mexico Biobank | Mexico | 6k | Genotyping array | Diverse Mexican ancestries, 898 recruitment sites | By state, by ancestry | No
SGDP | Global | 279 | WGS | 142 diverse populations worldwide | By population | Yes
GREGoR R4 | USA | 3.6k | WGS | Rare disease families (10.7k participants, 4.4k families) | - | No
gnomAD HGDP+1kG | Global | 4k | WGS | 80 populations (HGDP + 1000 Genomes reprocessed) | 80 populations, continental groups | Yes
< varFreqs | < varFreqs |

Display Conventions

< varFreqs | < varFreqs |

Most tracks only show the variant and allele frequencies on mouseover or clicks. < varFreqs | When zoomed in, tracks display alleles with base-specific coloring. Homozygote < varFreqs | data are shown as one letter, while heterozygotes will be displayed with both < varFreqs | letters. All VCF files are normalized, with one single allele per annotation (no multi-allele < varFreqs | lines). < varFreqs |

< varFreqs | < varFreqs |

Data Access

< varFreqs |

All the data is publicly available. The table above indicates whether we are allowed to distribute it in VCF format. Most of the databases do not allow us to redistribute the data files directly from our website, but the data can always be downloaded from the original websites in some form. Click the database link in the table above and see the "Data Access" section of the respective track for a description of where to download the data. When the data is freely available from our website, the Data Access section will also indicate the VCF file location on our download server. Because it contains some licensed data, the combined track is not available for download, but it can be recreated using the conversion scripts in our GitHub repository and the accompanying documentation file. < varFreqs |

< varFreqs | < varFreqs |

Credits

< varFreqs | < varFreqs |

This track is only possible thanks to the data from millions of volunteers around the world, who donated blood, signed consent forms and provided health information about themselves and sometimes their families. Click on any of the tracks in the list above to see the specific credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track and to Andreas Lahner, MGZ, for feedback.

< varFreqs | < varFreqsAll | html < varFreqsAll |

Description

< varFreqsAll |

< varFreqsAll | This track merges variants from all individual variant frequency databases into a single < varFreqsAll | bigBed file with predicted protein consequences and cross-database filtering. It contains < varFreqsAll | over 1.1 billion variants from 20 population databases worldwide. For a summary of < varFreqsAll | all available databases, see the < varFreqsAll | Variant Frequencies supertrack page. < varFreqsAll |

< varFreqsAll | < varFreqsAll |

< varFreqsAll | Each variant is annotated with its predicted consequence on protein-coding genes < varFreqsAll | (using bcftools csq with < varFreqsAll | Ensembl < varFreqsAll | gene models), and colored by severity. < varFreqsAll | Allele counts and frequencies are shown for each source database and, where available, < varFreqsAll | broken down by ancestry or population group. < varFreqsAll |

< varFreqsAll | < varFreqsAll |

Display Conventions

< varFreqsAll | < varFreqsAll |

Color by Consequence

< varFreqsAll |

Variants are colored by their most severe predicted consequence:

Color | Consequence class | Examples
Red | Protein-truncating / Loss-of-function | stop_gained, frameshift, splice_donor, splice_acceptor, stop_lost, start_lost
Blue | Missense / In-frame | missense, inframe_insertion, inframe_deletion, protein_altering
Green | Synonymous | synonymous, stop_retained
Grey | Non-coding / Intergenic | intron, non_coding, intergenic, UTR
< varFreqsAll | < varFreqsAll |
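
The coloring rule above (color by the most severe predicted consequence) can be sketched as a lookup over the four classes. This is an illustrative sketch only: the consequence terms are the examples listed in the table, while the function name, the severity ordering as a list, and the fallback to grey for unlisted terms are assumptions.

```python
# Classes ordered from most to least severe, using the example terms from the table
SEVERITY_ORDER = [
    ("red", {"stop_gained", "frameshift", "splice_donor", "splice_acceptor",
             "stop_lost", "start_lost"}),
    ("blue", {"missense", "inframe_insertion", "inframe_deletion", "protein_altering"}),
    ("green", {"synonymous", "stop_retained"}),
    ("grey", {"intron", "non_coding", "intergenic", "UTR"}),
]


def most_severe_color(consequences):
    """Return the color of the most severe consequence in the given list."""
    found = set(consequences)
    for color, terms in SEVERITY_ORDER:
        if found & terms:
            return color
    return "grey"  # assumption: terms not listed in the table are shown as grey


print(most_severe_color(["intron", "missense"]))
print(most_severe_color(["synonymous", "stop_gained"]))
```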

Amino Acid Change Notation

< varFreqsAll |

< varFreqsAll | The "AA change" field uses bcftools csq notation: 23I>23V means position < varFreqsAll | 23 changed from Isoleucine (I) to Valine (V) (missense). 23I alone (no arrow) < varFreqsAll | means position 23 is Isoleucine and unchanged (synonymous). A "*" indicates a < varFreqsAll | stop codon (e.g. 45R>45* is a stop_gained). < varFreqsAll |
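
The three notation cases described above can be parsed with a small regular expression. The sketch below covers only those cases (missense, synonymous, stop_gained); more complex bcftools csq strings, such as frameshifts, are not handled and are labeled "other". The function name and return layout are assumptions for illustration.

```python
import re

# position + residue(s), optionally followed by ">" and the changed position + residue(s)
AA_CHANGE = re.compile(r"^(\d+)([A-Z*]+)(?:>(\d+)([A-Z*]+))?$")


def describe_aa_change(text):
    """Return (position, reference residue, new residue, kind) for an AA change string."""
    match = AA_CHANGE.match(text)
    if match is None:
        raise ValueError(f"unrecognized AA change: {text!r}")
    pos, ref, _alt_pos, alt = match.groups()
    if alt is None:
        # No arrow: the residue is unchanged
        return int(pos), ref, ref, "synonymous"
    if "*" in alt and "*" not in ref:
        kind = "stop_gained"
    elif len(ref) == 1 and len(alt) == 1 and "*" not in ref:
        kind = "missense"
    else:
        kind = "other"  # cases not described in the text above
    return int(pos), ref, alt, kind


for example in ("23I>23V", "23I", "45R>45*"):
    print(example, describe_aa_change(example))
```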

< varFreqsAll | < varFreqsAll |

Filters

< varFreqsAll |

< varFreqsAll | This track supports extensive filtering via the track settings page. Click on the track < varFreqsAll | title or use the "Configure" button to access filters: < varFreqsAll |

< varFreqsAll | < varFreqsAll |

Variant Type and Consequence

< varFreqsAll | < varFreqsAll | < varFreqsAll |

How to find protein-truncating variants: Set the Consequence filter to include < varFreqsAll | only "Stop Gained", "Frameshift", "Splice Donor", and < varFreqsAll | "Splice Acceptor". These will appear as red items in the track display.

< varFreqsAll | < varFreqsAll |

Frequency and Count Filters

< varFreqsAll | < varFreqsAll | < varFreqsAll |

Source Database

< varFreqsAll |

< varFreqsAll | The Source Database filter lets you restrict to variants present in specific databases. < varFreqsAll | For example, select only "GREGoR" to see variants found in the rare disease cohort. < varFreqsAll | This filter uses OR logic: selecting multiple databases shows variants found in < varFreqsAll | any of the selected databases. < varFreqsAll |
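
The OR logic of this filter amounts to a set-intersection test. A minimal sketch, in which the function name and the database labels other than "GREGoR" are hypothetical, and in which an empty selection is assumed to mean "no filtering":

```python
def passes_source_filter(variant_sources, selected):
    """OR logic: keep the variant if it occurs in any of the selected databases."""
    if not selected:
        return True  # assumption: an empty selection disables the filter
    return not set(variant_sources).isdisjoint(selected)


# A variant seen in two databases passes when either one is selected
print(passes_source_filter({"GREGoR", "TOPMED"}, {"GREGoR"}))
print(passes_source_filter({"TOPMED"}, {"GREGoR", "SweGen"}))
```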

< varFreqsAll | < varFreqsAll |

Population-Specific Filters

< varFreqsAll |

< varFreqsAll | Several databases provide ancestry-specific allele frequencies: < varFreqsAll |

< varFreqsAll | < varFreqsAll | < varFreqsAll |

Length Filters

< varFreqsAll | < varFreqsAll | < varFreqsAll |

Methods

< varFreqsAll |

< varFreqsAll | Variant frequency VCF files from 20 databases were stripped of their INFO fields < varFreqsAll | (to reduce size), normalized with bcftools norm (splitting multi-allelic sites), < varFreqsAll | and merged with bcftools merge. The merged VCF was then annotated with predicted < varFreqsAll | protein consequences using bcftools csq with the < varFreqsAll | Ensembl < varFreqsAll | GRCh38 release 115 gene annotation (GFF3). < varFreqsAll |
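
The sequence of bcftools steps described above could be assembled roughly as follows. This is a sketch under stated assumptions, not the commands actually used for the track: the file names, the output naming scheme, the indexing step, and the specific flags (`-x INFO`, `-m -any`, `-m none`, `-Oz`) are assumptions; the makeDoc file remains the authoritative record. The function only builds the command lines and does not run them.

```python
import shlex


def build_pipeline(vcfs, gff3, fasta, merged="merged.vcf.gz", annotated="annotated.vcf.gz"):
    """Build bcftools command lines: strip INFO, split multi-allelics, merge, annotate."""
    commands = []
    normalized = []
    for vcf in vcfs:
        stripped = vcf.replace(".vcf.gz", ".noinfo.vcf.gz")
        norm = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        # Remove all INFO fields to reduce file size
        commands.append(["bcftools", "annotate", "-x", "INFO", "-Oz", "-o", stripped, vcf])
        # Split multi-allelic sites into one record per alternate allele
        commands.append(["bcftools", "norm", "-m", "-any", "-Oz", "-o", norm, stripped])
        # bcftools merge expects indexed inputs
        commands.append(["bcftools", "index", "-t", norm])
        normalized.append(norm)
    # Merge all sources without re-creating multi-allelic records
    commands.append(["bcftools", "merge", "-m", "none", "-Oz", "-o", merged, *normalized])
    # Predict protein consequences from a GFF3 gene annotation and reference FASTA
    commands.append(["bcftools", "csq", "-f", fasta, "-g", gff3, "-Oz", "-o", annotated, merged])
    return commands


# Hypothetical input names
for cmd in build_pipeline(["dbA.vcf.gz", "dbB.vcf.gz"], "genes.gff3", "hg38.fa"):
    print(shlex.join(cmd))
```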

< varFreqsAll | < varFreqsAll |

< varFreqsAll | The annotated VCF was converted to bigBed format using a custom Python script < varFreqsAll | (vcfToBigBed.py) that reads frequency data from each source VCF in parallel, < varFreqsAll | matches variants by position/ref/alt, and writes a BED file with consequence coloring, < varFreqsAll | per-database allele counts and frequencies, and population breakdowns. < varFreqsAll | The database configuration (which VCFs to include, field mappings, and population definitions) < varFreqsAll | is stored in two TSV files < varFreqsAll | (databases.tsv and < varFreqsAll | populations.tsv) < varFreqsAll | to make future updates easy. < varFreqsAll |
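
Matching variants by position/ref/alt, as described above, comes down to using the tuple (chromosome, position, ref, alt) as a lookup key. The sketch below is not the vcfToBigBed.py implementation: it loads small in-memory records instead of reading VCFs in parallel, and the function name, record layout, and database labels are hypothetical.

```python
def collect_frequencies(sources):
    """Group per-database allele counts and frequencies under a shared variant key.

    sources maps a database name to records of (chrom, pos, ref, alt, ac, an).
    """
    merged = {}
    for db_name, records in sources.items():
        for chrom, pos, ref, alt, ac, an in records:
            key = (chrom, pos, ref, alt)
            af = ac / an if an else 0.0
            merged.setdefault(key, {})[db_name] = (ac, af)
    return merged


# Hypothetical records: the same A>G variant is present in both databases
sources = {
    "dbA": [("chr1", 100, "A", "G", 5, 20)],
    "dbB": [("chr1", 100, "A", "G", 1, 8), ("chr1", 100, "A", "T", 2, 8)],
}
for key, per_db in collect_frequencies(sources).items():
    print(key, per_db)
```

Including ref and alt in the key keeps different alternate alleles at the same position apart, which matters once multi-allelic sites have been split.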

< varFreqsAll | < varFreqsAll |

< varFreqsAll | We provide documentation that indicates how all source files of the varFreqs track were < varFreqsAll | converted in the < varFreqsAll | makeDoc file of the track. < varFreqsAll | Scripts are available from < varFreqsAll | GitHub. < varFreqsAll |

< varFreqsAll | < varFreqsAll |

Credits

< varFreqsAll |

< varFreqsAll | This track is only possible thanks to the data from millions of volunteers around the world, < varFreqsAll | who donated blood, signed consent forms and provided health information about themselves and < varFreqsAll | sometimes their families. Click on any of the individual tracks in the < varFreqsAll | Variant Frequencies supertrack to see the specific < varFreqsAll | credits for each project. Thanks to Alex Ioannidis, UCSC, for the motivation for this track < varFreqsAll | and to Andreas Lahner, MGZ, for feedback. < varFreqsAll |

< varFreqsAll | 94061,94177d92162 < webstr | html < webstr |

Description

< webstr |

< webstr | The WebSTR track displays 1,710,833 short tandem repeat (STR) loci across the < webstr | human genome from the < webstr | WebSTR database. STRs (also known < webstr | as microsatellites) are consecutive repetitions of 1–6 nucleotide motifs that are < webstr | highly polymorphic due to repeat unit insertions and deletions caused primarily by < webstr | polymerase slippage during replication. Genetic variation at STRs has been shown to < webstr | influence gene expression, cancer risk, and neurodevelopmental traits.

< webstr | < webstr |

< webstr | This track is based on the EnsembleTR panel for the GRCh38/hg38 assembly, < webstr | which represents a combined set of tandem repeats genotyped by four separate methods < webstr | (HipSTR, GangSTR, ExpansionHunter, and AdVNTR) on data from the < webstr | 1000 Genomes Project < webstr | and H3Africa. < webstr | EnsembleTR < webstr | was applied to jointly genotype all 3,550 samples, producing consensus calls at < webstr | over 1.7 million autosomal tandem repeat loci.

< webstr | < webstr |

< webstr | The track includes allele frequency distributions for five 1000 Genomes continental < webstr | populations:

< webstr | < webstr | < webstr |

< webstr | For each population, allele frequencies are defined as the number of copies of each allele < webstr | divided by the total number of alleles in that population. Alleles are represented as < webstr | the number of repeat unit copies.
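
The frequency definition above (copies of each allele divided by the total number of alleles in the population) can be written out directly. A minimal sketch with a hypothetical function name and made-up allele calls, where each allele is given as a repeat copy number:

```python
from collections import Counter


def allele_frequencies(alleles):
    """Frequency of each allele = its count / total alleles observed in the population."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: count / total for allele, count in sorted(counts.items())}


# Two diploid samples contribute four alleles in total (made-up values)
print(allele_frequencies([10, 10, 11, 12]))
```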

< webstr | < webstr |

Display Conventions

< webstr |

< webstr | Items are colored by the length of the repeat motif (period):

< webstr | < webstr | < webstr |

< webstr | Each item is labeled by its WebSTR repeat ID. Hovering over an item shows the repeat < webstr | motif, number of reference copies, and motif period. Clicking an item links to the < webstr | corresponding < webstr | WebSTR locus page, which provides < webstr | interactive allele frequency histograms and additional annotations.

< webstr | < webstr |

Methods

< webstr |

< webstr | The EnsembleTR reference panel was constructed as follows:

< webstr |
  1. Tandem repeat reference sets from four genotyping tools (HipSTR, GangSTR, < webstr | ExpansionHunter, and AdVNTR) were merged.
  2. Each tool was run independently on 1000 Genomes and H3Africa whole-genome < webstr | sequencing data.
  3. EnsembleTR < webstr | was used to produce joint consensus genotype calls across all four methods.
  4. Loci called in fewer than 75% of samples were removed, yielding 1,710,833 loci.
  5. Allele frequencies were computed per population.
< webstr | < webstr |
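
The call-rate filter in the steps above (removing loci called in fewer than 75% of samples) can be illustrated as follows. This is a sketch, not the EnsembleTR code; the function name, locus identifiers, and call counts are hypothetical, and only the 75% threshold and the 3,550-sample total come from the text.

```python
def filter_by_call_rate(call_counts, n_samples, min_rate=0.75):
    """Keep loci that were called in at least min_rate of all samples."""
    return [
        locus
        for locus, called in call_counts.items()
        if called / n_samples >= min_rate
    ]


# Hypothetical per-locus call counts out of 3,550 samples
calls = {"locusA": 3000, "locusB": 2000, "locusC": 2663}
print(filter_by_call_rate(calls, 3550))
```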

< webstr | For the UCSC Genome Browser track, the source data were converted from CSV to bigBed < webstr | format. The 1-based start coordinates from the WebSTR database were converted to 0-based < webstr | half-open coordinates for the BED format. Per-population allele frequency distributions < webstr | are stored as extra bigBed fields.
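
The coordinate conversion mentioned above only shifts the start position. A minimal sketch, assuming the source end coordinate is 1-based and inclusive (an assumption; the text only states that the start is 1-based), with a hypothetical function name:

```python
def one_based_to_bed(start_1based, end_1based_inclusive):
    """Convert a 1-based inclusive interval to 0-based half-open BED coordinates."""
    # The start shifts down by one; the end value is unchanged because the
    # inclusive 1-based end equals the exclusive 0-based end.
    return start_1based - 1, end_1based_inclusive


bed_start, bed_end = one_based_to_bed(101, 120)
print(bed_start, bed_end, "length:", bed_end - bed_start)
```

In BED coordinates the interval length is simply end minus start, which is one reason the half-open convention is convenient.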

< webstr | < webstr |

Data Access

< webstr |

< webstr | The raw data can be explored interactively with the < webstr | Table Browser or the < webstr | Data Integrator. For automated < webstr | analysis, the data may be queried from our < webstr | REST API. The underlying bigBed < webstr | file can be downloaded from our < webstr | download < webstr | server.
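
For automated analysis, a REST API request for a region of this track might be assembled as shown below. This sketch only builds the URL and does not send it. The endpoint follows the public UCSC REST API (`/getData/track`), but the track name "webstr" is inferred from this page's track label, the region is an arbitrary example, and the track may not be present on every server.

```python
from urllib.parse import quote

API_BASE = "https://api.genome.ucsc.edu/getData/track"


def build_track_url(genome, track, chrom, start, end):
    """Build a getData/track URL with semicolon-separated name=value parameters."""
    params = [
        ("genome", genome),
        ("track", track),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    query = ";".join(f"{name}={quote(str(value))}" for name, value in params)
    return f"{API_BASE}?{query}"


# Arbitrary example region; track name assumed to be "webstr"
print(build_track_url("hg38", "webstr", "chr1", 1000000, 1100000))
```

The resulting URL can be passed to any HTTP client; the API returns JSON.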

< webstr | < webstr |

< webstr | The complete WebSTR dataset, including additional cohorts and data types not included in < webstr | this track, is available from the < webstr | WebSTR web portal. Programmatic < webstr | access to the full WebSTR database is available through the < webstr | WebSTR REST API.

< webstr | < webstr |

Credits

< webstr |

< webstr | Thanks to Melissa Gymrek (UC San Diego), Oxana Sachenkova Lundström < webstr | (Stockholm University / ZHAW), and the WebSTR team for providing the data for this track.

< webstr | < webstr |

References

< webstr |

< webstr | Sachenkova Lundström O, Adriaan Verbiest M, Xia F, Jam HZ, Zlobec I, < webstr | Anisimova M, Gymrek M. < webstr | < webstr | WebSTR: A Population-wide Database of Short Tandem Repeat Variation in Humans. < webstr | J Mol Biol. 2023 Oct 15;435(20):168260. < webstr | PMID: 37678708 < webstr |

< webstr | < webstr |

< webstr | Jam HZ, Revoir P, Gadgil R, Sun Y, Gymrek M. < webstr | < webstr | EnsembleTR: a tool for combining tandem repeat genotyping results. < webstr | Nat Biotechnol. 2024. < webstr |

< webstr |