aa61ebc800429515f9ced7e28f669c6042219f43 max Wed Mar 18 09:09:13 2026 -0700
varFreqs supertrack: add GREGoR track, update all HTML docs, move scripts to varFreqs/, refs #36642

Add GREGoR R04 WGS track to varFreqs superTrack. Update Data Access and Methods
sections for all 20+ subtrack HTML files with consistent formatting, sequencing
methods from source papers, and links to makeDoc and Github scripts. Move all
varFreqs conversion scripts into scripts/varFreqs/ subdirectory and update
makeDoc paths accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

diff --git src/hg/makeDb/trackDb/human/sfariSparkExomes.html src/hg/makeDb/trackDb/human/sfariSparkExomes.html
new file mode 100644
index 00000000000..e32c9ace340
--- /dev/null
+++ src/hg/makeDb/trackDb/human/sfariSparkExomes.html
@@ -0,0 +1,124 @@
+<h2>Description</h2>
+<p>
+The <a href="https://sparkforautism.org/" target="_blank">Simons Foundation Autism Research
+Initiative (SFARI)</a> recruited a large cohort of families with autistic children who provided
+DNA samples and phenotypes. In total, parents and children from 54,558 families were sequenced:
+142,357 individuals with whole-exome sequencing (WES) and 12,519 with whole-genome sequencing (WGS).
+The data contain 32,559 trios, 8,895 quads (which include one sibling without autism), and 824 twins.
+</p>
+
+<p>
+The frequencies shown here are also publicly available on the
+<a href="https://genomes.sfari.org/" target="_blank">SFARI Genome Browser</a>.
+See SPARK Consortium (Neuron, 2018) for details.
+</p>
+
+<h2>Data Access</h2>
+<p>
+The data can be explored interactively with the
+<a href="../cgi-bin/hgTables">Table Browser</a> or the
+<a href="../cgi-bin/hgIntegrator">Data Integrator</a>.
+For programmatic access, our <a href="https://api.genome.ucsc.edu">REST API</a> can be used; the
+track name is <em>sfariSparkExomes</em>.
+For bulk download, the VCF file can be obtained from
+<a href="http://hgdownload.soe.ucsc.edu/gbdb/$db/varFreqs/" target="_blank">our download server</a>.
+</p>
+<p>
+Allele frequencies can also be viewed on the
+<a href="https://genomes.sfari.org/" target="_blank">SFARI Genome Browser</a>.
+Full CRAMs and VCFs with genotypes are available from
+<a href="https://base.sfari.org/" target="_blank">SFARI Base</a>.
+These files require a data access request, which is usually reviewed quickly. More information is
+available in the
+<a href="https://cohorts-cdn.simonsfoundation.org/spark/researcher_packets/SPARK_SFARI_Researcher_Welcome_Packet.pdf"
+target="_blank">SPARK Welcome Packet</a>.
+</p>
+
+<h2>Methods</h2>
+
+<p>The genome browser track project was approved by the Simons Foundation under request
+number 14584.1. WES and WGS data were downloaded from
+<a href="https://base.sfari.org/" target="_blank">SFARI Base</a>.
+The pVCFs were anonymized with a script using bcftools and its "fill-tags" plugin, then
+normalized. No minimum allele frequency cutoff was applied.</p>
+
+<p>The methods are documented as follows by SFARI:</p>
+<ul>
+  <li>
+    <b>WGS</b>:
+    This release consists of sequence and variant call data for 12,519
+    unique individuals, of which 12,517 (99.98%) have available genome-wide
+    SNP genotype data. Sequencing and genotyping of all samples in this
+    release was performed at New York Genome Center (NYGC). DNA from saliva
+    samples was extracted and prepared with PCR-free methods and sequenced
+    with paired-end sequencing of 150 bases on the Illumina NovaSeq 6000
+    system. Alignment of reads to the human reference genome version
+    GRCh38, duplicate read marking, and Base Quality Score Recalibration
+    (BQSR) were performed by New York Genome Center (NYGC). Whole-genome
+    sequencing data were processed using a standardized, functionally
+    equivalent CCDG pipeline with alignment to the GRCh38DH (1000 Genomes)
+    reference using BWA-MEM v0.7.15 (deterministic settings, no -M, use of
+    .alt contigs), Picard-equivalent duplicate marking (Picard ≥2.4.1 or
+    equivalent), no indel realignment, and base quality score recalibration
+    with GATK (dbSNP138, Mills and 1000G gold-standard indels, known
+    indels). Final outputs were stored as lossless CRAM files with
+    complete SAM-compliant read-group annotations and mandatory 4-bin
+    base-quality compression (Q2-6, 10, 20, 30), and all implementations
+    were validated for functional equivalence across centers before use.
+    Variant calling was performed using DeepVariant. See
+    <a href="https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md"
+    target="_blank">CCDG pipeline details</a>.
+  </li>
+  <li>
+    <b>WES</b>: This release contains
+    sequence data for 142,357 individuals and genotyping data for
+    141,368 individuals. DNA was sequenced from saliva for all
+    samples and all participants consented to having their genetic
+    data shared by Regeneron. Exomes for all samples were sequenced with
+    short-read, paired-end sequencing of 150 bases on Illumina
+    NovaSeq 6000 machines using S2/S4 flow cells. Sequencing and
+    genotyping were performed across nine batches (WES1 through
+    WES9) at the Regeneron Genetics Center (RGC) and integrated
+    together for this data release. All sequencing batches were
+    processed using the same DNA extraction methods and sequencing
+    machines; however, two different exome capture panels were used,
+    as described below. Genotyping was performed using a SNP
+    genotyping array for WES1 through WES4 and using
+    "genotyping-by-sequencing" (GxS) for WES5 through WES9. The
+    first four sequencing batches were sequenced at Regeneron using
+    custom NEB/Kapa reagents with the IDT (Integrated DNA
+    Technologies) xGen capture platform, including custom exome
+    capture regions. Samples starting with batch WES5 were
+    sequenced using the Twist Bioscience Human
+    Comprehensive Exome panel, combined with spike-ins for
+    sequencing genotyping sites (see Genotyping Methods), the full
+    mitochondrial genome, and coverage boosted at selected sites
+    for assaying clonal hematopoiesis of indeterminate potential
+    (CHIP). SFARI performed SNV/indel calling via DeepVariant and
+    GATK to generate gVCFs. Pairwise relatedness was inferred using
+    PLINK v1.9 IBD estimates from common SNPs (AF ≥ 0.01, dbSNP
+    v151), with ≥15% relatedness flagged. Comprehensive
+    individual- and family-level quality control was executed using the
+    internal GenomeCheckMate pipeline to exclude samples based on
+    contamination (≥5%), insufficient coverage (&lt;20x in &lt;80% of
+    targets), sex discordance, pedigree/IBD inconsistencies,
+    unregistered relationships, unexpected duplicates, or excess
+    relatedness. QC-passing individuals (selecting the
+    most recent passing sample per person) were then retained for
+    variant calling and joint genotyping.
+  </li>
+</ul>
+<p>
+The conversion of all source files of the varFreqs track is documented in the <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/hg38/varFreqs.txt" target="_blank">makeDoc file</a> of the track.
+For some tracks, Python scripts were necessary; these are also available from <a href="https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/scripts/varFreqs" target="_blank">GitHub</a>.
+</p>
+
+<h2>References</h2>
+<p>
+SPARK Consortium.
+<a href="https://linkinghub.elsevier.com/retrieve/pii/S0896-6273(18)30018-7" target="_blank">
+SPARK: A US Cohort of 50,000 Families to Accelerate Autism Research</a>.
+<em>Neuron</em>. 2018 Feb 7;97(3):488-493.
+PMID: <a href="https://www.ncbi.nlm.nih.gov/pubmed/29420931" target="_blank">29420931</a>; PMC: <a
+href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7444276/" target="_blank">PMC7444276</a>
+</p>