3d9187d264d00ee8e681521bc2c942ee2527d4f1
max
Wed May 13 07:33:38 2026 -0700
varFreqs: add WBBC (Westlake BioBank for Chinese) subtrack from the Phase I v20210103 release: 4,480 WGS samples, 78.6M variants, per-region frequencies for the 4 Han Chinese geographic subgroups (North/Central/South/Lingnan). databases.tsv + populations.tsv updated for the next varFreqsAll rebuild. refs #36642

Co-Authored-By: Claude Opus 4.7 (1M context)

diff --git src/hg/makeDb/trackDb/human/wbbc.html src/hg/makeDb/trackDb/human/wbbc.html
new file mode 100644
index 00000000000..520409f384c
--- /dev/null
+++ src/hg/makeDb/trackDb/human/wbbc.html
@@ -0,0 +1,104 @@
+

Description

+

+This track shows allele frequencies for 78.6 million variants from +4,480 whole-genome-sequenced Chinese individuals released by the +Westlake BioBank for Chinese +(WBBC) pilot project. The WBBC is a population study of around 35,000 +Chinese volunteers spanning 31 provinces; about 15,000 of them have +been deeply phenotyped and a subset have been whole-genome sequenced. +The frequencies are also broken down into four Han Chinese regional +groups (North, Central, South, Lingnan) defined by recruitment province +in the WBBC paper. +

+ +

+The pilot project has been folded into the larger +China Precision BioBank +(CPBB) initiative, which is collecting up to 100,000 samples +nationwide. The variant frequencies on this track are from the original +WBBC Phase I release (v20210103) and are unchanged by the rebranding. +

+ +

Display

+

+The track uses the standard UCSC VCF display. Hovering over a variant shows +the cohort allele frequency, the four regional frequencies, sequencing +depth, GATK VQSR log-odds score, and the per-genotype hom-ref / het / +hom-alt sample counts as reported by WBBC. +

+ +

Methods

+

+The WBBC pilot whole-genome-sequenced 4,535 individuals at a mean depth +of around 13.9x on Illumina HiSeq X10 platforms, after dropping samples +that failed standard QC. Reads were aligned to GRCh38 with BWA-MEM, +variants were jointly called with GATK 4.0 HaplotypeCaller, and the +callset was filtered with VQSR. The 4,480 unrelated samples +released for download were stratified into four Han Chinese regional +groups (North, Central, South and Lingnan, which together cover roughly +the 27 administrative divisions reached by the pilot). Allele counts +and frequencies are reported overall and per region. See Cong et al. +2022 (in References below) for full sample-selection and pipeline +details. +

+

+The per-chromosome WGS sites VCFs (chr1-22) were downloaded from +https://wbbc.westlake.edu.cn/ +(URL pattern: WBBC.chr<N>.GRCh38.vcf.gz). We concatenated +the 22 files with bcftools concat, re-headered the result to +add the standard hg38 contig lines and proper INFO definitions, +dropped variants with a cohort allele count of zero (mostly multi-allelic splits +that are not observed in the WBBC samples; ~1.9% of rows), then sorted, +bgzipped and tabix-indexed the result. No coordinate liftover was +needed: the upstream files are already on GRCh38 with chr-prefixed +chromosomes. The pipeline is recorded in the +makeDoc +file of the track. +
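+The AC=0 filtering step described above can be sketched in plain Python. This is only an illustration, not the command recorded in the makeDoc: the helper names allele_counts and drop_unobserved are hypothetical, the sample records are invented, and the sketch assumes the cohort allele count is stored in the INFO column under the key AC.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def allele_counts(info: str) -> list[int]:
    """Return the integer values of the AC key from a VCF INFO string."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "AC" and value:
            return [int(v) for v in value.split(",")]
    return []  # AC not present in this record


def drop_unobserved(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and data lines with a nonzero AC."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
        counts = allele_counts(info)
        # Assumption: records without an AC key are kept rather than dropped.
        if not counts or any(c > 0 for c in counts):
            yield line


# Invented records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr21\t100\t.\tA\tG\t.\tPASS\tAC=3;AN=8960\n",
    "chr21\t200\t.\tC\tT\t.\tPASS\tAC=0;AN=8960\n",
]
kept = list(drop_unobserved(example))
```

+In this sketch the two header lines and the AC=3 record are kept, and the AC=0 record is dropped. The AN value of 8960 is simply 2 x 4,480, the maximum number of autosomal alleles in the released cohort.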

+ +

Caveats

+

+Only autosomes (chr1-22) are present; chrX/Y/M are not in the WBBC +download. Variants reported as AC=0 in the WBBC release (about 1.9% +of rows, mostly multi-allelic split sites that no WBBC individual +carries) have been removed from this track. +

+ +

Data Access

+

+The variant frequencies can be explored interactively using the +Table Browser or the +Data Integrator, and exported to +spreadsheet or tab-separated tables. From scripts, the data can be +accessed via our REST +API with track=wbbc. +
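+For script access, a request URL for the REST API can be assembled as in the following sketch. It assumes the public endpoint https://api.genome.ucsc.edu/getData/track with the genome, track, chrom, start and end parameters, and 0-based half-open coordinates. The function name wbbc_api_url and the example region are hypothetical, and the layout of the returned JSON is not shown because it is not described here.

```python
from urllib.parse import urlencode

# Assumed public UCSC REST API endpoint for track data.
API_BASE = "https://api.genome.ucsc.edu/getData/track"


def wbbc_api_url(chrom: str, start: int, end: int, genome: str = "hg38") -> str:
    """Build a getData/track URL for the wbbc track over one region."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    query = urlencode(
        {
            "genome": genome,
            "track": "wbbc",
            "chrom": chrom,
            "start": start,
            "end": end,
        }
    )
    return f"{API_BASE}?{query}"


# Arbitrary 1 kb example region on chr21.
url = wbbc_api_url("chr21", 25000000, 25001000)
```

+The resulting URL can be passed to any HTTP client. Keeping the URL construction separate from the network call makes the request easy to inspect before it is sent.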

+

+The VCF file is also available from +our +download server as wbbc.vcf.gz. Individual regions can be +extracted with tabix, for example +tabix http://hgdownload.soe.ucsc.edu/gbdb/hg38/varFreqs/wbbc/wbbc.vcf.gz chr21:1-100000000. +The original per-chromosome WBBC release is distributed at +https://wbbc.westlake.edu.cn/. +
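+The tabix call above can also be generated from a script. The sketch below only builds the argument list and does not run tabix. The helper name tabix_args is hypothetical, and the check against chr1-22 reflects the autosome-only content of this track.

```python
import shlex

VCF_URL = "http://hgdownload.soe.ucsc.edu/gbdb/hg38/varFreqs/wbbc/wbbc.vcf.gz"
AUTOSOMES = {f"chr{n}" for n in range(1, 23)}  # the track holds chr1-22 only


def tabix_args(chrom: str, start: int, end: int) -> list:
    """Return the tabix argument list for a 1-based inclusive region."""
    if chrom not in AUTOSOMES:
        raise ValueError(f"{chrom} is not in the track (chr1-22 only)")
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return ["tabix", VCF_URL, f"{chrom}:{start}-{end}"]


args = tabix_args("chr21", 1, 100000000)
command = shlex.join(args)  # shell-safe string form of the same command
```

+If tabix is installed, the list in args can be handed to subprocess.run(args, capture_output=True, text=True) to retrieve the matching VCF lines.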

+ +

Credits

+

+Thanks to the WBBC participants and to the Westlake University team +(Pei-Kuan Cong, Hou-Feng Zheng and colleagues) for making the pilot +sites-only VCFs publicly available. +

+ +

References

+ + +

+Cong PK, Bai WY, Li JC, Yang MY, Khederzadeh S, Gai SR, Li N, Liu YH, Yu SH, Zhao WW et al. + +Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. +Nat Commun. 2022 May 26;13(1):2939. +PMID: 35618720; PMC: PMC9135724 +

+