GDC overhauls whole‑genome variant‑calling pipelines; Mutect2 data issue flagged
Summary
The NCI Genomic Data Commons is replacing older Sanger WGS workflows with a multi‑caller GDC set (Mutect2, VarScan2, Strelka2, SvABA, Manta, GATK4 CNV, ABSOLUTE) to enable ensemble calls and reduce compute. Presenters disclosed that some Mutect2 variants on chromosomes 10 and 20 are missing and said a corrected release is forthcoming.
Bill, director of user services at the Genomic Data Commons (GDC), and Zanyu, the GDC director of bioinformatics, described a major update to the GDC’s whole‑genome sequencing (WGS) variant‑calling pipelines during a public webinar.
The GDC is moving away from its older Sanger workflow set (CaVEMan, Pindel, ASCAT‑NGS, BRASS) and rolling out a new GDC workflow family. The new set includes multiple somatic single‑nucleotide/indel callers (GATK4 Mutect2, VarScan2, Strelka2 and indel calls from SvABA), two structural‑variant callers (Manta and SvABA), and CNV inference via GATK4 CNV with ABSOLUTE. Bill said the change increases the number of callers for simple somatic mutations (SSMs) from two to four and for structural variants from one to two, which enables ensemble calling and broader variant support.
Zanyu said the multi‑caller approach is intended to improve confidence and coverage across variant types and to let users select intersections for high‑confidence sets or unions for sensitivity. He also said adopting more modern tools substantially reduced compute requirements and made it feasible to process the entire GDC WGS dataset on a practical timescale.
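The intersection and union choices described above can be illustrated with a minimal Python sketch. It is not the GDC's ensemble method: the function names, the `(chrom, pos, ref, alt)` matching key, and the toy records are all assumptions made for illustration, and a real comparison would also need to normalize indel representation and respect each caller's FILTER column.

```python
from collections import Counter


def load_variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Assumption: exact-match keys are good enough for illustration.
    Multi-allelic records are split into one key per ALT allele.
    """
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def ensemble(callsets, min_callers):
    """Return variants reported by at least `min_callers` callers."""
    counts = Counter()
    for keys in callsets.values():
        counts.update(keys)  # each caller contributes one vote per variant
    return {variant for variant, n in counts.items() if n >= min_callers}


# Toy, made-up records standing in for per-caller VCF contents.
callsets = {
    "mutect2": load_variant_keys(["chr1\t100\t.\tA\tT", "chr2\t200\t.\tG\tC"]),
    "varscan2": load_variant_keys(["chr1\t100\t.\tA\tT"]),
    "strelka2": load_variant_keys(["chr1\t100\t.\tA\tT", "chr2\t200\t.\tG\tC,T"]),
}

union = ensemble(callsets, min_callers=1)                     # most sensitive
intersection = ensemble(callsets, min_callers=len(callsets))  # highest confidence
majority = ensemble(callsets, min_callers=2)                  # a middle ground
```

Setting `min_callers` to 1 gives the union, setting it to the number of callers gives the intersection, and values in between trade sensitivity against confidence.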
Zanyu gave counts and scope: the portal currently contains more than 1,500,000 cases with aligned WGS BAM files and, he said, “over 5,000 of those cases have Sanger variant calls.” He added that the number of cases with variant calls released from the new pipelines varies by tool (roughly 1,000 to 14,000 cases depending on the pipeline) and that the GDC will continue releasing completed pipeline outputs in subsequent data releases.
On data integrity, Zanyu disclosed an active issue with some Mutect2‑derived VCFs: “we recently identified a data integrity issue. Some of the variants on chromosome 10 and the chromosome 20 are missing from these VCFs.” He said the team is working on a fix and expects corrected data in an upcoming release; users who have already downloaded affected files were advised to wait for the fix or to watch the release notes.
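Users who want to inspect files they have already downloaded could start with a per‑chromosome record count. The Python sketch below is only an illustration, not a GDC tool: the function names are hypothetical, and the `chr`‑prefixed chromosome names are an assumption about the file's contig naming. Because the presenters said only some variants are missing, a zero count catches just the most extreme case; a partial loss would need a comparison against another caller's output or against the corrected release.

```python
from collections import Counter

# Assumption: contigs use "chr"-prefixed names (chr1..chr22, chrX).
EXPECTED_CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX"]


def count_records_by_chrom(vcf_lines):
    """Count VCF data records per chromosome, ignoring header lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom = line.split("\t", 1)[0]  # first column is CHROM
        counts[chrom] += 1
    return counts


def chroms_without_records(counts, expected=EXPECTED_CHROMS):
    """List expected chromosomes that have no records at all."""
    return [chrom for chrom in expected if counts.get(chrom, 0) == 0]


# Toy, made-up lines standing in for a downloaded VCF.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tT",
    "chr1\t250\t.\tG\tC",
    "chr10\t42\t.\tC\tG",
]
example_counts = count_records_by_chrom(example_lines)
empty = chroms_without_records(example_counts, expected=["chr1", "chr10", "chr20"])
```

For a real compressed file, the lines can be read with `gzip.open(path, "rt")` and passed directly to `count_records_by_chrom`.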
For structural variants, Zanyu described a workflow that feeds Manta’s large candidate VCF into Strelka2 joint calling, which filters the candidates and produces a higher‑quality Strelka2 output. SvABA both calls SVs and produces an indel VCF that is incorporated into SSM ensembles. For CNVs, GATK4 CNV outputs serve as inputs to ABSOLUTE, which produces multiple model PDFs; the GDC’s automatic extraction picks a default model, and curated ABSOLUTE outputs will be released later and flagged as preferred when available.
Zanyu also highlighted “hidden gem” outputs available in the portal: per‑BAM metrics (coverage, insert size, multi‑mapping reads), MSI status and MSI scores (MSIsensor output) attached to tumor BAMs, and tumor purity/ploidy estimates kept with segmentation files rather than summarized at the sample level to preserve model‑specific differences.
Bill advised users that the GDC will reprocess Sanger‑derived results with the new GDC workflows over time; Sanger processing will cease for new datasets. He pointed attendees to the GDC documentation and the public GitHub repository for workflow CWLs and Docker files.
What happens next: the presenters said completed pipeline outputs will be released as they are processed in upcoming GDC data releases; attendees were told to subscribe to the GDC listserv for release notifications. Zanyu said the team aims to publish corrected Mutect2 outputs soon and to expand ensemble releases until comprehensive variant sets are available for WGS cases in the portal.
Sources: statements and quotes from Bill (director of user services, GDC) and Zanyu (GDC director of bioinformatics) during the GDC WGS webinar.

