How to find and download whole‑genome sequencing files from the GDC: a step‑by‑step demo
Summary
GDC staff demonstrated how to use the cohort builder to filter for lung cancer whole‑genome cases, select Mutect2/Strelka/SvABA/Manta outputs, save a cohort, and add files to a cart. They recommended using a manifest with the GDC Data Transfer Tool for large controlled‑access downloads (dbGaP access required).
During the webinar, Bill, director of user services at the Genomic Data Commons (GDC), demonstrated how to build a cohort and prepare a download of annotated somatic VCFs and Manta BEDPE files for lung cancer cases.
He began with the cohort builder, showing an initial universe of 44,736 cases in the portal and filtering by tissue/organ (“lung”) to narrow it to 5,781 lung‑cancer cases. Bill then filtered the available data by experimental strategy (Whole Genome Sequencing) and by workflow type, selecting the GATK4 Mutect2 annotation, SvABA indel annotation, and Manta outputs. That three‑pipeline filter left 1,249 cases and roughly 8,000 files linked to those cases in the repository.
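The three portal filters described above can also be written as a query against the GDC API. The following is a minimal sketch, not the presenter's method: it assumes the files endpoint accepts the field names `cases.primary_site`, `experimental_strategy` and `analysis.workflow_type`, and the value strings passed in are placeholders that should be checked against the facet values shown in the portal. It only builds the query URL and does not contact the API.

```python
import json
from urllib.parse import urlencode

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def in_filter(field, values):
    """One GDC-style 'in' clause: the field must match any of the values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_wgs_filter(primary_sites, workflow_types):
    """Combine site, strategy and workflow clauses with a logical AND."""
    return {
        "op": "and",
        "content": [
            in_filter("cases.primary_site", primary_sites),
            in_filter("experimental_strategy", ["WGS"]),
            in_filter("analysis.workflow_type", workflow_types),
        ],
    }


def build_query_url(filters, size=100):
    """Encode the filter as URL parameters for a GET request."""
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,file_size,access",
        "format": "JSON",
        "size": str(size),
    }
    return FILES_ENDPOINT + "?" + urlencode(params)


# Placeholder values (assumptions): copy the exact strings from the portal facets.
filters = build_wgs_filter(
    primary_sites=["bronchus and lung"],
    workflow_types=["WORKFLOW_TYPE_1", "WORKFLOW_TYPE_2", "WORKFLOW_TYPE_3"],
)
url = build_query_url(filters)
print(url)
```

Keeping each clause in a small helper makes it easy to swap a workflow type in or out and compare the resulting file counts with what the portal reports.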
Bill demonstrated adding the selected files to the GDC cart. In the demo he added 1,673 files, then a further 733 Manta BEDPE files, for a total of about 2,400 files. He noted the practical challenges of large browser downloads and recommended exporting a manifest and using the GDC Data Transfer Tool, which resumes interrupted transfers, rather than downloading large controlled‑access files through a browser.
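Before starting a transfer of this size (1,673 + 733 = 2,406 files in the demo), it can help to check what a manifest contains. The sketch below assumes the manifest is a tab-separated file with the header columns `id`, `filename`, `md5`, `size` and `state`; the two rows are made-up sample data, and the file paths passed to `transfer_command` are hypothetical. The command it builds uses the Data Transfer Tool's `download` subcommand with `-m` (manifest) and `-t` (token file).

```python
import csv
import io


def read_manifest(handle):
    """Read a tab-separated manifest into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def summarize(rows):
    """Return (file count, total size in bytes) for the manifest rows."""
    return len(rows), sum(int(row["size"]) for row in rows)


def transfer_command(manifest_path, token_path):
    """Build the argument list for a Data Transfer Tool download."""
    return ["gdc-client", "download", "-m", manifest_path, "-t", token_path]


# Made-up sample manifest (not real GDC file IDs), built in memory.
SAMPLE = "\n".join([
    "\t".join(["id", "filename", "md5", "size", "state"]),
    "\t".join(["00000000-0000-0000-0000-000000000001", "example_a.vcf.gz",
               "0" * 32, "1500", "released"]),
    "\t".join(["00000000-0000-0000-0000-000000000002", "example_b.bedpe",
               "1" * 32, "2500", "released"]),
])

rows = read_manifest(io.StringIO(SAMPLE))
count, total_bytes = summarize(rows)
file_ids = [row["id"] for row in rows]
command = transfer_command("manifest.txt", "token.txt")
print(count, total_bytes)
print(" ".join(command))
```

The argument list can be passed to `subprocess.run` once the tool is installed and a valid token file is in place.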
Bill reminded attendees that many variant files are controlled‑access: users must obtain dbGaP authorization for the respective project (for example, dbGaP access to TCGA to retrieve TCGA controlled files). For developers, he outlined an API workflow: download a manifest, extract file IDs, and use the API for metadata or direct file retrieval if authorized.
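The API workflow outlined above (take file IDs from a manifest, then request metadata or the file itself) might look roughly like this. It is a sketch under assumptions: it uses the public `https://api.gdc.cancer.gov` base URL, a POST to the `files` endpoint with a JSON filter on `files.file_id`, and the `data` endpoint with an `X-Auth-Token` header for controlled-access files. The file ID and token are placeholders, and the code only constructs the requests without sending them.

```python
import json
from urllib.request import Request

API = "https://api.gdc.cancer.gov"


def metadata_request(file_ids, fields=("file_id", "file_name", "access")):
    """Build a POST request asking for metadata on the given file IDs."""
    payload = {
        "filters": {
            "op": "in",
            "content": {"field": "files.file_id", "value": list(file_ids)},
        },
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(len(file_ids)),
    }
    return Request(
        API + "/files",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def download_request(file_id, token=None):
    """Build a GET request for one file; the token is needed for controlled data."""
    headers = {"X-Auth-Token": token} if token else {}
    return Request(API + "/data/" + file_id, headers=headers)


# Placeholder ID and token (assumptions), e.g. taken from a manifest's id column.
ids = ["00000000-0000-0000-0000-000000000001"]
meta_req = metadata_request(ids)
data_req = download_request(ids[0], token="PLACEHOLDER_TOKEN")
print(meta_req.get_method(), meta_req.full_url)
print(data_req.get_method(), data_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; without dbGaP authorization and a valid token, a request for a controlled-access file should be expected to fail.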
For clinical/temporal cohort building (e.g., diagnosis vs. relapse), Bill pointed to specific clinical fields available in custom filters — 'days to sample procurement', 'index date', 'days to treatment start' and recurrence/follow‑up fields — and noted that while there is no single turnkey filter for every use case, custom queries can assemble cohort sets by timepoint.
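A custom query that separates timepoints could be assembled from range clauses on one of these day-offset fields. The sketch below is illustrative only: the field path `cases.samples.days_to_sample_procurement` is an assumed spelling of the 'days to sample procurement' field, and the day windows (0 to 30 days for a diagnosis-era sample, 180 days or later for a later sample) are hypothetical thresholds, not values from the webinar.

```python
def range_filter(field, low=None, high=None):
    """Build a GDC-style filter keeping values with low <= field <= high."""
    clauses = []
    if low is not None:
        clauses.append({"op": ">=", "content": {"field": field, "value": low}})
    if high is not None:
        clauses.append({"op": "<=", "content": {"field": field, "value": high}})
    if not clauses:
        raise ValueError("at least one of low or high is required")
    if len(clauses) == 1:
        return clauses[0]
    return {"op": "and", "content": clauses}


# Assumed field path and hypothetical windows, measured in days from the index date.
FIELD = "cases.samples.days_to_sample_procurement"
diagnosis_window = range_filter(FIELD, low=0, high=30)
later_window = range_filter(FIELD, low=180)

print(diagnosis_window)
print(later_window)
```

Each window yields its own cohort set, and the two can then be compared or intersected by case ID.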
Bill closed by pointing users to the GDC documentation (docs.gdc.cancer.gov) and the public GitHub repository (github.com/nci-gdc/gdc-workflow-overview) for workflow descriptions, CWLs and Docker files.
Sources: statements and demo actions by Bill (Director of User Services, GDC) during the webinar.

