FDA researchers present pilot LLM tools to speed regulatory review; internal tests show promising accuracy
Summary
At an FDA Grand Rounds, Dr. Hsu of NCTR outlined four pilot applications of large language models for regulatory review — literature screening, an Ask‑FDA documents chatbot, labeling AE detection, and duplicate AE identification — reporting internal benchmarks that suggest notable efficiency gains.
Dr. Hsu, Branch Chief of Research to Review and Return (R2R) at the Division of Bioinformatics and Biostatistics within the FDA's National Center for Toxicological Research (NCTR), presented internal pilots using large language models to support regulatory review at an FDA Grand Rounds session.
Dr. Hsu framed the work under the agency's "AI for Talks" initiative and said the team aims to apply advanced AI to improve the efficiency and accuracy of information retrieval and safety assessment. "We believe that AI has great potential to step in and significantly accelerate and enhance the workflow," he said.
He described four use cases, each demonstrated with an internal tool: (1) literature screening to prioritize abstracts and full texts for reviewers; (2) an "Ask FDA Documents" chatbot that queries uploaded guidance and review files and returns source links; (3) automated recognition of adverse events (AEs), paired with side‑by‑side labeling comparisons that highlight textual changes; and (4) a three‑step pipeline to detect duplicate adverse‑event reports.
On literature screening, Dr. Hsu said the team benchmarked models on internal datasets and achieved high sensitivity with an acceptable specificity of about 70%, adding that the screening tools could "improve the overall efficiency by 3 times" by filtering out irrelevant abstracts and surfacing likely relevant articles for human review.
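To make that efficiency claim concrete, here is a minimal sketch (not FDA code) of how high sensitivity combined with roughly 70% specificity translates into a smaller reviewer workload; the prevalence and sensitivity values are illustrative assumptions, not figures from the talk.

```python
# Hypothetical sketch: how a high-sensitivity screener shrinks the pile of
# abstracts a human must read. All numbers below are illustrative assumptions.

def workload_speedup(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Return the implied speedup if reviewers read only flagged abstracts."""
    # Fraction of the corpus the screener flags: true positives + false positives.
    flagged = prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
    return 1.0 / flagged

# Example: 10% of abstracts truly relevant, 95% sensitivity, 70% specificity.
print(f"{workload_speedup(0.10, 0.95, 0.70):.1f}x fewer abstracts to read")
# -> roughly 2.7x, in line with the reported ~3x efficiency gain
```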
For the Ask‑FDA chatbot, Dr. Hsu described a vector‑database approach: user queries are embedded, relevant document chunks are retrieved, prompts are applied to those chunks, and answers are aggregated and returned with hyperlinks to the source PDF pages. He demonstrated an example where the system returned a page number and a clickable link to the referenced guidance.
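The flow he described maps onto a standard retrieval‑augmented generation pattern. Below is a hedged sketch of that pattern, assuming hypothetical embed() and generate() endpoints and a chunk store carrying PDF and page metadata; none of these names or interfaces come from the presentation.

```python
# Minimal RAG sketch of the Ask-FDA Documents pattern described above.
# embed() and generate() are dummy stand-ins for whatever embedding model
# and LLM endpoint the team actually uses; chunk metadata is assumed.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Deterministic dummy vector so the sketch runs end to end.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)

def generate(prompt: str) -> str:
    return "[LLM answer would appear here]"  # placeholder for the model call

def ask(query: str, chunks: list[dict], top_k: int = 4) -> str:
    q = embed(query)
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Retrieve the chunks most similar to the embedded query.
    top = sorted(chunks, key=lambda c: cos(q, c["vec"]), reverse=True)[:top_k]
    context = "\n\n".join(f'[{c["pdf"]} p.{c["page"]}] {c["text"]}' for c in top)
    answer = generate(f"Answer using only these excerpts, cite sources:\n{context}\n\nQ: {query}")
    # Return the aggregated answer with hyperlinks back to the source pages.
    links = ", ".join(f'{c["pdf"]}#page={c["page"]}' for c in top)
    return f"{answer}\nSources: {links}"
```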
On labeling analysis, Dr. Hsu showed side‑by‑side visualizations that highlight words added, removed or changed between older and newer label versions and noted that models (including BERT‑style models) can tag newly recognized AEs for reviewer attention while preserving links to context.
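A minimal version of that side‑by‑side comparison can be sketched with Python's standard difflib; the FDA tool's actual comparison engine and its BERT tagging step are not public, so this is illustrative only.

```python
# Illustrative label-diff sketch: difflib stands in for the comparison
# engine behind the side-by-side visualization described above.
import difflib

old = "Common adverse reactions: headache, nausea."
new = "Common adverse reactions: headache, nausea, dizziness."

sm = difflib.SequenceMatcher(a=old.split(), b=new.split())
for op, i1, i2, j1, j2 in sm.get_opcodes():
    if op in ("delete", "replace"):
        print("removed:", " ".join(old.split()[i1:i2]))
    if op in ("insert", "replace"):
        print("added:  ", " ".join(new.split()[j1:j2]))
# A BERT-style NER model could then tag "dizziness" as a newly added AE term
# and link it back to its surrounding context for reviewer attention.
```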
For duplicate adverse‑event detection, Dr. Hsu outlined a filter→decider→validator workflow designed to cut the number of pairwise comparisons and surface candidate duplicates for human adjudication. He said the team tested a benchmark set of 2,358 real duplicate pairs (evaluating both A vs. B and B vs. A) covering roughly 900 unique cases, and reported that the pipeline recovered all known duplicates in over 90% of cases, detected at least one duplicate in about another 8%, and failed to find the ground‑truth duplicates in under 2%.
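A hedged sketch of such a three‑stage pipeline appears below; the field names, filter heuristics, and the llm_judge() stand‑in are assumptions for illustration, not details from the talk.

```python
# Illustrative filter -> decider -> validator sketch. Thresholds, record
# fields, and the LLM judge are hypothetical, not the FDA's actual logic.
from itertools import combinations

def cheap_filter(a: dict, b: dict) -> bool:
    # Stage 1 (filter): fast structural checks cut the O(n^2) pair space.
    return a["drug"] == b["drug"] and abs(a["age"] - b["age"]) <= 2

def llm_judge(a: dict, b: dict) -> bool:
    # Stage 2 (decider) stand-in: an LLM would compare the narratives in
    # both orderings (A vs. B and B vs. A, as in the benchmark) and vote.
    return a["narrative"] == b["narrative"]  # placeholder logic

def candidate_duplicates(reports: list[dict]) -> list[tuple]:
    candidates = []
    for a, b in combinations(reports, 2):
        if cheap_filter(a, b) and llm_judge(a, b) and llm_judge(b, a):
            # Stage 3 (validator): surfaced pairs go to human adjudication.
            candidates.append((a["id"], b["id"]))
    return candidates
```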
Dr. Hsu emphasized the project is task‑specific and depends on infrastructure (GPUs/cloud), high‑quality domain datasets annotated with SME input, and close collaboration between AI scientists and subject matter experts to design prompts and evaluate outputs. He said the team's approach favors locally hosted models for security and future customization but noted the platform can integrate cloud models via APIs if desired.
Dr. Hsu concluded by listing five major components for adopting generative AI at the agency — infrastructure, datasets, tooling, SMEs, and evaluation — and reiterated that curated, diverse task‑specific datasets and SME review are essential to deploy these tools safely and effectively. The pilot tools were made available to the agency's internal testers, and the team invited colleagues to request demos or sign up for accounts for further evaluation.
