USBE outlines synthetic‑data pilot to let researchers test code without exposing student records
Summary
USBE presenters described a synthetic data effort using machine‑learning models and differential‑privacy checks to generate data that resemble LEA submissions; the goal is to enable external research and testing while protecting student privacy, with University of Utah code review underway.
A USBE presenter described an internal synthetic data project that produces artificial datasets resembling real LEA submissions so researchers and vendors can develop and test code without access to identifiable student records.
The presenter said the workflow trains machine‑learning models on real data to learn its structure and then generates synthetic datasets that mimic the original distributions without containing any real student entries. “No real student data goes into the system,” the presenter said, explaining that the process includes model selection and checks for disclosure risk. The presentation cited differential‑privacy techniques (described in the meeting as a mathematical approach to privacy) and noted that USBE is working with the University of Utah to review the code.
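For readers unfamiliar with the approach, the general idea can be illustrated with a minimal sketch: fit a noisy distribution over each column of a dataset and sample new records from it. This is not USBE's actual pipeline, which was not shown in the meeting; the column names, the privacy parameter, and the simple per‑column (marginal) method below are all hypothetical, chosen only to show how a differential‑privacy‑style noise step can sit inside synthetic‑data generation.

```python
# Illustrative sketch only: a toy marginal-based synthesizer with a Laplace
# mechanism on counts. This is NOT USBE's pipeline; all column names and
# parameters are hypothetical, and no real student data is involved.
import numpy as np
import pandas as pd

def synthesize_column(values: pd.Series, n_out: int, epsilon: float, rng) -> np.ndarray:
    """Fit a noisy categorical marginal and sample synthetic values from it."""
    counts = values.value_counts()
    # Laplace mechanism: each record changes a count histogram by at most 1,
    # so noise scaled to 1/epsilon gives epsilon-differential privacy per column.
    noisy = counts.to_numpy(dtype=float) + rng.laplace(0.0, 1.0 / epsilon, size=len(counts))
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum()
    return rng.choice(counts.index.to_numpy(), size=n_out, p=probs)

rng = np.random.default_rng(0)
# Hypothetical stand-in for an LEA-style submission (randomly generated here).
real = pd.DataFrame({
    "grade": rng.choice(["9", "10", "11", "12"], size=500),
    "attendance_band": rng.choice(["high", "medium", "low"], size=500, p=[0.7, 0.2, 0.1]),
})
# The released product would contain only rows sampled from the noisy model.
synthetic = pd.DataFrame({
    col: synthesize_column(real[col], n_out=500, epsilon=1.0, rng=rng)
    for col in real.columns
})
print(synthetic.head())
```

A real system of the kind described would model relationships across columns rather than each column independently, and would add formal disclosure‑risk checks; the sketch only shows where the noise and sampling steps fit.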
Why USBE is pursuing synthetic data: presenters said it enables faster turnaround on research and developer testing (for example, testing a student information system or a machine‑learning model) without repeated board approvals or data‑sharing agreements. The synthetic datasets would let external parties run their code and return only outputs or code, never raw data.
Limits and caution: presenters described the effort as still early‑stage. They said the synthetic data are meant for development and algorithm testing, not for producing authoritative analytic estimates about the student population, and emphasized that documentation and validation steps will accompany any release.
Next steps: USBE staff said the synthetic dataset will be reviewed by partner universities, and availability will be determined as policies and validation progress. Attendees asked whether the product would be used to train third‑party large models; presenters said it could be but that future usage and any commercial arrangements are still being evaluated.

