Legal-tech panel lays out practical testing methods for AI used in legal help

AI & Law Conference — Testing Workshop Panel · January 30, 2026

AI-Generated Content: All content on this page was generated by AI to highlight key points from the meeting. For complete details and context, we recommend watching the full video.

Summary

Panelists from Suffolk, Stanford, and People's Law School presented two case studies (the Fetch classifier and Beagle Plus) and recommended simple safety/value rubrics, hybrid LLM-as-judge review workflows, monitoring strategies, and precautions against test-dataset contamination.

Quentin Stenhouse, a legal-technology specialist, and a panel from Stanford Law School and People’s Law School outlined practical methods for testing AI systems designed to help people with legal problems, stressing clear goals, representative test data, and ongoing monitoring.

The session highlighted two case studies. Stenhouse described the Fetch legal-issue classifier, built to match people to attorneys in an Oregon referral system. Fetch uses an ensemble of models and up to three follow-up questions to narrow a caller's issue. Stenhouse said the team evaluated the system on roughly 400 labeled examples drawn from real intake data and compared it against baselines: keyword matching scored about 54% accuracy, and an older ML model, "Spot," about 60%. That testing led the team to adopt a voting/ensemble approach that delivered stronger accuracy at lower cost and could be rerun quickly (a full pass over the examples takes about half an hour).
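The voting/ensemble evaluation described above can be sketched in a few lines. The model functions, labels, and examples below are illustrative assumptions, not the actual Fetch implementation: each stand-in classifier votes on a label, the majority wins, and the ensemble is scored against labeled intake examples.

```python
# Hypothetical sketch of a voting-ensemble evaluation over labeled examples.
from collections import Counter

def majority_vote(predictions):
    """Return the label most models agreed on (ties broken by first seen)."""
    return Counter(predictions).most_common(1)[0][0]

def accuracy(model_fns, examples):
    """Score an ensemble of classifiers against (text, gold_label) pairs."""
    correct = 0
    for text, gold in examples:
        votes = [fn(text) for fn in model_fns]
        if majority_vote(votes) == gold:
            correct += 1
    return correct / len(examples)

# Stand-in classifiers; a real run would call keyword matching, Spot,
# and one or more LLM-backed classifiers.
keyword = lambda t: "housing" if "evict" in t else "other"
model_a = lambda t: "housing" if "landlord" in t or "evict" in t else "other"
model_b = lambda t: "housing"

examples = [
    ("my landlord wants to evict me", "housing"),
    ("I was in a car accident", "other"),
]
print(accuracy([keyword, model_a, model_b], examples))
```

Because the whole run is just a loop over a few hundred examples, re-running it after a prompt or model change is cheap, which matches the panel's point about quick re-evaluation.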

Drew Jackson of People’s Law School described Beagle Plus, a retrieval-augmented chatbot launched in February 2024 that has handled more than 15,000 conversations. Jackson said 82% of respondents rated Beagle Plus helpful overall and that helpfulness rose to about 90% in a subset of roughly 1,000 interactions after upgrading to a frontier model (GPT-4.1). Jackson said the project uses Pinecone for vector search, Contentful as a CMS, Langfuse for monitoring and a small content-review team (about 0.4 full-time-equivalent) to review flagged conversations.
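The helpfulness figures above come from aggregating user feedback per model version. A minimal sketch of that aggregation, assuming a simplified log schema (the `model` and `helpful` field names are illustrative, not the actual Langfuse or Beagle Plus schema):

```python
# Hedged sketch: helpfulness rate per model version from feedback logs.
from collections import defaultdict

def helpfulness_rates(feedback):
    """feedback: list of dicts with 'model' and 'helpful' (bool) keys."""
    totals = defaultdict(lambda: [0, 0])  # model -> [helpful_count, total]
    for row in feedback:
        totals[row["model"]][1] += 1
        if row["helpful"]:
            totals[row["model"]][0] += 1
    return {m: h / n for m, (h, n) in totals.items()}

logs = [
    {"model": "gpt-4.1", "helpful": True},
    {"model": "gpt-4.1", "helpful": True},
    {"model": "older", "helpful": True},
    {"model": "older", "helpful": False},
]
print(helpfulness_rates(logs))
```

Comparing rates this way, version by version, is how a team can detect that a frontier-model upgrade moved helpfulness from roughly 82% to 90%, as Jackson reported.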

Both presenters emphasized two evaluation dimensions they use consistently: safety (accurate legal information; warnings about practical or criminal risks) and value (jurisdiction-specific, actionable guidance). They described producing a small, curated testing dataset (for example, a 42-question “question set” used during pre-launch testing), crafting ideal responses, and having legal content developers score system outputs on safety and value. To scale review, they explained a hybrid workflow: an LLM-as-judge automates scoring against rubric criteria and flags items that need human attention, with humans confirming or overruling automated judgments.
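The hybrid workflow above can be sketched as a judge that scores each output on safety and value and flags low scores for human review. The judge here is a stub with made-up heuristics; in the workflow the panel described, it would be an LLM prompted with the rubric, the curated ideal response, and the system output.

```python
# Minimal sketch of a hybrid LLM-as-judge triage loop. The scoring
# heuristics are placeholder assumptions standing in for an LLM call.
from dataclasses import dataclass

@dataclass
class Judgment:
    safety: int        # rubric score, e.g. 1-5
    value: int         # rubric score, e.g. 1-5
    needs_human: bool  # flagged for human confirmation or overrule

def judge(question, answer, ideal):
    """Stub judge: a real system would prompt an LLM with the rubric
    criteria, the ideal response, and the system output."""
    safety = 5 if "this is not legal advice" in answer.lower() else 3
    value = 5 if len(answer) > 40 else 2
    return Judgment(safety, value, needs_human=min(safety, value) < 4)

def triage(test_set, answers):
    """Return indices of outputs a human reviewer should look at."""
    flagged = []
    for i, ((question, ideal), answer) in enumerate(zip(test_set, answers)):
        if judge(question, answer, ideal).needs_human:
            flagged.append(i)
    return flagged

flagged = triage(
    [("Can my landlord evict me without notice?", "ideal response text")],
    ["See a lawyer."],
)
print(flagged)
```

The design choice is that automation handles the bulk scoring while humans only confirm or overrule the flagged minority, which is what makes a 42-question pre-launch set or a larger post-launch sample reviewable by a small content team.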

Panelists warned about common pitfalls and offered operational guidance. They advised against over-testing every possible model–prompt combination, recommending instead a pragmatic approach that focuses formal evaluation on a few prioritized model configurations. For post-deployment monitoring, they recommended automated scoring to triage reviews and sampling or severity-based human audits rather than exhaustive manual review of all conversations.
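The severity-based audit strategy can be sketched as: review everything the automated scorer marks severe, plus a random sample of the rest. The threshold, sample rate, and `severity` field are illustrative assumptions, not values the panel specified.

```python
# Sketch of severity-based triage for post-deployment monitoring.
import random

def select_for_audit(conversations, severity_cutoff=0.8,
                     sample_rate=0.05, seed=0):
    """conversations: dicts with an automated 'severity' score in [0, 1].
    Returns all severe conversations plus a random sample of the rest."""
    rng = random.Random(seed)
    severe = [c for c in conversations if c["severity"] >= severity_cutoff]
    rest = [c for c in conversations if c["severity"] < severity_cutoff]
    sampled = [c for c in rest if rng.random() < sample_rate]
    return severe + sampled

queue = select_for_audit([{"severity": 0.95}, {"severity": 0.2}])
```

This keeps the human workload proportional to risk rather than to traffic volume, in line with the panel's advice against exhaustive manual review of every conversation.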

In audience Q&A, speakers addressed dataset-contamination concerns: if a public test dataset is later ingested into a foundation model's training corpus, it becomes less reliable as a holdout. Stenhouse and Jackson recommended practical mitigations: keep primary labeled test sets private when possible, use larger secondary holdout samples, generate synthetic variants, and share labeled examples with trusted partners in other jurisdictions to increase test breadth without making core test labels public. They also discussed methods for evaluating multi-turn conversations, including AI-generated "digital twin" conversations, constrained simulation with known facts, and more labor-intensive human evaluation.
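Two of the mitigations mentioned, splitting labeled data into a small private primary set plus a larger secondary holdout, and generating synthetic variants of test questions, can be sketched as below. The template-based paraphrases are a naive assumption; a real pipeline would more likely paraphrase with an LLM and re-verify the labels.

```python
# Sketch of contamination mitigations: private primary / secondary
# holdout split, and naive synthetic variants of test questions.
import random

def split_holdouts(labeled, primary_size, seed=0):
    """Shuffle labeled examples into a small private primary set and a
    larger secondary holdout (both kept out of any public release)."""
    rng = random.Random(seed)
    pool = list(labeled)
    rng.shuffle(pool)
    return pool[:primary_size], pool[primary_size:]

def synthetic_variants(question):
    """Naive template-based variants of a test question."""
    return [
        f"Quick question: {question}",
        f"{question} What should I do?",
    ]

primary, secondary = split_holdouts(list(range(10)), primary_size=3)
```

Keeping the primary set private preserves it as a clean holdout, while synthetic variants and shared cross-jurisdiction examples widen coverage without exposing the core labels.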

The panel closed by urging organizations to invest in representative test data and clear rubrics, to adopt hybrid automated/human monitoring, and to participate in shared data resources and workshops to raise testing practices across the public legal-help community. Presenters invited attendees to a follow-up testing workshop and asked participants to share labeled data or questions to strengthen shared test sets.