Design choices and safeguards behind Lone Star Legal Aid’s chatbot pilot
Summary
Developers described the retrieval methods, metadata-rich chunking, mandatory citations, strict refusal behavior, and memory policies used to reduce hallucinations and protect client privacy in Juris, LSAS and Navi.
At a San Antonio presentation, Lone Star Legal Aid’s technical lead, Suru, described the engineering and safety choices used to reduce AI hallucination and protect privacy across three chatbots.
The team uses retrieval‑augmented generation (RAG) rather than relying on the model's training data: documents are ingested, segmented and embedded; at query time the system retrieves the most relevant chunks, and the LLM answers only from those sources. Suru said they shifted from size‑based chunking to chapter/section chunking and attached metadata (chapter, section and subheaders) to each chunk so the model can distinguish similar phrases that appear in different contexts. "Once we did this, our hallucinations were, like, cut drastically to almost nothing," he said.
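Chapter/section chunking with attached metadata might look like the sketch below. The section-header pattern, the sample document, and the function name are illustrative assumptions, not the team's actual pipeline code:

```python
import re

def chunk_by_section(document: str, chapter: str) -> list[dict]:
    """Split a document on section headers and attach chapter/section
    metadata to each chunk (hypothetical header format assumed)."""
    # Split before each line that begins "Section N." (assumed convention)
    parts = re.split(r"(?m)^(?=Section \d+\.)", document)
    chunks = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        header = part.splitlines()[0]  # first line serves as the section label
        chunks.append({
            "text": part,
            "metadata": {"chapter": chapter, "section": header},
        })
    return chunks

# Hypothetical sample document
doc = (
    "Section 1. Eligibility\nApplicants must reside in the service area.\n"
    "Section 2. Fees\nNo fees are charged for intake."
)
chunks = chunk_by_section(doc, chapter="Intake Procedures")
```

Each chunk now carries its chapter and section labels, so an embedding store such as ChromaDB can index them as metadata alongside the text.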
They tested four retrieval pipelines—Michelangelo (semantic search), Rafael (semantic + keyword), Leonardo (more context retention) and Donatello (uses an LLM to filter and return the best sources). Each pipeline has different speed, cost and accuracy tradeoffs; Michelangelo showed about a 50% user approval rating in early tests. The team logs time‑to‑answer, cost per question and number of sources used, and asks SME testers for thumbs up/down and reasons (wrong answer, hallucination, incorrect source, performance issue).
Prompt design focuses on enforceable behavior rather than style: they set a role primer ("You are Juris, a legal research assistant"), repeat constraints within prompts, hard‑code refusal behavior to avoid fabrication, and force standardized response formatting with citations and excerpted source text beneath answers to aid audits and staff verification.
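A prompt assembled along those lines, with role primer, a refusal constraint repeated within the prompt, and numbered sources for citation, might look like this sketch (the exact wording is hypothetical, not the team's production prompt):

```python
REFUSAL = ("If the answer is not supported by the provided sources, "
           "say you cannot answer. Never invent citations.")

def build_prompt(question: str, sources: list[dict]) -> str:
    """Assemble a prompt with a role primer, a repeated refusal constraint,
    and numbered sources the model must cite (illustrative format)."""
    source_block = "\n\n".join(
        f"[{i + 1}] ({s['metadata']['section']}) {s['text']}"
        for i, s in enumerate(sources)
    )
    return (
        "You are Juris, a legal research assistant.\n"   # role primer
        f"{REFUSAL}\n"
        "Answer ONLY from the numbered sources below and cite them as [n].\n"
        f"{REFUSAL}\n"                                   # constraint repeated
        f"Sources:\n{source_block}\n\n"
        f"Question: {question}\n"
        "After your answer, list each cited source number with its excerpt."
    )
```

Requiring excerpted source text beneath every answer means staff can verify a citation without reopening the original document.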
Memory and access choices are deliberate: internal bots use enterprise SSO and have no memory by design to reduce case bleed; Navi, the external bot, keeps session‑only memory to allow follow‑ups while the browser tab is open and explicitly does not retain information after the session ends. Logging tracks usage and performance but excludes personally identifying information; data governance includes least‑privileged access, end‑to‑end encryption through Azure protocols and automated purging at the end of defined data lifecycles.
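Session-only memory of the kind Navi uses can be sketched as an in-process store whose contents are simply dropped when the session ends. The class and method names here are illustrative, not Navi's actual code:

```python
class SessionMemory:
    """In-memory, session-scoped chat history: nothing persists past the
    session, so no conversation data survives the browser tab closing."""

    def __init__(self) -> None:
        self._sessions: dict[str, list[tuple[str, str]]] = {}

    def append(self, session_id: str, role: str, text: str) -> None:
        """Record one turn ((role, text)) for a session."""
        self._sessions.setdefault(session_id, []).append((role, text))

    def history(self, session_id: str) -> list[tuple[str, str]]:
        """Return the turns so far; used to support follow-up questions."""
        return list(self._sessions.get(session_id, []))

    def end_session(self, session_id: str) -> None:
        """Discard all turns, e.g. when the tab closes or the session expires."""
        self._sessions.pop(session_id, None)
```

Internal bots with "no memory by design" would skip this store entirely: each question arrives with an empty history.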
On tooling, Suru said the stack includes OpenAI models through Azure’s OpenAI service, Python, LangChain and ChromaDB for vector storage; they moved parts of their pipeline from GitHub to Azure for size/transfer reasons. For embeddings they currently use OpenAI embeddings via Azure; keyword search is handled with a BM25 index.
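BM25 is a classic keyword-ranking function built from term frequency, inverse document frequency, and a length normalization. A self-contained scoring sketch (not the team's actual index, and the sample documents are hypothetical) looks like this:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)               # term frequency in this document
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            )
        scores.append(score)
    return scores

# Hypothetical corpus: the eviction document should rank first
docs = ["tenant rights in texas",
        "food stamp eligibility rules",
        "texas tenant eviction notice"]
scores = bm25_scores("tenant eviction", docs)
```

In a hybrid pipeline like Rafael, scores of this kind would be combined with semantic-similarity scores from the embedding index.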
During Q&A the team acknowledged open questions: final retrieval method selection, exact per‑question costs (they estimated $0.06–$0.12 per question in testing), and spam protection for an eventual public rollout. The speakers noted these operational details remain under evaluation before broad public deployment.
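Per-question cost ultimately reduces to token counts multiplied by the provider's per-token prices. A minimal estimator is sketched below; the token counts and prices are placeholder assumptions, not Azure OpenAI's actual rates or the team's measurements:

```python
def cost_per_question(prompt_tokens: int, completion_tokens: int,
                      price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate USD cost of one question from token counts and
    per-1K-token prices (all inputs are hypothetical placeholders)."""
    return (prompt_tokens / 1000 * price_in_per_1k
            + completion_tokens / 1000 * price_out_per_1k)

# Example with assumed numbers: a RAG prompt is large because retrieved
# source chunks dominate the input tokens.
estimate = cost_per_question(prompt_tokens=2000, completion_tokens=400,
                             price_in_per_1k=0.03, price_out_per_1k=0.06)
```

Under these assumed figures the estimate lands at $0.084, within the $0.06–$0.12 range the team reported, which illustrates why prompt size (i.e., how many source chunks are retrieved) is a major cost lever.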

