The hallucination that no one catches in testing
Here is a failure mode that will not show up in any test suite built by a team that has not shipped to production with real patients.
A patient calls your AI voice agent. They say they need an appointment next Tuesday at 2 PM with Dr. Martinez. The agent responds: "I have scheduled your appointment for Tuesday at 2 PM with Dr. Martinez. Is there anything else I can help you with?" The patient says thank you and hangs up.
But the agent never called the scheduling tool. It never checked Dr. Martinez's availability. It never created the appointment record. It hallucinated the entire confirmation. The patient shows up on Tuesday to an appointment that does not exist.
In primary care, that is an inconvenience. In behavioral health — which is where our system runs — that patient might have been in crisis. They might have waited a week for that appointment. They might have rearranged their work schedule, found childcare, and driven across town. And when they arrive and the front desk has no record of their visit, the trust that was fragile to begin with shatters completely.
This happened to us. Not in a test environment. In production. With a real patient.
The instinct when this happens is to fix the prompt. Add an instruction: "Always call the scheduling tool before confirming an appointment." Increase the emphasis: "CRITICAL: You MUST use the booking function." Add guardrails: "Never tell a patient their appointment is confirmed unless the tool returned a success response."
None of this works reliably. And here is why.
Large language models generate text that is statistically likely given the conversation context. If the patient asked for an appointment and the conversation pattern suggests the next response should be a confirmation, the model will generate a confirmation — regardless of whether the underlying action was performed. The model does not know the difference between "I called the tool and it succeeded" and "the conversation would flow better if I said the appointment was confirmed." Both paths produce plausible text.
The fix is not in the prompt. The fix is structural. It lives in the architecture between the model and the user.
We built a verification layer that sits between the model's generated response and the patient. Before any response reaches the patient, the system checks: did the relevant tool actually get called? Did it return a success status? Does the response contain claims about actions that were not performed? If any check fails, the response is blocked and the model is re-prompted with explicit context about what actually happened.
This is not prompt engineering. This is software engineering applied to AI behavior. The model is a component in a larger system. When you treat it as the entire system, hallucinations are unpredictable failures. When you treat it as one layer with verification above and below, hallucinations are caught and corrected before they reach the patient.
We also learned something deeper about how to make AI behavior work reliably. Long prompts with extensive instructions do not produce reliable behavior. They produce models that have too many competing priorities and resolve conflicts unpredictably. The counterintuitive solution: shorter prompts with better tools. Move knowledge out of the static prompt and into dynamic tool responses. Let the model ask for information when it needs it rather than carrying everything in its context window.
When done correctly, this approach reduces prompt token usage dramatically while improving reliability. The model has fewer instructions to conflict with each other. The tools provide authoritative answers that the model can relay without fabrication. The verification layer catches anything that slips through.
This architecture — minimal prompts, intelligent tools, verification layers — is what makes anthropomorphic AI behavior actually work in production. Not because the model is smarter. Because the system around the model is engineered for the ways models fail.
Every healthcare AI company will encounter this hallucination pattern. The question is whether they encounter it in testing — where it is a data point — or in production — where it is a patient who showed up for an appointment that does not exist.
This is one piece of a larger framework we built and operate in production. The full picture — and how it applies to your business — is in the playbook.
We specialize in healthcare because it is the hardest vertical — strict HIPAA regulation, PHI handling, BAA chains, and zero tolerance for failure. If we can build it for healthcare, we can build it for any industry. We work across verticals.