Microsoft is peering into the immediate future of medical diagnostics — one that will determine how much you pay for tests, how long you spend in line, and how quickly you hear the correct diagnosis. Microsoft AI assembled SDBench — a "testing ground" of 304 real clinical case discussions, but with an important distinction: as in real life, every step costs time and money. The metric is not guessing the disease, but $ per correct Dx — how many dollars are spent to arrive at the correct diagnosis when the doctor (or AI) sequentially: asks questions → orders tests → draws a conclusion. On top of the base models, the authors deployed MAI-DxO — an "orchestrator," i.e., a set of coordinated AI roles: one generates hypotheses, another selects tests, a third pushes back on expensive procedures, a fourth monitors costs, and a fifth follows a checklist. In this setup, AI becomes both more accurate and more affordable.
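The budget-aware loop described above can be sketched in a few lines of Python. This is a toy illustration only: the action names, the "information gain" numbers, and the gain-per-dollar selection rule are invented for the example and are not MAI-DxO's actual logic.

```python
# Toy sketch of a budget-aware diagnostic loop. Each candidate action
# (question, test) carries a price and a rough "information gain"; a
# stewardship rule vetoes anything the remaining budget cannot cover,
# and the loop commits to a diagnosis once confidence is high enough.

def choose_action(candidates, remaining_budget):
    """Pick the affordable action with the best gain per dollar."""
    affordable = [a for a in candidates if a["cost"] <= remaining_budget]
    if not affordable:
        return None
    return max(affordable, key=lambda a: a["gain"] / a["cost"])

def diagnose(candidates, budget, threshold=0.9):
    spent, confidence, log = 0.0, 0.0, []
    while confidence < threshold:
        action = choose_action(candidates, budget - spent)
        if action is None:              # budget exhausted: commit to best guess
            break
        spent += action["cost"]
        confidence += action["gain"]    # stand-in for real belief updating
        log.append(action["name"])
        candidates = [a for a in candidates if a is not action]
    return spent, confidence, log

actions = [
    {"name": "history question", "cost": 300.0,  "gain": 0.25},
    {"name": "targeted assay",   "cost": 500.0,  "gain": 0.50},
    {"name": "full-body MRI",    "cost": 5000.0, "gain": 0.30},
]
spent, confidence, steps = diagnose(actions, budget=2000.0)
print(spent, confidence, steps)  # 800.0 0.75 ['targeted assay', 'history question']
```

Note how the expensive scan never fires: the selection rule prefers the cheap, high-yield test first — the same trade-off the orchestrator's cost-watching roles are meant to enforce.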
How the study was conducted: each case was converted into a chain of "visits"; prices were based on U.S. billing rates. Diagnostic accuracy was verified by an "LM judge" — another model that scored outputs according to a rubric (disease core, cause, localization, specificity, completeness). Where text could be ambiguous, agreement with physicians was measured; credit was given at 4 out of 5 points or higher. Physician participants (n=21) achieved 19.9% diagnostic accuracy at $2,963 per case; GPT-4o — 49.3% at $2,745; o3 — 78.6% at $7,850 (expensive, but it more often ordered the appropriate tests). With MAI-DxO on top of o3, the result was 81.9% at $4,735; in "budget mode" — 79.9% at $2,396; a model ensemble raised the bar to 85.5% at $7,184. The results held up on a held-out validation set. An illustrative episode: in a poisoning case, the base model "latched onto" an incorrect hypothesis and "burned" $3,431; the orchestrator clarified the source (hand sanitizer) → ordered a single targeted test → reached the diagnosis for $795. For completeness, the authors compared their AI against other top models (OpenAI, Gemini, Claude, Grok, DeepSeek, Llama).
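The rubric-to-credit rule can be sketched as follows. The five axes come from the article; the aggregation (mean of per-axis 1–5 scores, credit at ≥ 4) is an illustrative assumption, not the study's published judging prompt.

```python
# Minimal sketch of the "LM judge" credit rule. Axes are from the article;
# the mean-then-threshold aggregation is an assumption for illustration.

RUBRIC_AXES = ("disease core", "cause", "localization", "specificity", "completeness")

def judge_credit(axis_scores, pass_mark=4):
    """Collapse per-axis 1-5 scores into a credit decision and a mean score."""
    assert set(axis_scores) == set(RUBRIC_AXES), "score every rubric axis"
    mean = sum(axis_scores.values()) / len(axis_scores)
    return mean >= pass_mark, mean

good = {"disease core": 5, "cause": 5, "localization": 4,
        "specificity": 4, "completeness": 4}
weak = {"disease core": 4, "cause": 3, "localization": 3,
        "specificity": 3, "completeness": 2}
print(judge_credit(good))  # (True, 4.4)
print(judge_credit(weak))  # (False, 3.0)
```

A thresholded score like this is what makes "% correct" computable at scale — and why the article stresses checking the judge's agreement with physicians on ambiguous write-ups.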
If confirmed in real-world practice, clinical workflows will shift from "chatbots" to agentic orchestrators with built-in budget logic: patient triage and consultations will more often land on the correct diagnosis at lower testing costs; insurers and regulators will gain a transparent metric — "dollars per correct diagnosis" — and the "black box" will open into an auditable log of steps; medical training will move from isolated exercises to sequential reasoning simulators. But the limitations matter: the cases come from NEJM (the New England Journal of Medicine) — a complex, "academic" selection, and it is unclear how the models would perform on routine, straightforward patients; prices reflect U.S. rates and do not account for logistics, wait times, or invasiveness; physicians worked without consultations; and the "LM judge" assessment, while close to physician judgment, still requires validation against real outcomes. In other words: the potential is substantial — but before changing protocols, prospective clinical trials are needed.