FHIR Agents at ICML 2026: 8B Model Beats o4-mini & GLM-5

Our paper "Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)" has been accepted to ICML 2026, an influential AI conference. In it, we use reinforcement learning to train a retrieval agent that answers clinical questions over real patient data in FHIR format. On the benchmark FHIR-AgentBench, our compact 8B model clearly outperforms much larger commercial systems such as o4-mini, Gemini 3 Flash, and GLM-5, at a fraction of the model size and while running entirely on-premises.

What is this about?

Hospitals increasingly exchange patient data through a common standard: FHIR (Fast Healthcare Interoperability Resources). FHIR is well established as a format for interoperable healthcare data exchange, but its data structure creates new challenges when answering clinical questions.

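The reason: FHIR is not a flat dataset, but a directed, typed graph. A patient is linked to their encounters (Encounter), which are in turn linked to observations (Observation), conditions (Condition), and medication requests (MedicationRequest). These links are stored as references on the resources themselves, typically on the more specific one: an Observation, for example, references its Encounter and its Patient. Each of these resources can contain further references.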
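To make the graph structure concrete, here is a minimal sketch in Python. The resources are heavily trimmed and entirely made up (IDs, fields shown), and the helper name outgoing_references is our own choice for illustration. It only shows how reference strings of the form "Type/id" turn a set of JSON documents into typed, directed edges.

```python
# Hypothetical, heavily trimmed FHIR resources; all IDs are made up.
resources = {
    "Patient/p1": {"resourceType": "Patient", "id": "p1"},
    "Encounter/e1": {
        "resourceType": "Encounter",
        "id": "e1",
        "subject": {"reference": "Patient/p1"},
    },
    "Observation/o1": {
        "resourceType": "Observation",
        "id": "o1",
        "subject": {"reference": "Patient/p1"},
        "encounter": {"reference": "Encounter/e1"},
    },
}


def outgoing_references(resource):
    """Collect every "reference" string nested anywhere in a resource."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            ref = node.get("reference")
            if isinstance(ref, str):
                found.append(ref)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(resource)
    return found


# Directed edges of the graph: resource key -> keys it points to.
edges = {key: outgoing_references(res) for key, res in resources.items()}
```

In this toy example the Observation carries two outgoing edges (to its patient and its encounter), while the Patient carries none. Answering a question usually means following such edges in both directions.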

A single clinically relevant question such as "Which medication was administered to the patient first during their most recent admission?" therefore requires multiple steps: find the right encounter, identify the linked medication requests, resolve the referenced medication resource, sort by time. On top of that, the data model does not look the same everywhere. FHIR standardizes the interface, but in practice we encounter several types of heterogeneity, including:

Semantic heterogeneity: The same clinical information is expressed in different terminologies, coding systems, or units.
Structural heterogeneity: Optional fields are filled in differently, local profiles add custom fields, and the same fact may be modeled in different resource types.

In practice, this variety makes every FHIR server something of a unique system.
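As a small illustration of semantic heterogeneity, the following sketch normalizes a body-weight observation that may arrive in different units. The conversion table, the function name weight_in_kg, and the sample values are assumptions for this example only; a real deployment would need a far more complete unit and terminology mapping.

```python
# Hypothetical conversion table keyed by UCUM-style unit codes.
TO_KG = {"kg": 1.0, "g": 0.001, "[lb_av]": 0.45359237}


def weight_in_kg(observation):
    """Return the observation's value in kilograms, or None if unknown."""
    quantity = observation.get("valueQuantity")
    if quantity is None:
        return None
    # Some servers fill the coded unit, others only the display unit.
    unit = quantity.get("code") or quantity.get("unit")
    factor = TO_KG.get(unit)
    if factor is None:
        return None
    return quantity["value"] * factor


metric = {"valueQuantity": {"value": 70, "unit": "kg"}}
imperial = {"valueQuantity": {"value": 100, "code": "[lb_av]"}}
weights = [weight_in_kg(metric), weight_in_kg(imperial)]
```

The same clinical fact is stored two different ways here, and code that assumes one of them silently fails on the other. This is the kind of server-specific detail an agent has to discover.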

Why small, local models matter in the clinic

Language models are a promising building block for making clinical data more accessible, for example as an interface that lets clinicians ask questions of the patient record in natural language instead of navigating nested FHIR structures. In many domains, large commercial models (GPT-5, Gemini, Claude) are the simplest answer. Not in the clinic: for data protection reasons, patient data often cannot be transferred to external cloud services. Anyone who wants to deploy LLMs inside the hospital therefore depends on smaller, locally operated models, which typically perform significantly worse than their larger counterparts.

Our work shows that this performance gap is not inevitable, provided the model is specialized for the task.

What we built

Our approach consists of an agent that interacts with a real FHIR server through two tools, and a reinforcement learning procedure that teaches it when to use them. Training and evaluation run on FHIR-AgentBench (Lee et al., 2025), a benchmark with 2,931 clinically motivated questions over real, de-identified patient records from MIMIC-IV in FHIR format.

The agent has two tools available:

A retrieval tool for the FHIR database that loads resources of a specific type for a patient.
A Python interpreter that transforms the returned JSON data: filtering, sorting, aggregating, resolving references. Python is a good fit here because these operations on nested JSON can be expressed in just a few lines of code.
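The interplay of the two tools can be sketched as follows. This is not the implementation from the paper: the in-memory STORE stands in for a FHIR server, get_resources is an assumed interface for the retrieval tool, and all records are made up. The second half shows the kind of short snippet an agent might run in its Python tool.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for a FHIR server; all records are made up.
STORE = [
    {"resourceType": "Observation", "id": "o1",
     "subject": {"reference": "Patient/p1"},
     "code": {"text": "Creatinine"},
     "effectiveDateTime": "2130-01-02T08:00:00+00:00",
     "valueQuantity": {"value": 1.4, "unit": "mg/dL"}},
    {"resourceType": "Observation", "id": "o2",
     "subject": {"reference": "Patient/p1"},
     "code": {"text": "Creatinine"},
     "effectiveDateTime": "2130-01-05T08:00:00+00:00",
     "valueQuantity": {"value": 1.1, "unit": "mg/dL"}},
    {"resourceType": "Observation", "id": "o3",
     "subject": {"reference": "Patient/p1"},
     "code": {"text": "Sodium"},
     "effectiveDateTime": "2130-01-06T08:00:00+00:00",
     "valueQuantity": {"value": 139, "unit": "mmol/L"}},
    {"resourceType": "Observation", "id": "o4",
     "subject": {"reference": "Patient/p2"},
     "code": {"text": "Creatinine"},
     "effectiveDateTime": "2130-01-07T08:00:00+00:00",
     "valueQuantity": {"value": 0.9, "unit": "mg/dL"}},
]


def get_resources(patient_id, resource_type):
    """Retrieval tool (assumed interface): one resource type for one patient."""
    ref = f"Patient/{patient_id}"
    return [r for r in STORE
            if r["resourceType"] == resource_type
            and r.get("subject", {}).get("reference") == ref]


# A snippet of the kind the agent might run in the Python tool:
observations = get_resources("p1", "Observation")
creatinine = [o for o in observations if o["code"]["text"] == "Creatinine"]
creatinine.sort(key=lambda o: datetime.fromisoformat(o["effectiveDateTime"]))
latest = creatinine[-1]
answer = f'{latest["valueQuantity"]["value"]} {latest["valueQuantity"]["unit"]}'
```

Filtering, sorting by timestamp, and picking the last element take four lines here, which illustrates why Python is a convenient action format for nested JSON.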

The agent works in a ReAct loop: at each step it thinks briefly, calls one of its tools, and reads the return value before planning the next step. Because one of the actions is executable code, the setup corresponds to a CodeAct variant of ReAct, that is, ReAct with code as the action format. The agent decides on its own when it can deliver the final answer or when it needs another investigation step.
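The control flow of such a loop can be sketched in a few lines. Everything here is a simplified assumption: scripted_policy is a hard-coded stand-in for the language model, the tool names and the max_steps cap are arbitrary, the records are made up, and eval is used only to keep the example short (a real agent needs a proper sandbox).

```python
# Made-up records standing in for the contents of a FHIR server.
RECORDS = [
    {"resourceType": "MedicationRequest", "id": "mr1", "authoredOn": "2130-01-03"},
    {"resourceType": "MedicationRequest", "id": "mr2", "authoredOn": "2130-01-01"},
    {"resourceType": "Observation", "id": "o1"},
]


def make_tools(records):
    state = {"data": None}

    def retrieve(resource_type):
        state["data"] = [r for r in records if r["resourceType"] == resource_type]
        return f"{len(state['data'])} {resource_type} resources loaded into data"

    def python(expression):
        # Illustration only: eval is NOT a safe sandbox for model-written code.
        return eval(expression, {"__builtins__": {}, "data": state["data"], "min": min})

    return {"retrieve": retrieve, "python": python}


def scripted_policy(history):
    """Stand-in for the LLM: returns (thought, action, argument)."""
    observations = [item for kind, item in history if kind == "observation"]
    if len(observations) == 0:
        return ("I need the medication requests.", "retrieve", "MedicationRequest")
    if len(observations) == 1:
        code = 'min(data, key=lambda r: r["authoredOn"])["id"]'
        return ("Find the earliest request.", "python", code)
    return ("I have the answer.", "final_answer", observations[-1])


def run_agent(question, policy, tools, max_steps=8):
    history = [("question", question)]
    for _ in range(max_steps):
        thought, action, argument = policy(history)
        history.append(("thought", thought))
        if action == "final_answer":
            return argument, history
        # Execute the chosen tool and feed its result back into the context.
        history.append(("observation", tools[action](argument)))
    return None, history  # step budget exhausted without a final answer


answer, history = run_agent(
    "Which medication request came first?", scripted_policy, make_tools(RECORDS)
)
```

The scripted policy needs two tool calls before it answers. In the real system, the model decides at every step whether to call another tool or to stop.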

An automatic LLM judge compares the answer against the ground truth from FHIR-AgentBench and returns a binary reward signal (correct/incorrect). From these signals, the model learns step by step which tool calls in which order lead to the goal.
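The shape of this reward signal can be sketched as below. Note the assumptions: stub_judge is a plain normalized string comparison that merely stands in for the LLM judge, and the function names are our own. The sketch only shows that each finished run is mapped to a binary reward, not how the judge or the policy update work.

```python
def stub_judge(answer, ground_truth):
    """Stand-in for the LLM judge: case- and whitespace-insensitive match."""
    def normalize(text):
        return " ".join(str(text).lower().split())

    return normalize(answer) == normalize(ground_truth)


def reward(answer, ground_truth, judge=stub_judge):
    """Binary reward: 1.0 for a correct final answer, otherwise 0.0."""
    if answer is None:  # the agent never produced a final answer
        return 0.0
    return 1.0 if judge(answer, ground_truth) else 0.0


rewards = [
    reward("  Aspirin ", "aspirin"),
    reward("ibuprofen", "aspirin"),
    reward(None, "aspirin"),
]
```

Because the reward depends only on the final answer, the agent is free to find its own sequence of tool calls. Intermediate steps are not scored individually in this sketch.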

Figure 1: Training pipeline. The agent answers a question across several tool calls, a judge evaluates the answer, and the model learns from the reward signal.

Important: we do not train the model on a static dataset of predefined solution paths. It learns by interacting with the real FHIR server and, in the process, discovers for itself how the data is organized on this specific server.

The results

Figure 2: Example run. The agent loads the relevant FHIR resources from the server, analyzes them with Python, and arrives at the correct answer.

Specialization beats scale

We compare our trained 8B agent against different Qwen3 sizes without specialization (4B, 8B, 14B, 32B) and against commercial API models: o4-mini (OpenAI), Gemini 3 Flash (Google), and GLM-5 (Zhipu, 744B parameters). The FHIR-AgentBench study reports 50% accuracy with o4-mini in the best setup. Our RL-trained model reaches 77%.

Figure 3: Left: answer correctness on FHIR-AgentBench. Our RL-trained Qwen3-8B reaches 77% and clearly outperforms all API baselines. Right: accuracy by FHIR resource type over the course of training. The agent first learns to answer "empty" questions correctly, then gradually the harder resource types; questions about medications remain the biggest hurdle.

The model also becomes more efficient

A nice side effect: after training, the agent needs fewer steps per query. Untrained Qwen3-8B spreads its (rarer) successes over up to 12 tool calls; the trained model answers most questions in well under 6 steps. This reduces latency and operating cost in production.

Limitations and outlook

We broke down the errors by resource type. The hardest questions are those where the agent has to resolve references between resources, typically the pair MedicationRequest → Medication. The agent generally masters reference resolution but does not yet apply this skill as reliably as the simpler lookup steps. This is where we see the clearest lever for further improvements. A note on framing: 77% accuracy is a significant jump over the state of the art, but far below the threshold for autonomous clinical decisions. Our agent is intended as a tool for medical professionals, not as a replacement.
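The reference-resolution step mentioned above, MedicationRequest → Medication, can be sketched as follows. The sketch assumes FHIR R4 field names (medicationReference, medicationCodeableConcept), reads only the text field of the code, and uses made-up records. The helper name medication_name is hypothetical.

```python
def medication_name(request, medications_by_id):
    """Resolve a MedicationRequest to a medication name, or None."""
    # Case 1: the medication is given inline as a codeable concept.
    concept = request.get("medicationCodeableConcept")
    if concept is not None:
        return concept.get("text")

    # Case 2: the request points to a separate Medication resource.
    reference = request.get("medicationReference", {}).get("reference")
    if reference is None:
        return None
    resource_type, _, resource_id = reference.partition("/")
    if resource_type != "Medication":
        return None
    medication = medications_by_id.get(resource_id)
    if medication is None:
        return None  # dangling reference: target was not retrieved
    return medication.get("code", {}).get("text")


# Made-up example data.
medications = {"m1": {"resourceType": "Medication", "id": "m1",
                      "code": {"text": "Heparin"}}}
by_reference = {"resourceType": "MedicationRequest",
                "medicationReference": {"reference": "Medication/m1"}}
inline = {"resourceType": "MedicationRequest",
          "medicationCodeableConcept": {"text": "Aspirin"}}
dangling = {"resourceType": "MedicationRequest",
            "medicationReference": {"reference": "Medication/m9"}}

names = [medication_name(r, medications) for r in (by_reference, inline, dangling)]
```

Compared with a plain lookup, this step needs a second retrieval (the Medication resources) and has several ways to fail, which may be part of why it is harder to apply reliably.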

What does this mean in practice?

We show a concrete way for hospitals to operate capable AI assistants without patient data ever leaving the premises. The training procedure can be adapted to the local idiosyncrasies of any FHIR server, which matters because every institution uses its own variants.

We see our agent as a bridge between FHIR data and advanced medical AI applications: diagnostic models, risk predictors, and clinical assistants all ultimately need access to the right patient data. Building exactly this bridge is the goal: specialized, local, privacy-compliant, and on par with the large models.

The paper is joint work with Robert Müller and is available in full on arXiv. Questions, collaboration ideas, criticism: hallo@idmedizin.de.