Bone infection decisions are rarely constrained by a lack of clinical knowledge. They fail because the decision has to be taken while outcomes remain months away, and the trade-offs can be irreversible. That is already true without AI.
The useful way to think about AI in healthcare is not as a new decision-maker, but as a new timing mechanism. It changes what information appears earlier, what appears more confidently, and what silently stays missing.
Summary
Clinical AI inherits the structure of endpoints, follow-up windows, and reporting practices in the datasets it is trained on. (Zsidai, 2025).
In infection surveillance, different definitions and capture mechanisms produce different “truths”, even in the same national cohort, which directly limits what prediction models can safely claim. (Karlsen, 2025).
The deployment risk is often not wrong answers in obvious cases, but confident outputs in contested cases, where responsibility and liability remain with humans. (Winkler, 2025).
Why this matters
If an AI model is trained on registry outcomes, the registry’s endpoint choices become the model’s world. A concrete example comes from Norway, where two national systems track infection after primary total hip arthroplasty (THA) using different surveillance approaches. In the national infection surveillance system (NOIS), 1.6% of 87,923 primary THAs were reported as 30-day surgical site infections (SSI), and 0.9% underwent re-operation for SSI within 30 days. In the arthroplasty register (NAR), the 30-day re-operation rate for periprosthetic joint infection was 0.8% (725 of 91,194), and the 1-year rate was 1.2% (1,019 of 91,194). (Karlsen, 2025).
Those numbers are not just epidemiology. They are a reminder that “infection” is operationalised through definitions, time windows, and reporting rules. A model trained on one endpoint cannot be assumed to generalise to another, even when everyone believes they are talking about the same complication. (Karlsen, 2025).
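To make that concrete, here is a minimal sketch in which one patient’s course is labelled under three endpoint definitions. The field names, dates, and windows are illustrative assumptions, not either register’s actual schema:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Course:
    """One patient's post-operative course (illustrative fields only)."""
    surgery: date
    ssi_reported: Optional[date]               # surveillance report of a surgical site infection
    reoperation_for_infection: Optional[date]  # return to theatre for infection, if any

def within(course: Course, event: Optional[date], window_days: int) -> bool:
    """True if the event occurred inside the endpoint's time window."""
    return event is not None and (event - course.surgery) <= timedelta(days=window_days)

# An infection reported at day 19 and suppressed with antibiotics, never re-operated.
course = Course(
    surgery=date(2022, 3, 1),
    ssi_reported=date(2022, 3, 20),
    reoperation_for_infection=None,
)

print(within(course, course.ssi_reported, 30))                # True:  "30-day reported SSI"
print(within(course, course.reoperation_for_infection, 30))   # False: "30-day re-operation"
print(within(course, course.reoperation_for_infection, 365))  # False: "1-year re-operation"
```

The same patient had an infection under one definition and none under the other two. A model trained on any one of these labels learns a different disease.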
A practical scene before we talk about AI
Take a familiar fracture-related infection moment: follow-up after fixation, a wound that has not behaved, and a patient who wants certainty. The surgeon can explain risks, likely pathways, and what influences recurrence. The hard part is not explanation. The hard part is choosing between reasonable options when outcomes unfold slowly, and when “waiting to see” is itself a decision that can narrow future choices.
That matters because it clarifies what kinds of AI support map cleanly onto the work, and what kinds create new pressure.
What “AI in healthcare” usually looks like in real workflows
Most current clinical AI systems fall into two operational roles.
One role is prediction and pattern recognition from structured inputs such as imaging, labs, coded diagnoses, and registry variables. The other role is synthesis and retrieval, including drafting summaries, extracting patterns from records, or surfacing relevant literature. Orthopaedic reviews typically describe these clusters while emphasising that generalisability, bias, validation, and verification are decisive constraints, not footnotes. (Misir, 2025).
The key point is not that these systems exist. The key point is that deploying them exposes bottlenecks that clinical teams and commercial teams have historically been able to absorb informally.
The training requirements that become deployment bottlenecks
A practical data management guide from the ESSKA AI Working Group makes the underlying dependency explicit: orthopaedic AI draws from EHRs, registries, imaging databases, wearable and sensor data, and unstructured clinical notes, each with different completeness and governance constraints. (Zsidai, 2025).
In deployment terms, three bottlenecks tend to dominate.
Endpoint bottleneck
If the label is a re-operation code, the model learns what leads to re-operation. It does not learn about infections treated non-operatively, misclassified cases, prevented escalation, or the judgement calls that kept the patient out of theatre. Karlsen et al. note, for example, that in the arthroplasty register the cause of re-operation is reported by the surgeon immediately and is not subsequently corrected against bacterial findings, and that infections not requiring re-operation are not captured at all. (Karlsen, 2025).
That is not a criticism of registries. It is a description of what the dataset can and cannot teach a model.
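A minimal sketch of the same point in labelling terms, assuming a hypothetical registry field ("reop_cause") rather than any register’s actual schema:

```python
def label_from_registry(record: dict) -> int:
    """1 if a re-operation for infection was recorded, else 0."""
    return int(record.get("reop_cause") == "infection")

records = [
    {"id": "A", "reop_cause": "infection"},    # captured by the endpoint
    {"id": "B", "reop_cause": None},           # PJI suppressed with antibiotics, never re-operated
    {"id": "C", "reop_cause": "dislocation"},  # re-operated, but not for infection
]

print({r["id"]: label_from_registry(r) for r in records})
# {'A': 1, 'B': 0, 'C': 0} -- patient B's infection is structurally invisible to the model
```

No amount of training compute recovers a signal the label cannot encode.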
Context bottleneck
A model behaves like the environment that trained it. When reporting practices, follow-up completeness, and clinical thresholds differ between sites, performance can degrade in exactly the cases that create reputational and medico-legal exposure. The ESSKA guide on risks and verification frames this as a safety and verification problem, not just a performance problem. (Winkler, 2025).
Verification bottleneck
Many models look strongest in clean, common, retrospective cases and weakest in messy, borderline, high-stakes decisions. Verification is about stress testing where the system is likely to be used, not where it is easy to score. (Winkler, 2025).
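One way to operationalise that, sketched here with invented data and groupings rather than any validated protocol, is to report performance stratified by site and case difficulty instead of a single pooled score:

```python
from collections import defaultdict

def stratified_accuracy(cases: list[dict]) -> dict:
    """Accuracy per (site, difficulty) stratum rather than one pooled score."""
    buckets = defaultdict(lambda: [0, 0])  # stratum -> [correct, total]
    for c in cases:
        key = (c["site"], c["difficulty"])
        buckets[key][0] += int(c["pred"] == c["truth"])
        buckets[key][1] += 1
    return {k: correct / total for k, (correct, total) in buckets.items()}

cases = [
    {"site": "A", "difficulty": "routine",    "pred": 1, "truth": 1},
    {"site": "A", "difficulty": "borderline", "pred": 1, "truth": 0},
    {"site": "B", "difficulty": "routine",    "pred": 0, "truth": 0},
    {"site": "B", "difficulty": "borderline", "pred": 1, "truth": 0},
]

for stratum, acc in stratified_accuracy(cases).items():
    print(stratum, f"{acc:.0%}")
# Pooled accuracy is 50%, but every borderline case failed -- exactly the cases
# where the tool is most likely to be leaned on.
```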
This is the practical meaning of “AI inherits healthcare’s data infrastructure”. It is not a philosophical statement. It is a deployment constraint.
Where AI maps cleanly onto infection work, and where it creates new decision points
Where it helps
AI can reduce cognitive load in two ways that are immediately tangible.
It can improve consistency of information handling, by structuring what is already known and bringing relevant evidence or comparable cases into view more reliably than memory alone. (Misir, 2025).
It can surface weak signals earlier, particularly when risk is distributed across multiple small indicators that no single clinician would treat as decisive in isolation. (Misir, 2025).
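As an illustration of the second point, here is a minimal sketch in which no indicator is decisive alone but the combination shifts the estimate. The features, weights, and bias are invented for illustration, not a validated model:

```python
import math

def logistic_risk(features: dict[str, float], weights: dict[str, float], bias: float) -> float:
    """Standard logistic combination of weighted indicators."""
    z = bias + sum(weights[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-z))

weights = {"crp_trend_up": 0.9, "wound_drainage_days": 0.15, "low_grade_fever": 0.7}

quiet  = {"crp_trend_up": 0.0, "wound_drainage_days": 0.0, "low_grade_fever": 0.0}
subtle = {"crp_trend_up": 1.0, "wound_drainage_days": 4.0, "low_grade_fever": 1.0}

print(f"{logistic_risk(quiet,  weights, bias=-2.0):.2f}")  # ~0.12
print(f"{logistic_risk(subtle, weights, bias=-2.0):.2f}")  # ~0.55: no signal decisive alone
```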
Where it creates new exposure
The collision happens when earlier visibility is misread as earlier certainty.
Earlier risk signals can change the implied standard. Once a score is displayed, not acting can become harder to justify, even when the score is based on endpoints that miss decisive parts of infection care. This creates documentation pressure: the clinician is not only explaining medicine to a patient, but also defending a reasoning chain against an apparently objective number.
A conceptual framework for AI-supported shared decision-making makes a closely related point: technical explainability is not the same as clinical communicability, and shared decision-making requires outputs to be contestable and contextual, not merely transparent. (As’ad, 2025).
In other words, better explanation does not collapse value differences. It often makes disagreement more legible.
Failure modes: why “impressive” can still be unsafe
A useful reality check is that large language models vary widely in performance even when benchmarked against published guidance in osteoarticular infections. A study evaluating 15 LLMs on clinical cases reported variable performance across models and categories, which is a reminder that “LLM output” is not a stable capability and should not be treated as interchangeable across tools. (Borgonovo, 2025).
Outside clinical settings, the same failure mode is now visible at scale: confident medical summaries with weak sourcing and unclear provenance. Recent reporting has raised concerns about Google’s AI Overviews presenting incorrect health information with high authority, and about the sources these summaries cite. (Gregory, 2026; Peters, 2026; The Guardian, 2026).
The lesson for clinical and commercial stakeholders is not that summarisation is inherently bad. It is that provenance, source hierarchy, and uncertainty handling are safety features, not user interface polish.
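One way to read that as an engineering requirement, sketched with an assumed schema rather than any product’s actual API, is to make provenance a precondition for output rather than an annotation:

```python
from dataclasses import dataclass, field

@dataclass
class SourcedClaim:
    text: str
    sources: list[str]  # citations the claim can be traced back to
    confidence: str     # e.g. "guideline-backed", "single study", "uncertain"

@dataclass
class Summary:
    claims: list[SourcedClaim] = field(default_factory=list)

    def publishable(self) -> bool:
        # Refuse to surface any claim that carries no traceable source.
        return all(claim.sources for claim in self.claims)

summary = Summary(claims=[SourcedClaim(
    text="Re-operation endpoints under-capture non-operatively treated infection.",
    sources=["Karlsen 2025"],
    confidence="single national cohort",
)])
print(summary.publishable())  # True only because every claim carries a source
```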
What to keep in view: real progress that is not clinical replacement
AI is advancing rapidly in domains adjacent to infection care, including antimicrobial discovery pipelines using generative approaches. (Wang, 2025).
Digital twin concepts are also evolving, but the literature remains heterogeneous, and clinically grounded work in orthopaedics still describes specific use cases rather than generalisable systems. (Tudor, 2025; Andres, 2025).
This matters because it suggests a realistic trajectory: more computable support around clinical work, not an exit from responsibility inside clinical work.
Common pitfalls that look like progress
Treating a risk score as an outcome prediction when it is actually an endpoint prediction shaped by re-operation thresholds, follow-up windows, and reporting rules. (Karlsen, 2025).
Equating clearer explanations with easier decisions, when the contested part is values, irreversibility, and acceptable uncertainty. (As’ad, 2025).
Mistaking cross-site performance claims for safety, without verification in the deployment context where accountability sits. (Winkler, 2025).
Selling “AI-enabled” as a procurement virtue without being able to state what the model does not learn from the dataset, and what happens when its output conflicts with clinical judgement. (Winkler, 2025).
Closing note
AI changes infection decisions by changing timing. It brings risk and structure forward in the pathway. That can reduce cognitive load, but it also creates new decision points that clinicians may have to defend and commercial teams may have to stand behind.
Before adopting or representing an infection-related AI tool, the most practical questions are often the simplest:
Which outcomes does this model actually learn from, and which clinically meaningful outcomes are structurally invisible in the data? (Karlsen, 2025; Zsidai, 2025).
What is the failure mode in borderline cases, and who carries responsibility when the model is confident and the decision remains contested? (Winkler, 2025; As’ad, 2025).
What evidence exists that performance and safety hold in the deployment setting, not only in the training setting? (Winkler, 2025).
References
Karlsen ØE et al. Trends in surgical site infection and periprosthetic joint infection after primary total hip arthroplasty in two national health registers 2013–2022. J Hosp Infect. 2025.
Zsidai B et al. A practical guide to the implementation of AI in orthopaedic research Part 5: Data management. J Exp Orthop. 2025.
Winkler PW et al. A practical guide to the implementation of AI in orthopaedic research Part 7: Risks, limitations, safety and verification of medical AI systems. J Exp Orthop. 2025.
As’ad M et al. AI-Supported Shared Decision-Making (AI-SDM): conceptual framework. JMIR AI. 2025.
Misir A et al. AI in Orthopedic Research: A Comprehensive Review. J Orthop Res. 2025.
Borgonovo F et al. Battle of the Bots: Solving Clinical Cases in Osteoarticular Infections With Large Language Models. Mayo Clin Proc Digit Health. 2025.
Wang Y et al. A generative artificial intelligence approach for the discovery of antimicrobial peptides against multidrug-resistant bacteria. Nat Microbiol. 2025.
Tudor BH et al. A scoping review of human digital twins in healthcare applications and usage patterns. NPJ Digit Med. 2025.
Andres A et al. Advantages of digital twin technology in orthopedic trauma surgery: exploring different clinical use cases. Sci Rep. 2025.
Gregory A. How the “confident authority” of Google AI Overviews is putting public health at risk. The Guardian. 2026.
Peters J. Google pulls AI overviews for some medical searches. The Verge. 2026.
Google AI Overviews cite YouTube more than any medical site for health queries, study suggests. The Guardian. 2026.