"GPT-5.1 Thinking shows potential to support infection surveillance under strict constraints but exhibits systematic limitations, including overreliance on clinical intuition and difficulty with complex exclusion pathways" Alzyood et al. (2026).
Healthcare-associated infection surveillance

Abstract:

Background: Large language models (LLMs) are increasingly explored for healthcare-associated infection (HAI) surveillance, but their reliability in applying formal National Healthcare Safety Network (NHSN) definitions is not well characterized. This study evaluates GPT-5.1 Thinking’s accuracy and rationales in classifying NHSN-defined infections.

Methods: Seventy synthesized case vignettes containing complete, organized clinical data representing five NHSN infection types, including complex edge cases, were assessed using 2025 NHSN surveillance definitions. GPT-5.1 Thinking classified cases under three prompting strategies: standard, structured, and constrained. Quantitative accuracy metrics and qualitative inductive content analysis of rationales and failure modes were performed.

Results: Overall accuracy across 210 classifications improved from 78.6% (standard prompt) to 88.6% (structured) and 95.7% (constrained). Performance was highest for infections with clear anatomical or radiographic criteria (surgical site infections [SSI], ventilator-associated pneumonia [VAP]) and lowest for infections involving complex exclusion rules (central line-associated bloodstream infection [CLABSI], Clostridioides difficile infection [CDI]). Constrained prompting enhanced adherence to NHSN rules but did not eliminate errors in hierarchical exclusions. Content analysis identified three recurrent failure categories: prioritization of clinical plausibility over surveillance logic, failure to apply quantitative and temporal thresholds, and errors in hierarchical source attribution.
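The reported percentages can be translated back into case counts as a quick sanity check. The sketch below assumes that each of the three prompting strategies was applied to all 70 vignettes (70 x 3 = 210 classifications, which matches the abstract). The per-prompt counts it produces are inferred from the rounded percentages and are not stated in the abstract; the function and variable names are illustrative only.

```python
# Assumption: every prompting strategy classified all 70 vignettes,
# so 3 strategies x 70 vignettes = 210 classifications in total.
VIGNETTES = 70

# Accuracy figures as reported in the abstract (percent, one decimal place).
reported_accuracy = {"standard": 78.6, "structured": 88.6, "constrained": 95.7}


def implied_correct(accuracy_pct, n=VIGNETTES):
    """Return the unique whole-number count of correct cases whose
    rate rounds to accuracy_pct, or None if no unique count exists."""
    matches = [k for k in range(n + 1) if round(100 * k / n, 1) == accuracy_pct]
    return matches[0] if len(matches) == 1 else None


# Inferred counts (not stated in the abstract).
implied = {name: implied_correct(pct) for name, pct in reported_accuracy.items()}
errors = {name: VIGNETTES - correct for name, correct in implied.items()}

for name in reported_accuracy:
    print(f"{name}: {implied[name]}/{VIGNETTES} correct, {errors[name]} misclassified")
```

Under that assumption, the figures correspond to 55, 62, and 67 correct classifications out of 70, that is, 15, 8, and 3 misclassified vignettes respectively. With only 70 cases per prompt, each vignette moves accuracy by roughly 1.4 percentage points, so the differences between strategies rest on a small number of cases.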

Conclusion: GPT-5.1 Thinking shows potential to support infection surveillance under strict constraints but exhibits systematic limitations, including overreliance on clinical intuition and difficulty with complex exclusion pathways. Currently, LLMs are unsuitable for autonomous NHSN classification but may serve as supervised decision-support tools with robust human oversight. Further development is needed to enhance LLMs’ ability to synthesize surveillance definitions and complex situational characteristics critical for effective HAI surveillance, though fully autonomous deployment would require further validation. These findings are based on synthetic data that may differ from real-world clinical data in ways likely to overestimate the accuracy of these tools.

Reference:

Alzyood M, Veldhuis A, Stevenson H, Sheikh S. Hidden failure modes of large language models in healthcare-associated infection surveillance: a structured evaluation using NHSN definitions. Infect Control Hosp Epidemiol. 2026 Apr 6:1-6. doi: 10.1017/ice.2026.10444. Epub ahead of print. PMID: 41937595.