Original Research

Open Access

Evaluating ChatGPT versions 3.5, 4.0, and 5.0 in patient- and guideline-based questions on Optilume® therapy

Evaluación de las versiones 3.5, 4.0 y 5.0 de ChatGPT en preguntas basadas en pacientes y guías sobre la terapia Optilume®

  • Eralp Kubilay1,*,
  • Hüseyin Gültekin1

1Department of Urology, Near East University, 1010 Nicosia, Cyprus

DOI: 10.22514/j.androl.2026.007 | Vol. 24, Issue 1, March 2026, pp. 41–48

Submitted: 25 August 2025 Accepted: 23 October 2025

Published: 30 March 2026

*Corresponding Author(s): Eralp Kubilay E-mail: eralp.kubilay@neu.edu.tr

Abstract

Background: We aimed to evaluate the accuracy, completeness, and reproducibility of ChatGPT versions 3.5, 4.0, and 5.0 in responding to patient-oriented and guideline-based questions on Optilume® therapy. Methods: Twenty structured questions were developed from patient Frequently Asked Questions (FAQs), social media forums, and procedural guidelines, covering five thematic domains: Device Mechanism and Indications, Procedural Technique, Outcomes and Efficacy, Complications, and Postoperative Management. Each question was posed to ChatGPT versions 3.5 (free), 4.0 (subscription), and 5.0 (latest subscription). Two independent urologists graded responses using a four-point scale (completely correct, correct but incomplete, partially misleading, completely incorrect). Question difficulty and source type (FAQ vs. guideline-based) were also analyzed. Reproducibility was assessed using Cohen’s kappa. Results: Overall, ChatGPT performance improved progressively with each version. Combined success rates (completely correct + correct but incomplete) were 75% for 3.5, 85% for 4.0, and 90% for 5.0. The Device Mechanism and Procedural Technique domains achieved the highest accuracy across all versions, while Outcomes, Complications, and Postoperative Management improved notably in versions 4.0 and 5.0. FAQs were answered more accurately than guideline-based questions, with version 5.0 reaching 90% vs. 60%, respectively. Response accuracy increased across successive versions, even for medium- and high-difficulty questions. Reproducibility was excellent, with Cohen’s kappa ranging from 0.82 to 0.91. Conclusions: ChatGPT, particularly versions 4.0 and 5.0, provides accurate, reproducible, and clinically relevant information on Optilume® therapy.
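For readers unfamiliar with the inter-rater agreement statistic used above, the following is a minimal sketch of Cohen's kappa for two raters. The grades below are hypothetical illustrations on the study's four-point scale, not the study's actual data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters graded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical grades: 1 = completely correct, 2 = correct but incomplete,
# 3 = partially misleading, 4 = completely incorrect.
urologist_1 = [1, 1, 2, 1, 3, 2, 1, 1, 2, 1]
urologist_2 = [1, 1, 2, 1, 3, 2, 1, 2, 2, 1]
print(round(cohens_kappa(urologist_1, urologist_2), 2))  # prints 0.82
```

Values above 0.80, as reported in the study, are conventionally interpreted as near-perfect agreement.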


Resumen

Antecedentes: Nuestro objetivo fue evaluar la precisión, integridad y reproducibilidad de las versiones 3.5, 4.0 y 5.0 de ChatGPT al responder preguntas orientadas al paciente y basadas en pautas sobre la terapia Optilume. Métodos: Se desarrollaron veinte preguntas estructuradas a partir de preguntas frecuentes de pacientes, foros en redes sociales y guías de procedimiento, que abarcan cinco áreas temáticas: Mecanismo e indicaciones del dispositivo, Técnica del procedimiento, Resultados y eficacia, Complicaciones y Manejo posoperatorio. Cada pregunta se formuló en las versiones 3.5 (gratuita), 4.0 (con suscripción) y 5.0 (última suscripción) de ChatGPT. Dos urólogos independientes calificaron las respuestas mediante una escala de cuatro puntos (completamente correcta, correcta pero incompleta, parcialmente engañosa, completamente incorrecta). También se analizaron la dificultad de las preguntas y el tipo de fuente (preguntas frecuentes vs. basada en guías). La reproducibilidad se evaluó mediante el índice kappa de Cohen. Resultados: En general, el rendimiento de ChatGPT mejoró progresivamente con cada versión. Las tasas de éxito combinadas (completamente correctas + correctas pero incompletas) fueron del 75% para la versión 3.5, del 85% para la versión 4.0 y del 90% para la versión 5.0. Los dominios de Mecanismo del Dispositivo y Técnica de Procedimiento alcanzaron la mayor precisión en todas las versiones, mientras que los dominios de Resultados, Complicaciones y Manejo Postoperatorio mejoraron notablemente en las versiones 4.0 y 5.0. Las preguntas frecuentes se respondieron con mayor precisión que las preguntas basadas en guías, alcanzando la versión 5.0 el 90% frente al 60%, respectivamente. La precisión de las respuestas aumentó con las versiones repetidas, incluso para preguntas de dificultad media y alta. La reproducibilidad fue excelente, con un índice kappa de Cohen que osciló entre 0.82 y 0.91. 
Conclusiones: ChatGPT, especialmente las versiones 4.0 y 5.0, proporciona información precisa, reproducible y clínicamente relevante sobre la terapia Optilume.


Keywords

ChatGPT; Optilume therapy; Urethral stricture; Artificial intelligence; Medical education


Palabras Clave

ChatGPT; Terapia Optilume; Estenosis uretral; Inteligencia artificial; Educación médica


Cite and Share

Eralp Kubilay, Hüseyin Gültekin. Evaluating ChatGPT versions 3.5, 4.0, and 5.0 in patient- and guideline-based questions on Optilume® therapy. Revista Internacional de Andrología. 2026; 24(1): 41–48.


