Original Research

Open Access

Establishing a performance benchmark for artificial intelligence in pediatric urology: an expert evaluation of ChatGPT and Gemini on circumcision

  • Yasin Aktaş1,*,
  • Adem Tunçekin1

1Department of Urology, Faculty of Medicine, Uşak University, 64300 Uşak, Turkey

DOI: 10.22514/j.androl.2026.019 Vol. 24, Issue 2, June 2026, pp. 58–64

Submitted: 29 October 2025 Accepted: 31 December 2025

Published: 30 June 2026

*Corresponding Author(s): Yasin Aktaş E-mail: yasin.aktas@usak.edu.tr

Abstract

Background: Large language models (LLMs) based on artificial intelligence are increasingly being used for medical inquiries. However, the accuracy of these models regarding circumcision has not yet been evaluated. This study aimed to establish a performance benchmark by comparing the accuracy and quality of ChatGPT’s and Gemini’s responses to guideline-based and patient-focused circumcision questions.

Methods: The comparative study was conducted from August to October 2025. A total of 50 questions were analyzed: 30 guideline-based/academic questions derived from the European Association of Urology and the American Urological Association guidelines and 20 patient-focused frequently asked questions (FAQs) from reputable sources. Two board-certified urologists evaluated ChatGPT-5 and Gemini 2.5 Flash responses using Binary Accuracy Scoring (BAS) and Detailed Accuracy Scoring (DAS). Inter-rater reliability was assessed using Cohen’s Kappa coefficient and 95% confidence intervals (CI) were calculated via the Clopper-Pearson method. The Wilcoxon signed-rank test was used to compare paired ordinal data between models, and the Mann-Whitney U test was used to compare the different question categories (guideline-based vs. patient-focused).

Results: Regarding BAS, both models achieved 100% accuracy, meaning no factually incorrect answers were found. In terms of DAS, for guideline-based questions, both ChatGPT and Gemini achieved a “completely correct” rate of 93.3% (95% CI: 77.9–99.2%) (p = 0.705). For patient-focused FAQs, Gemini scored 85% (95% CI: 62.1–96.8%) and ChatGPT scored 75% (95% CI: 50.9–91.3%), with no statistically significant difference between the two models (p = 0.315). There were no significant differences between the guideline-based and patient-focused question groups for either model (ChatGPT: p = 0.68; Gemini: p = 0.322).

Conclusions: Both models demonstrated high reliability, providing a preliminary performance benchmark for this specific domain. While no significant performance difference was observed between the models in this dataset, qualitative limitations necessitate a “Physician-in-the-Loop” workflow, employing LLMs as drafting agents under expert supervision.
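
The Clopper-Pearson intervals reported above can be checked with a short calculation. The following is a minimal sketch, not the authors' analysis code: it assumes the counts implied by the reported percentages (28 of 30 for 93.3%, 17 of 20 for 85%, 15 of 20 for 75%), and the function names `upper_tail`, `solve_increasing`, and `clopper_pearson` are hypothetical helpers written for this illustration using only the Python standard library.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); this probability increases with p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def solve_increasing(f, target, iters=60):
    """Bisection on [0, 1] for an increasing function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: upper_tail(k, n, p), alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X >= k + 1) equals 1 - alpha / 2.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: upper_tail(k + 1, n, p), 1 - alpha / 2)
    return lower, upper

# Counts inferred from the reported percentages (an assumption, see above).
counts = {
    "Guideline-based, either model": (28, 30),
    "Patient-focused, Gemini": (17, 20),
    "Patient-focused, ChatGPT": (15, 20),
}

for label, (k, n) in counts.items():
    lo, hi = clopper_pearson(k, n)
    print(f"{label}: {100 * k / n:.1f}% (95% CI: {100 * lo:.1f}-{100 * hi:.1f}%)")
```

Under these assumed counts, the computed bounds should agree with the intervals in the Results to one decimal place. The width of the intervals (for example, roughly 51% to 91% for 15 of 20) reflects the small number of questions per category.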


Keywords

Artificial intelligence; ChatGPT; Gemini; Circumcision; Pediatric urology; Large language models



Cite and Share

Yasin Aktaş, Adem Tunçekin. Establishing a performance benchmark for artificial intelligence in pediatric urology: an expert evaluation of ChatGPT and Gemini on circumcision. Revista Internacional de Andrología. 2026; 24(2): 58–64.

