The Responses of Artificial Intelligence to Questions About Urological Emergencies: A Comparison of 3 Different Large Language Models

Ubeyd Sungur; Yusuf Arıkan; Ahmet Tuğrul Türkay; Hakan Polat

doi:10.33719/nju1645041

Research Article

Year 2025, Volume 20, Issue 2, pp. 89-96, 29.06.2025

Abstract

Objective: This study aimed to compare the accuracy and adequacy of the responses of three different large language models (LLMs) to fundamental questions about urological emergencies.
Material and Methods: Nine distinct urological emergency topics were identified, and seven fundamental questions were formulated for each topic (two on diagnosis, three on disease management, and two on complications), for a total of 63 questions. The questions were posed in English on three different free AI platforms (ChatGPT-4, Google Gemini 2.0 Flash, and Meta Llama 3.2), each built on a different infrastructure, and the responses were documented. The authors scored each answer on a scale of 1 to 4 for accuracy and adequacy, and the results were compared using statistical analysis.
Results: Across all question-answer pairs, ChatGPT scored slightly higher than Gemini and Meta Llama, but the difference among the groups was not statistically significant (3.8 ± 0.5, 3.7 ± 0.6, and 3.7 ± 0.5, respectively; p=0.146). When questions on diagnosis, treatment management, and complications were evaluated separately, no statistically significant differences were detected among the three LLMs (p=0.338, p=0.289, and p=0.407, respectively). Only one response, provided by Gemini, was found to be completely incorrect (1.6%). No misleading or incorrect answers were observed for the diagnosis-related questions on any of the three platforms. In total, misleading answers were observed in two questions (3.2%) for ChatGPT, three questions (4.7%) for Gemini, and two questions (3.2%) for Meta Llama.
Conclusion: LLMs predominantly provide accurate answers to basic and straightforward questions about urological emergencies, where prompt treatment is critical. Although no significant differences were observed among the responses of the three LLMs compared in this study, the presence of misleading and incorrect answers should be carefully considered, given the evolving nature and limitations of this technology.
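
The question count and the percentages quoted in the abstract can be re-derived with simple arithmetic. The following is a minimal sketch for checking those figures, not the authors' analysis code. The constant and function names (`N_TOPICS`, `QUESTIONS_PER_TOPIC`, `total_questions`, `share_percent`) are hypothetical, and only numbers stated in the abstract are used.

```python
# Question design as described in the abstract: 9 topics, each with
# 2 diagnosis, 3 management, and 2 complication questions.
N_TOPICS = 9
QUESTIONS_PER_TOPIC = {"diagnosis": 2, "management": 3, "complications": 2}

# Number of misleading responses per model, as reported in the abstract.
MISLEADING = {"ChatGPT": 2, "Gemini": 3, "Meta Llama": 2}
# Number of completely incorrect responses, as reported in the abstract.
INCORRECT = {"Gemini": 1}


def total_questions(n_topics: int, per_topic: dict) -> int:
    """Total number of questions across all topics."""
    return n_topics * sum(per_topic.values())


def share_percent(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, unrounded."""
    return 100.0 * count / total


total = total_questions(N_TOPICS, QUESTIONS_PER_TOPIC)
print(f"questions posed to each model: {total}")

for model, count in MISLEADING.items():
    pct = share_percent(count, total)
    print(f"{model}: {count}/{total} misleading = {pct:.2f}%")

for model, count in INCORRECT.items():
    pct = share_percent(count, total)
    print(f"{model}: {count}/{total} completely incorrect = {pct:.2f}%")
```

With 63 questions per model, 2/63 is about 3.17% and 1/63 is about 1.59%, matching the reported 3.2% and 1.6% after rounding. The value 3/63 is about 4.76%, so the 4.7% reported for Gemini appears to be truncated rather than rounded.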

Keywords

Urology, Emergencies, Artificial intelligence

Ethical Statement

ChatGPT, Google Gemini, and Meta Llama are publicly available artificial intelligence models, and our study involved no animal or human research participants. For these reasons, it did not require ethics committee approval.

Supporting Institution

None.

Details

Primary Language	English
Subjects	Urology
Section	Research Article
Authors	Ubeyd Sungur 0000-0002-8910-9859; Yusuf Arıkan 0000-0003-0823-7400; Ahmet Tuğrul Türkay 0009-0005-4210-8954; Hakan Polat 0000-0003-1525-1243
Publication Date	June 29, 2025
Submission Date	February 22, 2025
Acceptance Date	May 29, 2025
Published in Issue	Year 2025, Volume 20, Issue 2

Cite

Vancouver	Sungur U, Arıkan Y, Türkay AT, Polat H. The Responses of Artificial Intelligence to Questions About Urological Emergencies: A Comparison of 3 Different Large Language Models. New J Urol. 2025;20(2):89-96.
