Review

Turkish Image Captioning: A Review of Current Studies, Applications, Datasets, Metrics and Potential Future Trends

Year 2025, Volume: 15, Issue: 2, 81 - 97, 30.05.2025

Abstract

Over the past decade, groundbreaking advances in computer vision and machine learning have ushered in an era in which intelligent systems have become more widespread, more diverse and more influential in human life. One of the major research areas accelerated and driven by these advances is automatic image captioning. In recent years, not only has a remarkable amount of work on automatic image captioning been proposed for a wide range of languages, but significant and promising advances have also been made in Turkish image captioning. In light of these achievements, this article reports a comprehensive review of existing studies and applications in Turkish image captioning. In addition to the current studies in the literature, application domains and datasets are covered. The paper also presents the common, standard metrics used to measure the captioning performance of image captioning systems. Finally, possible future trends and potential developments in Turkish image captioning are discussed from the authors’ perspective.
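The standard metrics referenced in this review include BLEU [110], METEOR [111], ROUGE [112], CIDEr [113] and SPICE [114]. As a minimal illustration of how one of these n-gram overlap scores is computed in practice, the Python sketch below rates a candidate Turkish caption against two reference captions with BLEU via NLTK’s sentence_bleu; the example captions, the whitespace tokenization and the smoothing choice are illustrative assumptions, not material from the paper.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Human reference captions and a model-generated candidate, tokenized by whitespace
# (a simplification; real pipelines typically apply proper Turkish tokenization).
references = [
    "bir adam sahilde köpeğiyle yürüyor".split(),
    "adam ve köpeği kumsalda yürüyüş yapıyor".split(),
]
candidate = "bir adam köpeğiyle sahilde yürüyor".split()

# BLEU-4 with uniform n-gram weights; smoothing avoids zero scores on short captions.
score = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")

In captioning benchmarks such scores are usually reported at the corpus level (e.g., NLTK’s corpus_bleu over all test images) rather than per sentence, alongside METEOR, ROUGE-L, CIDEr and SPICE.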

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [2] S. Yıldız, O. Aydemir, A. Memiş, and S. Varlı, “A turnaround control system to automatically detect and monitor the time stamps of ground service actions in airports: A deep learning and computer vision based approach,” Engineering Applications of Artificial Intelligence, vol. 114, p. 105032, 2022, doi:10.1016/j.engappai.2022.105032.
  • [3] X. Chen, X. Wang, K. Zhang, K.-M. Fung, T. C. Thai, K. Moore, R. S. Mannel, H. Liu, B. Zheng, and Y. Qiu, “Recent advances and clinical applications of deep learning in medical image analysis,” Medical image analysis, vol. 79, p. 102444, 2022, doi:10.1016/j.media.2022.102444.
  • [4] A. M. Ozbayoglu, M. U. Gudelek, and O. B. Sezer, “Deep learning for financial applications: A survey,” Applied soft computing, vol. 93, p. 106384, 2020, doi:10.1016/j.asoc.2020.106384.
  • [5] X. Zhifeng, “Virtual entertainment robot based on artificial intelligence image capture system for sports and fitness posture recognition,” Entertainment Computing, vol. 52, p. 100793, 2025, doi:10.1016/j.entcom.2024.100793.
  • [6] H. Wang, H. Wu, Z. He, L. Huang, and K. W. Church, “Progress in machine translation,” Engineering, vol. 18, pp. 143–153, 2022, doi:10.1016/j.eng.2021.03.023.
  • [7] R. K. Kaliyar, A. Goswami, and P. Narang, “FakeBERT: Fake news detection in social media with a BERT-based deep learning approach,” Multimedia Tools and Applications, vol. 80, no. 8, pp. 11765–11788, 2021, doi:10.1007/s11042-020-10183-2.
  • [8] Y. Yang, K. Zhang, and P. Kannan, “Identifying market structure: A deep network representation learning of social engagement,” Journal of Marketing, vol. 86, no. 4, pp. 37–56, 2022, doi:10.1177/00222429211033585.
  • [9] M. Soori, B. Arezoo, and R. Dastres, “Artificial intelligence, machine learning and deep learning in advanced robotics, a review,” Cognitive Robotics, vol. 3, pp. 54–70, 2023, doi:10.1016/j.cogr.2023.04.001.
  • [10] M. Ozkan-Ozay, E. Akin, Ö. Aslan, S. Kosunalp, T. Iliev, I. Stoyanov, and I. Beloev, “A comprehensive survey: Evaluating the efficiency of artificial intelligence and machine learning techniques on cyber security solutions,” IEEE Access, 2024, doi:10.1109/ACCESS.2024.3355547.
  • [11] Y. Matsuzaka and R. Yashiro, “Ai-based computer vision techniques and expert systems,” AI, vol. 4, no. 1, pp. 289–302, 2023, doi:10.3390/ai4010013.
  • [12] S. V. Mahadevkar, B. Khemani, S. Patil, K. Kotecha, D. R. Vora, A. Abraham, and L. A. Gabralla, “A review on machine learning styles in computer vision—techniques and future directions,” IEEE Access, vol. 10, pp. 107293–107329, 2022, doi:10.1109/ACCESS.2022.3209825.
  • [13] R. Kaur and S. Singh, “A comprehensive review of object detection with deep learning,” Digital Signal Processing, vol. 132, p. 103812, 2023, doi:10.1016/j.dsp.2022.103812.
  • [14] D. Meimetis, I. Daramouskas, I. Perikos, and I. Hatzilygeroudis, “Real-time multiple object tracking using deep learning methods,” Neural Computing and Applications, vol. 35, no. 1, pp. 89–118, 2023, doi:10.1007/s00521-021-06391-y.
  • [15] Y. Mo, Y. Wu, X. Yang, F. Liu, and Y. Liao, “Review the state-of-the-art technologies of semantic segmentation based on deep learning,” Neurocomputing, vol. 493, pp. 626–646, 2022, doi:10.1016/j.neucom.2022.01.005.
  • [16] S. Yıldız, “Turkish scene text recognition: Introducing extensive real and synthetic datasets and a novel recognition model,” Engineering Science and Technology, an International Journal, vol. 60, p. 101881, 2024, doi:10.1016/j.jestch.2024.101881.
  • [17] D. Sarvamangala and R. V. Kulkarni, “Convolutional neural networks in medical image understanding: a survey,” Evolutionary intelligence, vol. 15, no. 1, pp. 1–22, 2022, doi:10.1007/s12065-020-00540-3.
  • [18] L. Xu, Q. Tang, J. Lv, B. Zheng, X. Zeng, and W. Li, “Deep image captioning: A review of methods, trends and future challenges,” Neurocomputing, vol. 546, p. 126287, 2023, doi:10.1016/j.neucom.2023.126287.
  • [19] A. Verma, A. K. Yadav, M. Kumar, and D. Yadav, “Automatic image caption generation using deep learning,” Multimedia Tools and Applications, vol. 83, no. 2, pp. 5309–5325, 2024, doi:10.1007/s11042-023-15555-y.
  • [20] L. Agarwal and B. Verma, “From methods to datasets: A survey on Image-Caption Generators,” Multimedia Tools and Applications, vol. 83, no. 9, pp. 28077–28123, 2024, doi:10.1007/s11042-023-16560-x.
  • [21] S. Sreela and S. M. Idicula, “A Systematic Survey of Automatic Image Description Generation Systems,” International Journal of Image and Graphics, p. 2650002, 2024, doi:10.1142/S0219467826500026.
  • [22] W. Li, Z. Qu, H. Song, P. Wang, and B. Xue, “The traffic scene understanding and prediction based on image captioning,” IEEE Access, vol. 9, pp. 1420–1427, 2020, doi:10.1109/ACCESS.2020.3047091.
  • [23] Y. Ming, N. Hu, C. Fan, F. Feng, J. Zhou, and H. Yu, “Visuals to text: A comprehensive review on automatic image captioning,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 8, pp. 1339–1365, 2022, doi:10.1109/JAS.2022.105734.
  • [24] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, “A comprehensive survey of deep learning for image captioning,” ACM Computing Surveys (CsUR), vol. 51, no. 6, pp. 1–36, 2019, doi:10.1145/3295748.
  • [25] D. Sharma, C. Dhiman, and D. Kumar, “Evolution of visual data captioning methods, datasets, and evaluation metrics: A comprehensive survey,” Expert Systems with Applications, vol. 221, p. 119773, 2023, doi:10.1016/j.eswa.2023.119773.
  • [26] F. Chen, X. Li, J. Tang, S. Li, and T. Wang, “A survey on recent advances in image captioning,” in Journal of Physics: Conference Series, vol. 1914, no. 1. IOP Publishing, 2021, p. 012053, doi:10.1088/1742-6596/1914/1/012053.
  • [27] H. Sharma and D. Padha, “A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues,” Artificial Intelligence Review, vol. 56, no. 11, pp. 13619–13661, 2023, doi:10.1007/s10462-023-10488-2.
  • [28] X. Huang, K. Lu, S. Wang, J. Lu, X. Li, and R. Zhang, “Understanding remote sensing imagery like reading a text document: What can remote sensing image captioning offer?” International Journal of Applied Earth Observation and Geoinformation, vol. 131, p. 103939, 2024, doi:10.1016/j.jag.2024.103939.
  • [29] Y. Li, X. Zhang, X. Cheng, X. Tang, and L. Jiao, “Learning consensus-aware semantic knowledge for remote sensing image captioning,” Pattern Recognition, vol. 145, p. 109893, 2024, doi:10.1016/j.patcog.2023.109893.
  • [30] K. Zhao and W. Xiong, “Exploring region features in remote sensing image captioning,” International Journal of Applied Earth Observation and Geoinformation, vol. 127, p. 103672, 2024, doi:10.1016/j.jag.2024.103672.
  • [31] C. Chunseong Park, B. Kim, and G. Kim, “Attend to you: Personalized image captioning with context sequence memory networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 895–903, doi:10.1109/CVPR.2017.681.
  • [32] N. Sanguansub, P. Kamolrungwarakul, S. Poopair, K. Techaphonprasit, and T. Siriborvornratanakul, “Song lyrics recommendation for social media captions using image captioning, image emotion, and caption-lyric matching via universal sentence embedding,” Social Network Analysis and Mining, vol. 13, no. 1, p. 95, 2023, doi:10.1007/s13278-023-01097-6.
  • [33] J.-H. Wang, C.-W. Huang, and M. Norouzi, “Improving Rumor Detection by Image Captioning and Multi-Cell Bi-RNN With Self-Attention in Social Networks,” International Journal of Data Warehousing and Mining (IJDWM), vol. 18, no. 1, pp. 1–17, 2022, doi:10.4018/IJDWM.313189.
  • [34] A. Tran, A. Mathews, and L. Xie, “Transform and tell: Entity-aware news image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13035–13045, doi:10.1109/CVPR42600.2020.01305.
  • [35] Z. Zhang, H. Zhang, J. Wang, Z. Sun, and Z. Yang, “Generating news image captions with semantic discourse extraction and contrastive style-coherent learning,” Computers and Electrical Engineering, vol. 104, p. 108429, 2022, doi:10.1016/j.compeleceng.2022.108429.
  • [36] J. Chen and H. Zhuge, “A news image captioning approach based on multimodal pointer-generator network,” Concurrency and Computation: Practice and Experience, vol. 34, no. 7, p. e5721, 2022, doi:10.1002/cpe.5721.
  • [37] A. Selivanov, O. Y. Rogov, D. Chesakov, A. Shelmanov, I. Fedulova, and D. V. Dylov, “Medical image captioning via generative pretrained transformers,” Scientific Reports, vol. 13, no. 1, p. 4171, 2023, doi:10.1038/s41598-023-31223-5.
  • [38] G. Reale-Nosei, E. Amador-Domínguez, and E. Serrano, “From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation,” Medical Image Analysis, p. 103264, 2024, doi:10.1016/j.media.2024.103264.
  • [39] J. Pavlopoulos, V. Kougia, I. Androutsopoulos, and D. Papamichail, “Diagnostic captioning: a survey,” Knowledge and Information Systems, vol. 64, no. 7, pp. 1691–1722, 2022, doi:10.1007/s10115-022-01684-7.
  • [40] D. H. Fudholi, Y. Windiatmoko, N. Afrianto, P. E. Susanto, M. Suyuti, A. F. Hidayatullah, and R. Rahmadi, “Image captioning with attention for smart local tourism using efficientnet,” in IOP Conference Series: Materials Science and Engineering, vol. 1077, no. 1. IOP Publishing, 2021, p. 012038, doi:10.1088/1757-899X/1077/1/012038.
  • [41] S. Watcharabutsarakham, S. Marukatat, K. Kiratiratanapruk, and P. Temniranrat, “Image Captioning for Thai Cultures,” in 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP). IEEE, 2022, pp. 1–5, doi:10.1109/iSAI-NLP56921.2022.9960251.
  • [42] Y. Bounab, M. Oussalah, and A. Ferdenache, “Reconciling image captioning and user’s comments for urban tourism,” in 2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA). IEEE, 2020, pp. 1–6, doi:10.1109/IPTA50016.2020.9286602.
  • [43] B. Zheng, F. Liu, M. Zhang, T. Zhou, S. Cui, Y. Ye, and Y. Guo, “Image captioning for cultural artworks: a case study on ceramics,” Multimedia Systems, vol. 29, no. 6, pp. 3223–3243, 2023, doi:10.1007/s00530-023-01178-8.
  • [44] E. Cetinic, “Iconographic image captioning for artworks,” in Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part III. Springer, 2021, pp. 502–516, doi:10.1007/978-3-030-68796-0_36.
  • [45] Y. Lu, C. Guo, X. Dai, and F.-Y. Wang, “Artcap: A dataset for image captioning of fine art paintings,” IEEE Transactions on Computational Social Systems, vol. 11, no. 1, pp. 576–587, 2022, doi:10.1109/TCSS.2022.3223539.
  • [46] J. Yi, C. Wu, X. Zhang, X. Xiao, Y. Qiu, W. Zhao, T. Hou, and D. Cao, “Micer: a pre-trained encoder–decoder architecture for molecular image captioning,” Bioinformatics, vol. 38, no. 19, pp. 4562–4572, 2022, doi:10.1093/bioinformatics/btac545.
  • [47] A. Lozano, M. W. Sun, J. Burgess, L. Chen, J. J. Nirschl, J. Gu, I. Lopez, J. Aklilu, A. W. Katzer, C. Chiu et al., “BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature,” arXiv preprint arXiv:2501.07171, 2025, doi:10.48550/arXiv.2501.07171.
  • [48] D. Tran, N. T. Pham, N. Nguyen, and B. Manavalan, “Mol2Lang-VLM: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion,” in Proceedings of the 1st Workshop on Language + Molecules (L+M 2024), 2024, pp. 97–102, doi:10.18653/v1/2024.langmol-1.12.
  • [49] H. Sharma and D. Padha, “Domain-specific image captioning: a comprehensive review,” International Journal of Multimedia Information Retrieval, vol. 13, no. 2, pp. 1–27, 2024, doi:10.1007/s13735-024-00328-6.
  • [50] X. He and L. Deng, “Deep learning for image-to-text generation: A technical overview,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 109–116, 2017, doi:10.1109/MSP.2017.2741510.
  • [51] S. Bai and S. An, “A survey on automatic image caption generation,” Neurocomputing, vol. 311, pp. 291–304, 2018, doi:10.1016/j.neucom.2018.05.080.
  • [52] Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos, “Corpus-guided sentence generation of natural images,” in Proceedings of the 2011 conference on empirical methods in natural language processing, 2011, pp. 444–454.
  • [53] R. Mason and E. Charniak, “Nonparametric method for data-driven image captioning,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014, pp. 592–598.
  • [54] T. Ghandi, H. Pourreza, and H. Mahyar, “Deep learning approaches on image captioning: A review,” ACM Computing Surveys, vol. 56, no. 3, pp. 1–39, 2023, doi:10.1145/3617592.
  • [55] J. Aneja, A. Deshpande, and A. G. Schwing, “Convolutional image captioning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5561–5570, doi:10.1109/CVPR.2018.00583.
  • [56] K. R. Suresh, A. Jarapala, and P. Sudeep, “Image captioning encoder–decoder models using cnn-rnn architectures: A comparative study,” Circuits, Systems, and Signal Processing, vol. 41, no. 10, pp. 5719–5742, 2022, doi:10.1007/s00034-022-02050-2.
  • [57] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4651–4659, doi:10.1109/CVPR.2016.503.
  • [58] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, “Attention on attention for image captioning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4634–4643, doi:10.1109/ICCV.2019.00473.
  • [59] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10685–10694, doi:10.1109/CVPR.2019.01094.
  • [60] N. Xu, A.-A. Liu, J. Liu, W. Nie, and Y. Su, “Scene graph captioner: Image captioning based on structural visual representation,” Journal of Visual Communication and Image Representation, vol. 58, pp. 477–485, 2019, doi:10.1016/j.jvcir.2018.12.027.
  • [61] X. Li and S. Jiang, “Know more say less: Image captioning based on scene graphs,” IEEE Transactions on Multimedia, vol. 21, no. 8, pp. 2117–2130, 2019, doi:10.1109/TMM.2019.2896516.
  • [62] X. Xiao, L. Wang, K. Ding, S. Xiang, and C. Pan, “Deep hierarchical encoder-decoder network for image captioning,” IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2942–2956, 2019, doi:10.1109/TMM.2019.2915033.
  • [63] R. Castro, I. Pineda, W. Lim, and M. E. Morocho-Cayamcela, “Deep learning approaches based on transformer architectures for image captioning tasks,” IEEE Access, vol. 10, pp. 33679–33694, 2022, doi:10.1109/ACCESS.2022.3161428.
  • [64] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10578–10587, doi:10.1109/CVPR42600.2020.01059.
  • [65] K. Qian, Y. Pan, H. Xu, and L. Tian, “Transformer model incorporating local graph semantic attention for image caption,” The Visual Computer, pp. 1–12, 2023, doi:10.1007/s00371-023-03180-7.
  • [66] Y. Feng, L. Ma, W. Liu, and J. Luo, “Unsupervised image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4125–4134, doi:10.1109/CVPR.2019.00425.
  • [67] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 290–298, doi:10.1109/CVPR.2017.128.
  • [68] X. Hu, Z. Gan, J. Wang, Z. Yang, Z. Liu, Y. Lu, and L. Wang, “Scaling up vision-language pre-training for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 17980–17989, doi:10.1109/CVPR52688.2022.01745.
  • [69] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [70] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland, Eds. MIT Press, 1986, doi:10.7551/mitpress/4943.003.0128.
  • [71] A. Tsantekidis, N. Passalis, and A. Tefas, “Recurrent neural networks,” in Deep learning for robot perception and cognition. Elsevier, 2022, pp. 101–115, doi:10.1016/B978-0-32-385787-1.00010-5.
  • [72] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997, doi:10.1162/neco.1997.9.8.1735.
  • [73] A. Sherstinsky, “Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020, doi:10.1016/j.physd.2019.132306.
  • [74] K. Cho, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014, doi:10.48550/arXiv.1406.1078.
  • [75] O. Ondeng, H. Ouma, and P. Akuon, “A review of transformer-based approaches for image captioning,” Applied Sciences, vol. 13, no. 19, p. 11103, 2023, doi:10.3390/app131911103.
  • [76] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020, doi:10.48550/arXiv.2010.11929.
  • [77] M. J. Parseh and S. Ghadiri, “Graph-based image captioning with semantic and spatial features,” Signal Processing: Image Communication, vol. 133, p. 117273, 2025, doi:10.1016/j.image.2025.117273.
  • [78] J. Jia, X. Ding, S. Pang, X. Gao, X. Xin, R. Hu, and J. Nie, “Image captioning based on scene graphs: A survey,” Expert Systems with Applications, vol. 231, p. 120698, 2023, doi:10.1016/j.eswa.2023.120698.
  • [79] A. S. Al-Shamayleh, O. Adwan, M. A. Alsharaiah, A. H. Hussein, Q. M. Kharma, and C. I. Eke, “A comprehensive literature review on image captioning methods and metrics based on deep learning technique,” Multimedia Tools and Applications, vol. 83, no. 12, pp. 34219–34268, 2024, doi:10.1007/s11042-024-18307-8.
  • [80] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755, doi:10.1007/978-3-319-10602-1_48.
  • [81] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013, doi:10.1613/jair.3994.
  • [82] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014, doi:10.1162/tacl_a_00166.
  • [83] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3128–3137, doi:10.1109/TPAMI.2016.2598339.
  • [84] M. Grubinger, P. Clough, H. Müller, and T. Deselaers, “The iapr tc-12 benchmark: A new evaluation resource for visual information systems,” in International Workshop OntoImage, vol. 2, 2006.
  • [85] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using amazon’s mechanical turk,” in Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, 2010, pp. 139–147.
  • [86] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences from images,” in Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11. Springer, 2010, pp. 15–29, doi:10.1007/978-3-642-15561-1_2.
  • [87] V. Ordonez, G. Kulkarni, and T. Berg, “Im2text: Describing images using 1 million captioned photographs,” Advances in neural information processing systems, vol. 24, 2011.
  • [88] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016, doi:10.1145/2812802.
  • [89] A. Koochali, S. Kalkowski, A. Dengel, D. Borth, and C. Schulze, “Which languages do people speak on flickr? a language and geo-location study of the yfcc100m dataset,” in Proceedings of the 2016 ACM Workshop on Multimedia COMMONS, 2016, pp. 35–42, doi:10.1145/2983554.2983560.
  • [90] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, “Generation and comprehension of unambiguous object descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 11–20, doi:10.1109/CVPR.2016.9.
  • [91] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “Stylenet: Generating attractive visual captions with styles,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3137–3146, doi:10.1109/CVPR.2017.108.
  • [92] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565, doi:10.18653/v1/P18-1238.
  • [93] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson, “Nocaps: Novel object captioning at scale,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8948–8957, doi:10.1109/ICCV.2019.00904.
  • [94] O. Sidorov, R. Hu, M. Rohrbach, and A. Singh, “Textcaps: a dataset for image captioning with reading comprehension,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 2020, pp. 742–758, doi:10.1007/978-3-030-58536-5_44.
  • [95] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3558–3568, doi:10.1109/CVPR46437.2021.00356.
  • [96] Y. Yoshikawa, Y. Shigeto, and A. Takeuchi, “Stair captions: Constructing a large-scale japanese image caption dataset,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017, pp. 417–421, doi:10.48550/arXiv.1705.00823.
  • [97] A. Alsayed, T. M. Qadah, and M. Arif, “A performance analysis of transformer-based deep learning models for arabic image captioning,” Journal of King Saud University-Computer and Information Sciences, vol. 35, no. 9, p. 101750, 2023, doi:10.1016/j.jksuci.2023.101750.
  • [98] B. Das, R. Pal, M. Majumder, S. Phadikar, and A. A. Sekh, “A visual attention-based model for bengali image captioning,” SN Computer Science, vol. 4, no. 2, p. 208, 2023, doi:10.1007/s42979-023-01671-x.
  • [99] A. Rathi, “Deep learning apporach for image captioning in hindi language,” in 2020 international conference on computer, electrical & communication engineering (ICCECE). IEEE, 2020, pp. 1–8, doi:10.1109/ICCECE48148.2020.9223087.
  • [100] A. Mathews, L. Xie, and X. He, “Senticap: Generating image descriptions with sentiments,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016, doi:10.1609/aaai.v30i1.10475.
  • [101] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International journal of computer vision, vol. 123, pp. 32–73, 2017, doi:10.1007/s11263-016-0981-7.
  • [102] B. Qu, X. Li, D. Tao, and X. Lu, “Deep semantic understanding of high resolution remote sensing image,” in 2016 International conference on computer, information and telecommunication systems (Cits). IEEE, 2016, pp. 1–5, doi:10.1109/CITS.2016.7546397.
  • [103] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald, “Preparing a collection of radiology examinations for distribution and retrieval,” Journal of the American Medical Informatics Association, vol. 23, no. 2, pp. 304–310, 2016, doi:10.1093/jamia/ocv080.
  • [104] J. Vaishnavi and V. Narmatha, “Video captioning–a survey,” Multimedia Tools and Applications, pp. 1–32, 2024, doi:10.1007/s11042-024-18886-6.
  • [105] M. Abdar, M. Kollati, S. Kuraparthi, F. Pourpanah, D. McDuff, M. Ghavamzadeh, S. Yan, A. Mohamed, A. Khosravi, E. Cambria et al., “A review of deep learning for video captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, doi:10.1109/TPAMI.2024.3522295.
  • [106] M. S. Wajid, H. Terashima-Marin, P. Najafirad, and M. A. Wajid, “Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods,” Engineering Reports, vol. 6, no. 1, p. e12785, 2024, doi:10.1002/eng2.12785.
  • [107] D. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, 2011, pp. 190–200.
  • [108] J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video description dataset for bridging video and language,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5288–5296, doi:10.1109/CVPR.2016.571.
  • [109] M. Ravinder, V. Gupta, K. Arora, A. Ranjan, and Y.-C. Hu, “Video-Captioning Evaluation Metric for Segments (VEMS): A Metric for Segment-level Evaluation of Video Captions with Weighted Frames,” Multimedia Tools and Applications, vol. 83, no. 16, pp. 47699–47733, 2024, doi:10.1007/s11042-023-17328-z.
  • [110] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318, doi:10.3115/1073083.1073135.
  • [111] S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
  • [112] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
  • [113] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575, doi:10.1109/CVPR.2015.7299087.
  • [114] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer, 2016, pp. 382–398, doi:10.1007/978-3-319-46454-1_24.
  • [115] J. Giménez and L. Màrquez, “Linguistic features for automatic evaluation of heterogenous mt systems,” in Proceedings of the Second Workshop on Statistical Machine Translation, 2007, pp. 256–264.
  • [116] Y. Li, C. Wu, L. Li, Y. Liu, and J. Zhu, “Caption generation from road images for traffic scene modeling,” IEEE Transactions on intelligent transportation systems, vol. 23, no. 7, pp. 7805–7816, 2021, doi:10.1109/TITS.2021.3072970.
  • [117] H. Zhang, C. Xu, B. Xu, M. Jian, H. Liu, and X. Li, “TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip,” Information Technology and Control, vol. 53, no. 1, pp. 98–114, 2024, doi:10.5755/j01.itc.53.1.35095.
  • [118] Q. Cheng, H. Huang, Y. Xu, Y. Zhou, H. Li, and Z. Wang, “Nwpu-captions dataset and mlca-net for remote sensing image captioning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–19, 2022, doi:10.1109/TGRS.2022.3201474.
  • [119] M. Nivedita, P. Chandrashekar, S. Mahapatra, Y. A. V. Phamila, and S. K. Selvaperumal, “Image captioning for video surveillance system using neural networks,” International Journal of Image and Graphics, vol. 21, no. 04, p. 2150044, 2021, doi:10.1142/S0219467821500443.
  • [120] P. Liwen, “Image Captioning-based Smart Phone Forensics Analysis Model,” in Proceedings of the 2023 6th International Conference on Big Data Technologies, 2023, pp. 372–376, doi:10.1145/3627377.3627435.
  • [121] M. Jeon, J. Ko, and K. Cheoi, “Enhancing Surveillance Systems: Integration of Object, Behavior, and Space Information in Captions for Advanced Risk Assessment,” Sensors, vol. 24, no. 1, p. 292, 2024, doi:10.3390/s24010292.
  • [122] Y. Mori, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, “Image captioning for near-future events from vehicle camera images and motion information,” in 2021 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2021, pp. 1378–1384, doi:10.1109/IV48863.2021.9575562.
  • [123] H. Lee, J. Song, K. Hwang, and M. Kang, “Auto-Scenario Generator for Autonomous Vehicle Safety: Multi-modal Attention-based Image Captioning Model using Digital Twin Data,” IEEE Access, vol. 12, pp. 159670–159687, 2024, doi:10.1109/ACCESS.2024.3487588.
  • [124] F. Chen, C. Xu, Q. Jia, Y. Wang, Y. Liu, H. Zhang, and E. Wang, “Egocentric Vehicle Dense Video Captioning,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 137–146, doi:10.1145/3664647.3681214.
  • [125] D. Ghataoura and S. Ogbonnaya, “Application of image captioning and retrieval to support military decision making,” in 2021 international conference on military communication and information systems (ICMCIS). IEEE, 2021, pp. 1–8, doi:10.1109/ICMCIS52405.2021.9486395.
  • [126] S. Das, L. Jain, and A. Das, “Deep learning for military image captioning,” in 2018 21st International Conference on Information Fusion (FUSION). IEEE, 2018, pp. 2165–2171, doi:10.23919/ICIF.2018.8455321.
  • [127] L. Pan, C. Song, X. Gan, K. Xu, and Y. Xie, “Military Image Captioning for Low-Altitude UAV or UGV Perspectives,” Drones, vol. 8, no. 9, p. 421, 2024, doi:10.3390/drones8090421.
  • [128] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023, doi:10.48550/arXiv.2303.18223.
  • [129] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large language models,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024, doi:10.1145/3641289.
  • [130] M. E. Unal, B. Citamak, S. Yagcioglu, A. Erdem, E. Erdem, N. I. Cinbis, and R. Cakici, “TasvirEt: A benchmark dataset for automatic Turkish description generation from images,” in 2016 24th signal processing and communication application conference (SIU). IEEE, 2016, pp. 1977–1980, doi:10.1109/SIU.2016.7496155.
  • [131] N. Samet, S. Hiçsönmez, P. Duygulu, and E. Akbaş, “Could we create a training set for image captioning using automatic translation?” in 2017 25th Signal Processing and Communications Applications Conference (SIU). IEEE, 2017, pp. 1–4, doi:10.1109/SIU.2017.7960638.
  • [132] M. Kuyu, A. Erdem, and E. Erdem, “Image captioning in Turkish with subword units,” in 2018 26th Signal Processing and Communications Applications Conference (SIU). IEEE, 2018, pp. 1–4, doi:10.1109/SIU.2018.8404431.
  • [133] B. D. Yılmaz, A. E. Demir, E. B. Sönmez, and T. Yıldız, “Image captioning in turkish language,” in 2019 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE, 2019, pp. 1–5, doi:10.1109/ASYU48272.2019.8946358.
  • [134] T. Yıldız, E. B. Sönmez, B. D. Yılmaz, and A. E. Demir, “Image captioning in Turkish language: Database and model,” Journal of the Faculty of Engineering and Architecture of Gazi University, vol. 35, no. 4, pp. 2089–2100, 2020, doi:10.17341/gazimmfd.597089.
  • [135] S. B. Golech, S. B. Karacan, E. B. Sönmez, and H. Ayral, “A complete human verified turkish caption dataset for ms coco and performance evaluation with well-known image caption models trained against it,” in 2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME). IEEE, 2022, pp. 1–6, doi:10.1109/ICECCME55909.2022.9988025.
  • [136] B. Atıcı and S. İlhan Omurca, “Generating classified ad product image titles with image captioning,” in Trends in Data Engineering Methods for Intelligent Systems: Proceedings of the International Conference on Artificial Intelligence and Applied Mathematics in Engineering (ICAIAME 2020). Springer, 2021, pp. 211–219, doi:10.1007/978-3-030-79357-9_21.
  • [137] S. Yıldız, A. Memiş, and S. Varlı, “Automatic Turkish image captioning: The impact of deep machine translation,” in 2023 8th International Conference on Computer Science and Engineering (UBMK). IEEE, 2023, pp. 414–419, doi:10.1109/UBMK59864.2023.10286693.
  • [138] S. Yıldız, A. Memiş, and S. Varlı, “TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders,” Turkish Journal of Electrical Engineering & Computer Sciences, vol. 31, no. 6, pp. 1079–1098, 2023, doi:10.55730/1300-0632.4035.
  • [139] A. Ersoy, O. T. Yıldız, and S. Özer, “ORTPiece: An ORT-based Turkish image captioning network based on transformers and wordpiece,” in 2023 31st Signal Processing and Communications Applications Conference (SIU). IEEE, 2023, pp. 1–4, doi:10.1109/SIU59756.2023.10223956.
  • [140] M. A. C. Ertuğrul and S. İ. Omurca, “Generating image captions using deep neural networks,” in 2023 8th International Conference on Computer Science and Engineering (UBMK). IEEE, 2023, pp. 271–275, doi:10.1109/UBMK59864.2023.10286622.
  • [141] S. Yıldız, A. Memiş, and S. Varlı, “Turkish image captioning with vision transformer based encoders and text decoders,” in 2024 32nd Signal Processing and Communications Applications Conference (SIU). IEEE, 2024, pp. 1–4, doi:10.1109/SIU61531.2024.10600738.
  • [142] S. Schweter, “BERTurk - BERT models for Turkish,” Apr. 2020.
  • [143] G. Uludoğan, Z. Y. Balal, F. Akkurt, M. Türker, O. Güngör, and S. Üsküdarlı, “Turna: A turkish encoder-decoder language model for enhanced understanding and generation,” arXiv preprint arXiv:2401.14373, 2024, doi:10.48550/arXiv.2401.14373.
  • [144] H. T. Kesgin, M. K. Yuce, and M. F. Amasyali, “Developing and evaluating tiny to medium-sized turkish bert models,” arXiv preprint arXiv:2307.14134, 2023, doi:10.48550/arXiv.2307.14134.
  • [145] N. Tas, “Roberturk: Adjusting roberta for turkish,” arXiv preprint arXiv:2401.03515, 2024, doi:10.48550/arXiv.2401.03515.
  • [146] H. T. Kesgin, M. K. Yuce, E. Dogan, M. E. Uzun, A. Uz, H. E. Seyrek, A. Zeer, and M. F. Amasyali, “Introducing cosmosgpt: Monolingual training for turkish language models,” arXiv preprint arXiv:2404.17336, 2024, doi:10.48550/arXiv.2404.17336.
  • [147] H. Türkmen, O. Dikenelli, C. Eraslan, M. C. Callı, and S. S. Özbek, “Bioberturk: Exploring turkish biomedical language model development strategies in low-resource setting,” Journal of Healthcare Informatics Research, vol. 7, no. 4, pp. 433–446, 2023, doi:10.1007/s41666-023-00140-7.
  • [148] OpenAI, “ChatGPT,” https://chat.openai.com/chat, 2023.
  • [149] Google, “Bard,” https://bard.google.com/, 2023.
  • [150] Google, “Gemini,” https://gemini.google.com/, 2024.


Kaynakça

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [2] S. Yıldız, O. Aydemir, A. Memiş, and S. Varlı, “A turnaround control system to automatically detect and monitor the time stamps of ground service actions in airports: A deep learning and computer vision based approach,” Engineering Applications of Artificial Intelligence, vol. 114, p. 105032, 2022, doi:10.1016/j.engappai.2022.105032.
  • [3] X. Chen, X. Wang, K. Zhang, K.-M. Fung, T. C. Thai, K. Moore, R. S. Mannel, H. Liu, B. Zheng, and Y. Qiu, “Recent advances and clinical applications of deep learning in medical image analysis,” Medical image analysis, vol. 79, p. 102444, 2022, doi:10.1016/j.media.2022.102444.
  • [4] A. M. Ozbayoglu, M. U. Gudelek, and O. B. Sezer, “Deep learning for financial applications: A survey,” Applied soft computing, vol. 93, p. 106384, 2020, doi:10.1016/j.asoc.2020.106384.
  • [5] X. Zhifeng, “Virtual entertainment robot based on artificial intelligence image capture system for sports and fitness posture recognition,” Entertainment Computing, vol. 52, p. 100793, 2025, doi:10.1016/j.entcom.2024.100793.
  • [6] H. Wang, H. Wu, Z. He, L. Huang, and K. W. Church, “Progress in machine translation,” Engineering, vol. 18, pp. 143–153, 2022, doi:10.1016/j.eng.2021.03.023.
  • [7] R. K. Kaliyar, A. Goswami, and P. Narang, “FakeBERT: Fake news detection in social media with a BERT-based deep learning approach,” Multimedia tools and applications, vol. 80, no. 8, pp. 11 765–11 788, 2021, doi:10.1007/s11042-020-10183-2.
  • [8] Y. Yang, K. Zhang, and P. Kannan, “Identifying market structure: A deep network representation learning of social engagement,” Journal of Marketing, vol. 86, no. 4, pp. 37–56, 2022, doi:10.1177/00222429211033585.
  • [9] M. Soori, B. Arezoo, and R. Dastres, “Artificial intelligence, machine learning and deep learning in advanced robotics, a review,” Cognitive Robotics, vol. 3, pp. 54–70, 2023, doi:10.1016/j.cogr.2023.04.001.
  • [10] M. Ozkan-Ozay, E. Akin, Ö. Aslan, S. Kosunalp, T. Iliev, I. Stoyanov, and I. Beloev, “A comprehensive survey: Evaluating the efficiency of artificial intelligence and machine learning techniques on cyber security solutions,” IEEE Access, 2024, doi:10.1109/ACCESS.2024.3355547.
  • [11] Y. Matsuzaka and R. Yashiro, “Ai-based computer vision techniques and expert systems,” AI, vol. 4, no. 1, pp. 289–302, 2023, doi:10.3390/ai4010013.
  • [12] S. V. Mahadevkar, B. Khemani, S. Patil, K. Kotecha, D. R. Vora, A. Abraham, and L. A. Gabralla, “A review on machine learning styles in computer vision—techniques and future directions,” Ieee Access, vol. 10, pp. 107 293–107 329, 2022, doi:10.1109/ACCESS.2022.3209825.
  • [13] R. Kaur and S. Singh, “A comprehensive review of object detection with deep learning,” Digital Signal Processing, vol. 132, p. 103812, 2023, doi:10.1016/j.dsp.2022.103812.
  • [14] D. Meimetis, I. Daramouskas, I. Perikos, and I. Hatzilygeroudis, “Real-time multiple object tracking using deep learning methods,” Neural Computing and Applications, vol. 35, no. 1, pp. 89–118, 2023, doi:10.1007/s00521-021-06391-y.
  • [15] Y. Mo, Y. Wu, X. Yang, F. Liu, and Y. Liao, “Review the state-of-the-art technologies of semantic segmentation based on deep learning,” Neurocomputing, vol. 493, pp. 626–646, 2022, doi:10.1016/j.neucom.2022.01.005.
  • [16] S. Yıldız, “Turkish scene text recognition: Introducing extensive real and synthetic datasets and a novel recognition model,” Engineering Science and Technology, an International Journal, vol. 60, p. 101881, 2024, doi:10.1016/j.jestch.2024.101881.
  • [17] D. Sarvamangala and R. V. Kulkarni, “Convolutional neural networks in medical image understanding: a survey,” Evolutionary intelligence, vol. 15, no. 1, pp. 1–22, 2022, doi:10.1007/s12065-020-00540-3.
  • [18] L. Xu, Q. Tang, J. Lv, B. Zheng, X. Zeng, and W. Li, “Deep image captioning: A review of methods, trends and future challenges,” Neurocomputing, vol. 546, p. 126287, 2023, doi:10.1016/j.neucom.2023.126287.
  • [19] A. Verma, A. K. Yadav, M. Kumar, and D. Yadav, “Automatic image caption generation using deep learning,” Multimedia Tools and Applications, vol. 83, no. 2, pp. 5309–5325, 2024, doi:10.1007/s11042-023-15555-y.
  • [20] L. Agarwal and B. Verma, “From methods to datasets: A survey on Image-Caption Generators,” Multimedia Tools and Applications, vol. 83, no. 9, pp. 28 077–28 123, 2024, doi:10.1007/s11042-023-16560-x.
  • [21] S. Sreela and S. M. Idicula, “A Systematic Survey of Automatic Image Description Generation Systems,” International Journal of Image and Graphics, p. 2650002, 2024, doi:10.1142/S0219467826500026.
  • [22] W. Li, Z. Qu, H. Song, P. Wang, and B. Xue, “The traffic scene understanding and prediction based on image captioning,” IEEE Access, vol. 9, pp. 1420–1427, 2020, doi:10.1109/ACCESS.2020.3047091.
  • [23] Y. Ming, N. Hu, C. Fan, F. Feng, J. Zhou, and H. Yu, “Visuals to text: A comprehensive review on automatic image captioning,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 8, pp. 1339–1365, 2022, doi:10.1109/JAS.2022.105734.
  • [24] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, “A comprehensive survey of deep learning for image captioning,” ACM Computing Surveys (CsUR), vol. 51, no. 6, pp. 1–36, 2019, doi:10.1145/3295748.
  • [25] D. Sharma, C. Dhiman, and D. Kumar, “Evolution of visual data captioning methods, datasets, and evaluation metrics: A comprehensive survey,” Expert Systems with Applications, vol. 221, p. 119773, 2023, doi:10.1016/j.eswa.2023.119773.
  • [26] F. Chen, X. Li, J. Tang, S. Li, and T. Wang, “A survey on recent advances in image captioning,” in Journal of Physics: Conference Series, vol. 1914, no. 1. IOP Publishing, 2021, p. 012053, doi:10.1088/1742-6596/1914/1/012053.
  • [27] H. Sharma and D. Padha, “A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues,” Artificial Intelligence Review, vol. 56, no. 11, pp. 13 619–13 661, 2023, doi:10.1007/s10462-023-10488-2.
  • [28] X. Huang, K. Lu, S. Wang, J. Lu, X. Li, and R. Zhang, “Understanding remote sensing imagery like reading a text document: What can remote sensing image captioning offer?” International Journal of Applied Earth Observation and Geoinformation, vol. 131, p. 103939, 2024, doi:10.1016/j.jag.2024.103939.
  • [29] Y. Li, X. Zhang, X. Cheng, X. Tang, and L. Jiao, “Learning consensus-aware semantic knowledge for remote sensing image captioning ,” Pattern Recognition, vol. 145, p. 109893, 2024, doi:10.1016/j.patcog.2023.109893.
  • [30] K. Zhao and W. Xiong, “Exploring region features in remote sensing image captioning,” International Journal of Applied Earth Observation and Geoinformation, vol. 127, p. 103672, 2024, doi:10.1016/j.jag.2024.103672.
  • [31] C. Chunseong Park, B. Kim, and G. Kim, “Attend to you: Personalized image captioning with context sequence memory networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 895–903, doi:10.1109/CVPR.2017.681.
  • [32] N. Sanguansub, P. Kamolrungwarakul, S. Poopair, K. Techaphonprasit, and T. Siriborvornratanakul, “Song lyrics recommendation for social media captions using image captioning, image emotion, and caption-lyric matching via universal sentence embedding,” Social Network Analysis and Mining, vol. 13, no. 1, p. 95, 2023, doi:10.1007/s13278-023-01097-6.
  • [33] J.-H. Wang, C.-W. Huang, and M. Norouzi, “Improving Rumor Detection by Image Captioning and Multi-Cell Bi-RNN With Self-Attention in Social Networks,” International Journal of Data Warehousing and Mining (IJDWM), vol. 18, no. 1, pp. 1–17, 2022, doi:10.4018/IJDWM.313189.
  • [34] A. Tran, A. Mathews, and L. Xie, “Transform and tell: Entity-aware news image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13 035–13 045, doi:10.1109/CVPR42600.2020.01305.
  • [35] Z. Zhang, H. Zhang, J. Wang, Z. Sun, and Z. Yang, “Generating news image captions with semantic discourse extraction and contrastive style-coherent learning,” Computers and Electrical Engineering, vol. 104, p. 108429, 2022, doi:10.1016/j.compeleceng.2022.108429.
  • [36] J. Chen and H. Zhuge, “A news image captioning approach based on multimodal pointer-generator network,” Concurrency and Computation: Practice and Experience, vol. 34, no. 7, p. e5721, 2022, doi:10.1002/cpe.5721.
  • [37] A. Selivanov, O. Y. Rogov, D. Chesakov, A. Shelmanov, I. Fedulova, and D. V. Dylov, “Medical image captioning via generative pretrained transformers,” Scientific Reports, vol. 13, no. 1, p. 4171, 2023, doi:10.1038/s41598-023-31223-5.
  • [38] G. Reale-Nosei, E. Amador-Domínguez, and E. Serrano, “From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation,” Medical Image Analysis, p. 103264, 2024, doi:10.1016/j.media.2024.103264.
  • [39] J. Pavlopoulos, V. Kougia, I. Androutsopoulos, and D. Papamichail, “Diagnostic captioning: a survey,” Knowledge and Information Systems, vol. 64, no. 7, pp. 1691–1722, 2022, doi:10.1007/s10115-022-01684-7.
  • [40] D. H. Fudholi, Y. Windiatmoko, N. Afrianto, P. E. Susanto, M. Suyuti, A. F. Hidayatullah, and R. Rahmadi, “Image captioning with attention for smart local tourism using efficientnet,” in IOP Conference Series: Materials Science and Engineering, vol. 1077, no. 1. IOP Publishing, 2021, p. 012038, doi:10.1088/1757-899X/1077/1/012038.
  • [41] S. Watcharabutsarakham, S. Marukatat, K. Kiratiratanapruk, and P. Temniranrat, “Image Captioning for Thai Cultures,” in 2022 17th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP). IEEE, 2022, pp. 1–5, doi:10.1109/iSAI-NLP56921.2022.9960251.
  • [42] Y. Bounab, M. Oussalah, and A. Ferdenache, “Reconciling image captioning and user’s comments for urban tourism,” in 2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA). IEEE, 2020, pp. 1–6, doi:10.1109/IPTA50016.2020.9286602.
  • [43] B. Zheng, F. Liu, M. Zhang, T. Zhou, S. Cui, Y. Ye, and Y. Guo, “Image captioning for cultural artworks: a case study on ceramics,” Multimedia Systems, vol. 29, no. 6, pp. 3223–3243, 2023, doi:10.1007/s00530-023-01178-8.
  • [44] E. Cetinic, “Iconographic image captioning for artworks,” in Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part III. Springer, 2021, pp. 502–516, doi:10.1007/978-3-030-68796-0_36.
  • [45] Y. Lu, C. Guo, X. Dai, and F.-Y. Wang, “Artcap: A dataset for image captioning of fine art paintings,” IEEE Transactions on Computational Social Systems, vol. 11, no. 1, pp. 576–587, 2022, doi:10.1109/TCSS.2022.3223539.
  • [46] J. Yi, C. Wu, X. Zhang, X. Xiao, Y. Qiu, W. Zhao, T. Hou, and D. Cao, “Micer: a pre-trained encoder–decoder architecture for molecular image captioning,” Bioinformatics, vol. 38, no. 19, pp. 4562–4572, 2022, doi:10.1093/bioinformatics/btac545.
  • [47] A. Lozano, M. W. Sun, J. Burgess, L. Chen, J. J. Nirschl, J. Gu, I. Lopez, J. Aklilu, A. W. Katzer, C. Chiu et al., “BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature,” arXiv preprint arXiv:2501.07171, 2025, doi:10.48550/arXiv.2501.07171.
  • [48] D. Tran, N. T. Pham, N. Nguyen, and B. Manavalan, “ Mol2Lang-VLM: Vision-and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion,” in Proceedings of the 1st Workshop on Language+ Molecules (L+M 2024), 2024, pp. 97–102, doi:10.18653/v1/2024.langmol-1.12.
  • [49] H. Sharma and D. Padha, “Domain-specific image captioning: a comprehensive review,” International Journal of Multimedia Information Retrieval, vol. 13, no. 2, pp. 1–27, 2024, doi:10.1007/s13735-024-00328-6.
  • [50] X. He and L. Deng, “Deep learning for image-to-text generation: A technical overview,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 109–116, 2017, doi:10.1109/MSP.2017.2741510.
  • [51] S. Bai and S. An, “A survey on automatic image caption generation,” Neurocomputing, vol. 311, pp. 291–304, 2018, doi:10.1016/j.neucom.2018.05.080.
  • [52] Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos, “Corpus-guided sentence generation of natural images,” in Proceedings of the 2011 conference on empirical methods in natural language processing, 2011, pp. 444–454.
  • [53] R. Mason and E. Charniak, “Nonparametric method for data-driven image captioning,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014, pp. 592–598.
  • [54] T. Ghandi, H. Pourreza, and H. Mahyar, “Deep learning approaches on image captioning: A review,” ACM Computing Surveys, vol. 56, no. 3, pp. 1–39, 2023, doi:10.1145/3617592.
  • [55] J. Aneja, A. Deshpande, and A. G. Schwing, “Convolutional image captioning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5561–5570, doi:10.1109/CVPR.2018.00583.
  • [56] K. R. Suresh, A. Jarapala, and P. Sudeep, “Image captioning encoder–decoder models using cnn-rnn architectures: A comparative study,” Circuits, Systems, and Signal Processing, vol. 41, no. 10, pp. 5719–5742, 2022, doi:10.1007/s00034-022-02050-2.
  • [57] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4651–4659, doi:10.1109/CVPR.2016.503.
  • [58] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, “Attention on attention for image captioning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4634–4643, doi:10.1109/ICCV.2019.00473.
  • [59] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 685–10 694, doi:10.1109/CVPR.2019.01094.
  • [60] N. Xu, A.-A. Liu, J. Liu, W. Nie, and Y. Su, “Scene graph captioner: Image captioning based on structural visual representation,” Journal of Visual Communication and Image Representation, vol. 58, pp. 477–485, 2019, doi:10.1016/j.jvcir.2018.12.027.
  • [61] X. Li and S. Jiang, “Know more say less: Image captioning based on scene graphs,” IEEE Transactions on Multimedia, vol. 21, no. 8, pp. 2117–2130, 2019, doi:10.1109/TMM.2019.2896516.
  • [62] X. Xiao, L. Wang, K. Ding, S. Xiang, and C. Pan, “Deep hierarchical encoder-decoder network for image captioning,” IEEE Transactions on Multimedia, vol. 21, no. 11, pp. 2942–2956, 2019, doi:10.1109/TMM.2019.2915033.
  • [63] R. Castro, I. Pineda, W. Lim, and M. E. Morocho-Cayamcela, “Deep learning approaches based on transformer architectures for image captioning tasks,” IEEE Access, vol. 10, pp. 33 679–33 694, 2022, doi:10.1109/ACCESS.2022.3161428.
  • [64] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10 578–10 587, doi:10.1109/CVPR42600.2020.01059.
  • [65] K. Qian, Y. Pan, H. Xu, and L. Tian, “Transformer model incorporating local graph semantic attention for image caption,” The Visual Computer, pp. 1–12, 2023, doi:10.1007/s00371-023-03180-7.
  • [66] Y. Feng, L. Ma, W. Liu, and J. Luo, “Unsupervised image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4125–4134, doi:10.1109/CVPR.2019.00425.
  • [67] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 290–298, doi:10.1109/CVPR.2017.128.
  • [68] X. Hu, Z. Gan, J. Wang, Z. Yang, Z. Liu, Y. Lu, and L. Wang, “Scaling up vision-language pre-training for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 17 980–17 989, doi:10.1109/CVPR52688.2022.01745.
  • [69] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [70] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation, parallel distributed processing, explorations in the microstructure of cognition,” ed. de rumelhart and j. mcclelland. vol. 1. 1986, Biometrika, vol. 71, no. 599-607, p. 6, 1986, doi:10.7551/mitpress/4943.003.0128.
  • [71] A. Tsantekidis, N. Passalis, and A. Tefas, “Recurrent neural networks,” in Deep learning for robot perception and cognition. Elsevier, 2022, pp. 101–115, doi:10.1016/B978-0-32-385787-1.00010-5.
  • [72] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997, doi:10.1162/neco.1997.9.8.1735.
  • [73] A. Sherstinsky, “Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020, doi:10.1016/j.physd.2019.132306.
  • [74] K. Cho, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014, doi:10.48550/arXiv.1406.1078.
  • [75] O. Ondeng, H. Ouma, and P. Akuon, “A review of transformer-based approaches for image captioning,” Applied Sciences, vol. 13, no. 19, p. 11103, 2023, doi:10.3390/app131911103.
  • [76] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020, doi:10.48550/arXiv.2010.11929.
  • [77] M. J. Parseh and S. Ghadiri, “Graph-based image captioning with semantic and spatial features,” Signal Processing: Image Communication, vol. 133, p. 117273, 2025, doi:10.1016/j.image.2025.117273.
  • [78] J. Jia, X. Ding, S. Pang, X. Gao, X. Xin, R. Hu, and J. Nie, “Image captioning based on scene graphs: A survey,” Expert Systems with Applications, vol. 231, p. 120698, 2023, doi:10.1016/j.eswa.2023.120698.
  • [79] A. S. Al-Shamayleh, O. Adwan, M. A. Alsharaiah, A. H. Hussein, Q. M. Kharma, and C. I. Eke, “A comprehensive literature review on image captioning methods and metrics based on deep learning technique,” Multimedia Tools and Applications, vol. 83, no. 12, pp. 34 219–34 268, 2024, doi:10.1007/s11042-024-18307-8.
  • [80] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755, doi:10.1007/978-3-319-10602-1_48.
  • [81] M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” Journal of Artificial Intelligence Research, vol. 47, pp. 853–899, 2013, doi:10.1613/jair.3994.
  • [82] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014, doi:10.1162/tacl_a_00166.
  • [83] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3128–3137, doi:10.1109/TPAMI.2016.2598339.
  • [84] M. Grubinger, P. Clough, H. Müller, and T. Deselaers, “The IAPR TC-12 benchmark: A new evaluation resource for visual information systems,” in International Workshop OntoImage, vol. 2, 2006.
  • [85] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, “Collecting image annotations using Amazon’s Mechanical Turk,” in Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, 2010, pp. 139–147.
  • [86] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences from images,” in Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11. Springer, 2010, pp. 15–29, doi:10.1007/978-3-642-15561-1_2.
  • [87] V. Ordonez, G. Kulkarni, and T. Berg, “Im2text: Describing images using 1 million captioned photographs,” Advances in neural information processing systems, vol. 24, 2011.
  • [88] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “YFCC100M: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016, doi:10.1145/2812802.
  • [89] A. Koochali, S. Kalkowski, A. Dengel, D. Borth, and C. Schulze, “Which languages do people speak on flickr? a language and geo-location study of the yfcc100m dataset,” in Proceedings of the 2016 ACM Workshop on Multimedia COMMONS, 2016, pp. 35–42, doi:10.1145/2983554.2983560.
  • [90] J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, “Generation and comprehension of unambiguous object descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 11–20, doi:10.1109/CVPR.2016.9.
  • [91] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, “Stylenet: Generating attractive visual captions with styles,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3137–3146, doi:10.1109/CVPR.2017.108.
  • [92] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565, doi:10.18653/v1/P18-1238.
  • [93] H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson, “Nocaps: Novel object captioning at scale,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8948–8957, doi:10.1109/ICCV.2019.00904.
  • [94] O. Sidorov, R. Hu, M. Rohrbach, and A. Singh, “Textcaps: a dataset for image captioning with reading comprehension,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 2020, pp. 742–758, doi:10.1007/978-3-030-58536-5_44.
  • [95] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3558–3568, doi:10.1109/CVPR46437.2021.00356.
  • [96] Y. Yoshikawa, Y. Shigeto, and A. Takeuchi, “STAIR captions: Constructing a large-scale Japanese image caption dataset,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017, pp. 417–421, doi:10.48550/arXiv.1705.00823.
  • [97] A. Alsayed, T. M. Qadah, and M. Arif, “A performance analysis of transformer-based deep learning models for Arabic image captioning,” Journal of King Saud University-Computer and Information Sciences, vol. 35, no. 9, p. 101750, 2023, doi:10.1016/j.jksuci.2023.101750.
  • [98] B. Das, R. Pal, M. Majumder, S. Phadikar, and A. A. Sekh, “A visual attention-based model for Bengali image captioning,” SN Computer Science, vol. 4, no. 2, p. 208, 2023, doi:10.1007/s42979-023-01671-x.
  • [99] A. Rathi, “Deep learning approach for image captioning in Hindi language,” in 2020 international conference on computer, electrical & communication engineering (ICCECE). IEEE, 2020, pp. 1–8, doi:10.1109/ICCECE48148.2020.9223087.
  • [100] A. Mathews, L. Xie, and X. He, “Senticap: Generating image descriptions with sentiments,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016, doi:10.1609/aaai.v30i1.10475.
  • [101] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International journal of computer vision, vol. 123, pp. 32–73, 2017, doi:10.1007/s11263-016-0981-7.
  • [102] B. Qu, X. Li, D. Tao, and X. Lu, “Deep semantic understanding of high resolution remote sensing image,” in 2016 International conference on computer, information and telecommunication systems (CITS). IEEE, 2016, pp. 1–5, doi:10.1109/CITS.2016.7546397.
  • [103] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald, “Preparing a collection of radiology examinations for distribution and retrieval,” Journal of the American Medical Informatics Association, vol. 23, no. 2, pp. 304–310, 2016, doi:10.1093/jamia/ocv080.
  • [104] J. Vaishnavi and V. Narmatha, “Video captioning–a survey,” Multimedia Tools and Applications, pp. 1–32, 2024, doi:10.1007/s11042-024-18886-6.
  • [105] M. Abdar, M. Kollati, S. Kuraparthi, F. Pourpanah, D. McDuff, M. Ghavamzadeh, S. Yan, A. Mohamed, A. Khosravi, E. Cambria et al., “A review of deep learning for video captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, doi:10.1109/TPAMI.2024.3522295.
  • [106] M. S. Wajid, H. Terashima-Marin, P. Najafirad, and M. A. Wajid, “Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods,” Engineering Reports, vol. 6, no. 1, p. e12785, 2024, doi:10.1002/eng2.12785.
  • [107] D. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, 2011, pp. 190–200.
  • [108] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5288–5296, doi:10.1109/CVPR.2016.571.
  • [109] M. Ravinder, V. Gupta, K. Arora, A. Ranjan, and Y.-C. Hu, “Video-Captioning Evaluation Metric for Segments (VEMS): A Metric for Segment-level Evaluation of Video Captions with Weighted Frames,” Multimedia Tools and Applications, vol. 83, no. 16, pp. 47 699–47 733, 2024, doi:10.1007/s11042-023-17328-z.
  • [110] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318, doi:10.3115/1073083.1073135.
  • [111] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
  • [112] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
  • [113] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575, doi:10.1109/CVPR.2015.7299087.
  • [114] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic propositional image caption evaluation,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer, 2016, pp. 382–398, doi:10.1007/978-3-319-46454-1_24.
  • [115] J. Giménez and L. Màrquez, “Linguistic features for automatic evaluation of heterogenous MT systems,” in Proceedings of the Second Workshop on Statistical Machine Translation, 2007, pp. 256–264.
  • [116] Y. Li, C. Wu, L. Li, Y. Liu, and J. Zhu, “Caption generation from road images for traffic scene modeling,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 7805–7816, 2021, doi:10.1109/TITS.2021.3072970.
  • [117] H. Zhang, C. Xu, B. Xu, M. Jian, H. Liu, and X. Li, “TSIC-CLIP: Traffic Scene Image Captioning Model Based on CLIP,” Information Technology and Control, vol. 53, no. 1, pp. 98–114, 2024, doi:10.5755/j01.itc.53.1.35095.
  • [118] Q. Cheng, H. Huang, Y. Xu, Y. Zhou, H. Li, and Z. Wang, “NWPU-Captions dataset and MLCA-Net for remote sensing image captioning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–19, 2022, doi:10.1109/TGRS.2022.3201474.
  • [119] M. Nivedita, P. Chandrashekar, S. Mahapatra, Y. A. V. Phamila, and S. K. Selvaperumal, “Image captioning for video surveillance system using neural networks,” International Journal of Image and Graphics, vol. 21, no. 04, p. 2150044, 2021, doi:10.1142/S0219467821500443.
  • [120] P. Liwen, “Image Captioning-based Smart Phone Forensics Analysis Model,” in Proceedings of the 2023 6th International Conference on Big Data Technologies, 2023, pp. 372–376, doi:10.1145/3627377.3627435.
  • [121] M. Jeon, J. Ko, and K. Cheoi, “Enhancing Surveillance Systems: Integration of Object, Behavior, and Space Information in Captions for Advanced Risk Assessment,” Sensors, vol. 24, no. 1, p. 292, 2024, doi:10.3390/s24010292.
  • [122] Y. Mori, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, “Image captioning for near-future events from vehicle camera images and motion information,” in 2021 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2021, pp. 1378–1384, doi:10.1109/IV48863.2021.9575562.
  • [123] H. Lee, J. Song, K. Hwang, and M. Kang, “Auto-Scenario Generator for Autonomous Vehicle Safety: Multi-modal Attention-based Image Captioning Model using Digital Twin Data,” IEEE Access, vol. 12, pp. 159 670–159 687, 2024, doi:10.1109/ACCESS.2024.3487588.
  • [124] F. Chen, C. Xu, Q. Jia, Y. Wang, Y. Liu, H. Zhang, and E. Wang, “Egocentric Vehicle Dense Video Captioning,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 137–146, doi:10.1145/3664647.3681214.
  • [125] D. Ghataoura and S. Ogbonnaya, “Application of image captioning and retrieval to support military decision making,” in 2021 international conference on military communication and information systems (ICMCIS). IEEE, 2021, pp. 1–8, doi:10.1109/ICMCIS52405.2021.9486395.
  • [126] S. Das, L. Jain, and A. Das, “Deep learning for military image captioning,” in 2018 21st International Conference on Information Fusion (FUSION). IEEE, 2018, pp. 2165–2171, doi:10.23919/ICIF.2018.8455321.
  • [127] L. Pan, C. Song, X. Gan, K. Xu, and Y. Xie, “Military Image Captioning for Low-Altitude UAV or UGV Perspectives,” Drones, vol. 8, no. 9, p. 421, 2024, doi:10.3390/drones8090421.
  • [128] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023, doi:10.48550/arXiv.2303.18223.
  • [129] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large language models,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024, doi:10.1145/3641289.
  • [130] M. E. Unal, B. Citamak, S. Yagcioglu, A. Erdem, E. Erdem, N. I. Cinbis, and R. Cakici, “TasvirEt: A benchmark dataset for automatic Turkish description generation from images,” in 2016 24th signal processing and communication application conference (SIU). IEEE, 2016, pp. 1977–1980, doi:10.1109/SIU.2016.7496155.
  • [131] N. Samet, S. Hiçsönmez, P. Duygulu, and E. Akbaş, “Could we create a training set for image captioning using automatic translation?” in 2017 25th Signal Processing and Communications Applications Conference (SIU). IEEE, 2017, pp. 1–4, doi:10.1109/SIU.2017.7960638.
  • [132] M. Kuyu, A. Erdem, and E. Erdem, “Image captioning in Turkish with subword units,” in 2018 26th Signal Processing and Communications Applications Conference (SIU). IEEE, 2018, pp. 1–4, doi:10.1109/SIU.2018.8404431.
  • [133] B. D. Yılmaz, A. E. Demir, E. B. Sönmez, and T. Yıldız, “Image captioning in Turkish language,” in 2019 Innovations in Intelligent Systems and Applications Conference (ASYU). IEEE, 2019, pp. 1–5, doi:10.1109/ASYU48272.2019.8946358.
  • [134] T. Yıldız, E. B. Sönmez, B. D. Yılmaz, and A. E. Demir, “Image captioning in Turkish language: Database and model,” Journal of the Faculty of Engineering and Architecture of Gazi University, vol. 35, no. 4, pp. 2089–2100, 2020, doi:10.17341/gazimmfd.597089.
  • [135] S. B. Golech, S. B. Karacan, E. B. Sönmez, and H. Ayral, “A complete human verified Turkish caption dataset for MS COCO and performance evaluation with well-known image caption models trained against it,” in 2022 International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME). IEEE, 2022, pp. 1–6, doi:10.1109/ICECCME55909.2022.9988025.
  • [136] B. Atıcı and S. İlhan Omurca, “Generating classified ad product image titles with image captioning,” in Trends in Data Engineering Methods for Intelligent Systems: Proceedings of the International Conference on Artificial Intelligence and Applied Mathematics in Engineering (ICAIAME 2020). Springer, 2021, pp. 211–219, doi:10.1007/978-3-030-79357-9_21.
  • [137] S. Yıldız, A. Memiş, and S. Varlı, “Automatic Turkish image captioning: The impact of deep machine translation,” in 2023 8th International Conference on Computer Science and Engineering (UBMK). IEEE, 2023, pp. 414–419, doi:10.1109/UBMK59864.2023.10286693.
  • [138] S. Yıldız, A. Memiş, and S. Varlı, “TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders,” Turkish Journal of Electrical Engineering & Computer Sciences, vol. 31, no. 6, pp. 1079–1098, 2023, doi:10.55730/1300-0632.4035.
  • [139] A. Ersoy, O. T. Yıldız, and S. Özer, “ORTPiece: An ORT-based Turkish image captioning network based on transformers and wordpiece,” in 2023 31st Signal Processing and Communications Applications Conference (SIU). IEEE, 2023, pp. 1–4, doi:10.1109/SIU59756.2023.10223956.
  • [140] M. A. C. Ertuğrul and S. İ. Omurca, “Generating image captions using deep neural networks,” in 2023 8th International Conference on Computer Science and Engineering (UBMK). IEEE, 2023, pp. 271–275, doi:10.1109/UBMK59864.2023.10286622.
  • [141] S. Yıldız, A. Memiş, and S. Varlı, “Turkish image captioning with vision transformer based encoders and text decoders,” in 2024 32nd Signal Processing and Communications Applications Conference (SIU). IEEE, 2024, pp. 1–4, doi:10.1109/SIU61531.2024.10600738.
  • [142] S. Schweter, “BERTurk - BERT models for Turkish,” Apr. 2020.
  • [143] G. Uludoğan, Z. Y. Balal, F. Akkurt, M. Türker, O. Güngör, and S. Üsküdarlı, “TURNA: A Turkish encoder-decoder language model for enhanced understanding and generation,” arXiv preprint arXiv:2401.14373, 2024, doi:10.48550/arXiv.2401.14373.
  • [144] H. T. Kesgin, M. K. Yuce, and M. F. Amasyali, “Developing and evaluating tiny to medium-sized Turkish BERT models,” arXiv preprint arXiv:2307.14134, 2023, doi:10.48550/arXiv.2307.14134.
  • [145] N. Tas, “RoBERTurk: Adjusting RoBERTa for Turkish,” arXiv preprint arXiv:2401.03515, 2024, doi:10.48550/arXiv.2401.03515.
  • [146] H. T. Kesgin, M. K. Yuce, E. Dogan, M. E. Uzun, A. Uz, H. E. Seyrek, A. Zeer, and M. F. Amasyali, “Introducing cosmosGPT: Monolingual training for Turkish language models,” arXiv preprint arXiv:2404.17336, 2024, doi:10.48550/arXiv.2404.17336.
  • [147] H. Türkmen, O. Dikenelli, C. Eraslan, M. C. Callı, and S. S. Özbek, “BioBERTurk: Exploring Turkish biomedical language model development strategies in low-resource setting,” Journal of Healthcare Informatics Research, vol. 7, no. 4, pp. 433–446, 2023, doi:10.1007/s41666-023-00140-7.
  • [148] OpenAI, “ChatGPT,” https://chat.openai.com/chat, 2023.
  • [149] Google, “Bard,” https://bard.google.com/, 2023.
  • [150] Google, “Gemini,” https://gemini.google.com/, 2024.

Details

Primary Language: Turkish
Subjects: Electrical Engineering (Other)
Section: Academic and/or technological scientific article
Authors

Abbas Memiş 0000-0003-2645-8071

Publication Date: May 30, 2025
Submission Date: March 3, 2025
Acceptance Date: May 28, 2025
Published in Issue: Year 2025, Volume: 15, Issue: 2
