Research Article

Automated Word-level Lip Reading using Convolutional Neural Networks with a New Turkish Dataset

Year 2025, Volume: 9, Issue: 1, 83-93, 31.07.2025

Abstract

Automated lip reading is a research problem that has developed considerably in recent years. In some cases, lip reading is evaluated both visually and audibly. Detecting an unwanted word in security-camera footage is an example of a purely visual lip-reading problem. Audio-visual datasets are not applicable where such image-only data is involved, because the audio of the spoken word cannot be obtained in every situation. In this study, we collected a new, image-only Turkish dataset. The dataset was produced from YouTube videos, an uncontrolled environment; as a result, the images present challenging conditions in terms of environmental factors such as lighting, angle, color, and the personal characteristics of the face. Despite varying facial features such as moustaches, beards, and make-up, a visual speech recognition model covering 10 classes, including single words and two-word phrases, was developed with Convolutional Neural Networks (CNNs) without any intervention on the data. Using only visual data and a deep learning approach, the proposed study obtained an automated visual speech recognition model; because it relies on visual data alone, its computational cost and resource usage are lower than those of multi-modal studies. It is also the first known study to address the lip-reading problem with a deep learning algorithm using a new dataset belonging to the Ural-Altaic languages.
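
The abstract describes the approach only at a high level: fixed-length clips of the speaker's mouth region are classified by a CNN into one of 10 word or phrase classes using visual data alone. The exact architecture is not given here, so the following is a minimal illustrative sketch in PyTorch; the 3D-convolution layout, 25-frame clips, and 64x64 crops are assumptions, while the 10-class output follows the abstract.

```python
# Minimal sketch (not the authors' exact architecture): a small 3D-CNN that
# classifies a fixed-length clip of mouth-region frames into one of 10
# word/phrase classes. The 10-class output follows the abstract; clip length,
# crop size, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LipReadingCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Two 3D-convolution blocks over (channels, time, height, width)
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),   # global pooling over time and space
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, height, width), e.g. 25 frames of 64x64 crops
        feats = self.features(clips).flatten(1)
        return self.classifier(feats)

if __name__ == "__main__":
    model = LipReadingCNN()
    dummy = torch.randn(2, 3, 25, 64, 64)  # two dummy clips
    logits = model(dummy)                   # shape: (2, 10)
    print(logits.shape)
```

A spatio-temporal (3D) convolution is a natural fit for word-level lip reading because the class depends on how the mouth shape changes across frames rather than on any single frame.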

Supporting Institution

Aselsan-Bites

References

  • [1] H. McGurk, J. MacDonald, “Hearing lips and seeing voices.” Nature, 264, pp. 746–748, 1976.
  • [2] A. Gabbay, A. Ephrat, T. Halperin, S. Peleg, “Seeing through noise: Speaker separation and enhancement using visually-derived speech.” arXiv preprint arXiv:1708.06767, 4, 2017.
  • [3] D. Stewart, R. Seymour, A. Pass, J. Ming, “Robust audio-visual speech recognition under noisy audio-video conditions.” IEEE transactions on cybernetics, 44, pp. 175–184, 2013.
  • [4] F.S. Lesani, F.F. Ghazvini, R. Dianat, “Mobile phone security using automatic lip reading.” in Proceedings of the 2015 9th International Conference on e-Commerce in Developing Countries: With focus on e-Business (ECDC). IEEE, 2015, pp. 1–5.
  • [5] S. Mathulaprangsan, C.Y. Wang, A.Z. Kusum, T.C. Tai, J.C. Wang, “A survey of visual lip reading and lip-password verification.” in Proceedings of the 2015 International Conference on Orange Technologies (ICOT). IEEE, 2015, pp. 22–25.
  • [6] S. Sengupta, A. Bhattacharya, P. Desai, A. Gupta, “Automated lip reading technique for password authentication.” International Journal of Applied Information Systems (IJAIS) 2012, pp. 18–24.
  • [7] J. Son Chung, A. Senior, O. Vinyals, A. Zisserman, “Lip reading sentences in the wild.” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6447–6456.
  • [8] Y.M. Assael, B. Shillingford, S. Whiteson, N. De Freitas, “Lipnet: Sentence-level lipreading.” arXiv preprint arXiv:1611.01599, 2, 2016.
  • [9] A. Ephrat, T. Halperin, S. Peleg, “Improved speech reconstruction from silent video.” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 455–462.
  • [10] A. Jaumard-Hakoun, K. Xu, C. Leboullenger, P. Roussel-Ragot, B. Denby, “An articulatory-based singing voice synthesis using tongue and lips imaging.” in Proceedings of the ISCA Interspeech 2016, 2016, vol. 2016, pp. 1467–1471.
  • [11] F. Bocquelet, T. Hueber, L. Girin, C. Savariaux, B. Yvert, “Real-time control of an articulatory-based speech synthesizer for brain computer interfaces.” PLoS computational biology 12, e1005119, 2016.
  • [12] A. Gabbay, A. Shamir, S. Peleg, “Visual speech enhancement.” arXiv preprint arXiv:1711.08789 2017.
  • [13] A.B. Mattos, D.A.B. Oliveira, “Multi-view mouth renderization for assisting lip-reading.” in Proceedings of the 15th International Web for All Conference, 2018, pp. 1–10.
  • [14] I. Matthews, T.F. Cootes, J.A. Bangham, S. Cox, R. Harvey, “Extraction of visual features for lipreading.” IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24, pp. 198–213.
  • [15] S.J. Cox, R.W. Harvey, Y. Lan, J.L. Newman, B.J. Theobald, “The challenge of multispeaker lip-reading.” in Proceedings of the AVSP, 2008, pp. 179–184.
  • [16] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, T. Huang, “AVICAR: Audio-visual speech corpus in a car environment.” in Proceedings of the Eighth International Conference on Spoken Language Processing, 2004. pp. 2489-2492.
  • [17] Y.W. Wong, S.I. Ch’ng, K.P. Seng, L.M. Ang, S.W. Chin, W.J. Chew, K.H. Lim, “A new multi-purpose audio-visual UNMC-VIER database with multiple variabilities.” Pattern recognition letters 2011, 32, pp. 1503–1510.
  • [18] C. McCool, S. Marcel, A. Hadid, M. Pietikäinen, P. Matejka, J. Cernocký, “Bi-Modal Person Recognition on a Mobile Phone: Using Mobile Phone Data” 2012 IEEE international conference on multimedia and expo workshops. IEEE, 2012.
  • [19] A. Rekik, A. Ben-Hamadou and W. Mahdi, “A new visual speech recognition approach for RGB-D cameras.” in Proceedings of the International Conference Image Analysis and Recognition, Springer, 2014, pp. 21-28
  • [20] D. Estival, S. Cassidy, F. Cox, D. Burnham, “AusTalk: An Audio-Visual Corpus of Australian English.” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 2014, pp. 3105-3109.
  • [21] D. Petrovska-Delacrétaz, S. Lelandais, J. Colineau, L. Chen, B. Dorizzi, M. Ardabilian, E. Krichen, M.A. Mellakh, A. Chaari, S. Guerfi, et al., “The IV2 Multimodal Biometric Database (Including Iris, 2D, 3D Stereoscopic, and Talking Face Data), and Applications and Systems”, IEEE Second International Conference on Biometrics: Theory, Applications and Systems, IEEE, 2008, pp. 1–7.
  • [22] J. Trojanová, M. Hrúz, P. Campr, M. Železný, “Design and Recording of Czech Audio-Visual Database with Impaired Conditions for Continuous Speech Recognition” in Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), 2008, pp. 1–5.
  • [23] S. Petridis, J. Shen, D. Cetin, M. Pantic, “Visual-only Recognition of Normal, Whispered and Silent Speech.” in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6219–6223.
  • [24] K. Kumar, T. Chen, R.M. Stern, “Profile View Lip Reading”, in Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’07), 2007, pp. IV-429.
  • [25] T. Afouras, J.S. Chung, A. Senior, O. Vinyals, A. Zisserman, “Deep Audio-Visual Speech Recognition” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8717–8727, 2018.
  • [26] A. Fernandez-Lopez, F.M. Sukno, “Survey on Automatic Lip-Reading in the Era of Deep Learning” Image and Vision Computing, vol. 78, pp. 53–72.
  • [27] Z. Kang, R. Horaud, M. Sadeghi, “Robust face frontalization for visual speech recognition.” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2485-2495.
  • [28] G. Zhao, M. Barnard, M. Pietikainen, “Lipreading with local spatiotemporal descriptors.” IEEE Transactions on Multimedia, vol. 11, pp. 1254–1265, 2009.
  • [29] M. Gurban, J.P. Thiran, “Information theoretic feature extraction for audio-visual speech recognition.” IEEE Transactions on Signal Processing, vol. 57, pp. 4765-4776, 2009.
  • [30] Ü. Atilla, F. Sabaz, “Turkish lip-reading using Bi-LSTM and deep learning models.” Engineering Science and Technology, an International Journal, p. 101206, 2022.
  • [31] A. Garg, J. Noyola, S. Bagadia, “Lip reading using CNN and LSTM” Univ. of Stanford, CS231, Tech. Rep. 2016.
  • [32] S. Yang, Y. Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, X. Chen, “LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild.” in Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, 2019, pp. 1–8.
  • [33] A. Yargıç, M. Doğan, “A lip reading application on MS Kinect camera.” in Proceedings of the 2013 IEEE INISTA. IEEE, 2013, pp. 1-5.
  • [34] I. Fung, B. Mak, “End-to-end low-resource lip-reading with maxout CNN and LSTM.” in Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2511-2515.
  • [35] T. Ozcan, A. Basturk, “Lip reading using convolutional neural networks with and without pre-trained models.” Balkan Journal of Electrical and Computer Engineering, vol. 7, pp.195-201, 2019.
  • [36] D.K. Margam, R. Aralikatti, T. Sharma, A. Thanda, S. Roy, S.M. Venkatesan, et al., “LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models.” arXiv preprint arXiv:1906.12170, 2019.
  • [37] S. Petridis, Z. Li, M. Pantic, “End-to-end visual speech recognition with LSTMs.” in Proceedings of the 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp.2592-2596.
  • [38] B. Martinez, P. Ma, S. Petridis, M. Pantic, “Lipreading using temporal convolutional networks.” in Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2020, pp.6319-6323.
  • [39] D. Kastaniotis, D. Tsourounis, A. Koureleas, B. Peev, C. Theoharatos, S. Fotopoulos, “Lip Reading in Greek words at unconstrained driving scenario.” in Proceedings of the 2019 10th International Conference on Information Intelligence Systems and Applications (IISA). IEEE, 2019, pp.1-6.
  • [40] A.M. Sarhan, N.M. Elshennawy, D.M. Ibrahim, “HLR-net: a hybrid lip-reading model based on deep convolutional neural networks.” Computers, Materials & Continua vol. 68, pp.1531-1549, 2021.
  • [41] J. Ting, C. Song, H. Huang, T. Tian, “A Comprehensive Dataset for Machine-Learning-based Lip-Reading Algorithm.” Procedia Computer Science vol. 199, pp. 1444-1449, 2022.
  • [42] Y. Fu, Y. Lu, R. Ni, “Chinese Lip-Reading Research Based on ShuffleNet and CBAM.” Applied Sciences vol. 13, 2023.
  • [43] N. Deshmukh, A. Ahire, S.H. Bhandari, A. Mali, K. Warkari, “Vision based Lip Reading System using Deep Learning.” in Proceedings of the 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI). IEEE, 2021, pp. 1-6.
  • [44] A. Berkol, T. Tümer-Sivri, N. Pervan-Akman, M. Çolak, H. Erdem, “Visual Lip Reading Dataset in Turkish.” Data, vol. 8, 2023.
  • [45] A. Berkol, N. Pervan-Akman, H. Erdem, “Türkçe Günlük Kelime ve İfadeler Kullanarak CNN ve LSTM ile Görsel Konuşma Tanıma.” International Journal of Multidisciplinary Studies and Innovative Technologies, vol. 8, no. 2, pp. 69–75.
  • [46] A. Berkol, N. Pervan-Akman, H. Erdem, “Lip reading multiclass classification by using dilated CNN with Turkish dataset.” in Proceedings of the 2022 International Conference on Electrical, Computer and Energy Technologies (ICECET). IEEE, 2022.


Details

Primary Language: Turkish
Subjects: Speech Recognition
Section: Articles
Authors

Ali Berkol 0000-0002-3056-1226

Nergis Pervan Akman 0000-0003-3241-6812

Hamit Erdem 0000-0003-1704-1581

Early View Date: July 17, 2025
Publication Date: July 31, 2025
Submission Date: March 3, 2025
Acceptance Date: April 27, 2025
Published Issue: Year 2025, Volume: 9, Issue: 1

How to Cite

IEEE A. Berkol, N. Pervan Akman, and H. Erdem, “Yeni Türkçe Veri Seti ile Evrişimli Sinir Ağları Kullanarak Kelime Seviyesinde Otomatik Dudak Okuma”, IJMSIT, vol. 9, no. 1, pp. 83–93, 2025.