Research Article

Performance analysis of the most downloaded Turkish and English language models on the Hugging-Face platform

Year 2025, Issue: 061, 13 - 24, 30.06.2025

Abstract

This study analyzes the performance of the most downloaded language models on the Hugging Face platform. For this purpose, the five most downloaded language models in Turkish and English were used. The analysis was carried out in three stages: in-context learning, question answering, and expert evaluation. The ARC, Turkish sentiment analysis, HellaSwag, and MMLU datasets were used for in-context learning. For the question-answering test, the models were asked questions about a text file prepared for this purpose. Finally, six experts evaluated the models' answers through a mobile application developed for the study. The F1 score was used for the in-context evaluation; the ROUGE-1, ROUGE-2, and ROUGE-L metrics for question answering; and the Elo and TrueSkill ratings for the expert evaluations. The correlations among these metrics were calculated, revealing a correlation of 0.74 between expert evaluations and question-answering performance, whereas in-context learning and question-answering performance were not correlated. When the language models were evaluated overall, the timpal0l/mdeberta-v3-base-squad2 model performed best. Both the Turkish and the English language models performed best on the sentiment analysis dataset, with F1 scores above 0.85.
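As background for two of the metrics named in the abstract, the following is a minimal Python sketch of a unigram ROUGE-1 F1 score and the standard Elo update rule. It is illustrative only, not the study's actual evaluation code: the whitespace tokenisation, the K-factor of 32, and the starting ratings of 1500 are assumptions chosen for the example.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """One Elo update after a pairwise comparison.

    score_a is 1.0 if model A's answer is preferred by the judge,
    0.0 if model B's is preferred, and 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models; an expert prefers model A's answer.
a, b = elo_update(1500, 1500, 1.0)  # → (1516.0, 1484.0)
```

Because the expected score depends only on the rating gap, repeated pairwise expert judgments gradually separate stronger models from weaker ones, which is what makes Elo (and its Bayesian generalisation, TrueSkill) suitable for ranking model answers from head-to-head comparisons.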


Details

Primary Language English
Subjects Natural Language Processing
Journal Section Research Articles
Authors

İnayet Hakkı Cizmeci 0000-0001-6202-4807

Kerem Gencer 0000-0002-2914-1056

Publication Date June 30, 2025
Submission Date December 11, 2024
Acceptance Date May 5, 2025
Published in Issue Year 2025 Issue: 061

Cite

IEEE İ. H. Cizmeci and K. Gencer, “Performance analysis of the most downloaded Turkish and English language models on the Hugging-Face platform”, JSR-A, no. 061, pp. 13–24, June 2025.