This study analyzes the performance of the most frequently downloaded language models on the Hugging Face platform. For this purpose, the five most downloaded language models for Turkish and for English were selected. The analysis was conducted in three phases: in-context learning, question answering, and expert evaluation. The ARC, Turkish sentiment analysis, HellaSwag, and MMLU datasets were used for in-context learning. For the question-answering test, the models, trained on a text file created for this purpose, were asked questions drawn from that text. Finally, six experts evaluated the models' answers through a mobile application developed for the study. The F1 score was used for the in-context evaluation; the ROUGE-1, ROUGE-2, and ROUGE-L metrics were used for question answering; and the Elo and TrueSkill rating systems were used for the expert evaluations. The correlations among these metrics were computed, revealing a correlation of 0.74 between expert evaluations and question-answering performance, whereas in-context learning and question-answering performance were not correlated. When the language models were evaluated overall, timpal0l/mdeberta-v3-base-squad2 performed best. Both the Turkish and English language models performed best on the sentiment analysis dataset, with F1 scores above 0.85.
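As context for the expert-evaluation phase: Elo is a pairwise rating system in which each comparison (here, an expert preferring one model's answer over another's) shifts both models' ratings toward the observed outcome. The abstract does not give the Elo parameters used, so the sketch below assumes conventional hypothetical values (K-factor 32, initial rating 1000); it illustrates the rating-update rule, not the study's exact implementation.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one comparison.

    score_a is 1.0 if model A's answer was preferred, 0.0 if model B's,
    and 0.5 for a tie. k is the hypothetical K-factor.
    """
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start at the assumed initial rating of 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
# An expert prefers model_a's answer in one pairwise comparison:
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
```

Note that each update is zero-sum: the points gained by the preferred model equal those lost by the other, so the rating total is conserved across comparisons. TrueSkill differs by additionally modeling each rating's uncertainty.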
| Primary Language | English |
| --- | --- |
| Subjects | Natural Language Processing |
| Journal Section | Research Articles |
| Authors | |
| Publication Date | June 30, 2025 |
| Submission Date | December 11, 2024 |
| Acceptance Date | May 5, 2025 |
| Published in Issue | Year 2025, Issue 061 |