Research Article

Improving Text-to-Sql Conversion for Low-Resource Languages Using Large Language Models

Year 2025, Volume: 14 Issue: 1, 163 - 178, 26.03.2025
https://doi.org/10.17798/bitlisfen.1561298

Abstract

Accurate text-to-SQL conversion remains a challenge, particularly for low-resource languages such as Turkish. This study examines how effectively large language models (LLMs) translate Turkish natural language queries into SQL and introduces a two-stage fine-tuning approach to improve performance. Three widely used LLMs (Llama2, Llama3, and Phi3) are fine-tuned under two training strategies: direct SQL fine-tuning, and sequential fine-tuning, in which models are first trained on Turkish instruction data and then fine-tuned on SQL. The resulting six model configurations are evaluated using execution accuracy and logical form accuracy. The results indicate that the Phi3 models outperform both the Llama-based models and previously reported methods, achieving execution accuracy of up to 99.95% and logical form accuracy of 99.95%, exceeding the best scores in the literature by 5–10%. The study highlights the effectiveness of instruction-based fine-tuning in improving SQL query generation. It provides a detailed comparison of Llama-based and Phi-based models on text-to-SQL tasks, introduces a structured fine-tuning methodology designed for low-resource languages, and presents empirical evidence of the positive impact of strategic data augmentation on model performance. These findings contribute to the advancement of natural language interfaces for databases, particularly in languages with limited NLP resources. The scripts and models used during the training and testing phases of the study are publicly available at https://github.com/emirozturk/TT2SQL.
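The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: logical form accuracy as a normalized string match between the predicted and reference query, and execution accuracy as equality of the result sets when both queries run against the same database. The abstract does not specify the exact definitions used in the study, so the normalization rules, the `ogrenci` table, and the example query pairs below are all hypothetical and are not taken from the paper or its repository.

```python
import sqlite3


def normalize(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase the query.

    Simplification (assumption): lowercasing also affects string literals.
    """
    return " ".join(sql.strip().rstrip(";").split()).lower()


def logical_form_match(pred: str, gold: str) -> bool:
    """True when the two queries are identical after normalization."""
    return normalize(pred) == normalize(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """True when both queries return the same rows, ignoring row order."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)


def evaluate(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Compute both accuracies over (predicted, reference) query pairs."""
    n = len(pairs)
    lf_hits = sum(logical_form_match(p, g) for p, g in pairs)
    ex_hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return {
        "logical_form_accuracy": lf_hits / n,
        "execution_accuracy": ex_hits / n,
    }


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory table (hypothetical schema) for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ogrenci (ad TEXT, yas INTEGER)")
    conn.executemany(
        "INSERT INTO ogrenci VALUES (?, ?)",
        [("Ali", 20), ("Ayşe", 22), ("Mehmet", 19)],
    )
    return conn


# (predicted, reference) pairs; all hypothetical.
demo_pairs = [
    # Same query up to case and a semicolon: both metrics count a hit.
    ("SELECT ad FROM ogrenci WHERE yas > 19", "select ad from ogrenci where yas > 19;"),
    # Different text, same rows on this data: execution hit only.
    ("SELECT ad FROM ogrenci WHERE yas >= 20", "SELECT ad FROM ogrenci WHERE yas > 19"),
    # Different text and different rows: both miss.
    ("SELECT ad FROM ogrenci WHERE yas < 19", "SELECT ad FROM ogrenci WHERE yas > 19"),
    # Unknown column: the prediction fails to execute, so both miss.
    ("SELECT isim FROM ogrenci", "SELECT ad FROM ogrenci"),
]

if __name__ == "__main__":
    print(evaluate(build_demo_db(), demo_pairs))
```

The second pair shows why the two metrics can diverge: a prediction may differ textually from the reference yet return the same rows, so under these definitions execution accuracy is never lower than logical form accuracy, provided the reference queries are deterministic and execute without error.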

Ethical Statement

The study complies with research and publication ethics.


Details

Primary Language English
Subjects Natural Language Processing
Journal Section Research Article
Authors

Emir Öztürk 0000-0002-3734-5171

Publication Date March 26, 2025
Submission Date October 4, 2024
Acceptance Date February 12, 2025
Published in Issue Year 2025 Volume: 14 Issue: 1

Cite

IEEE E. Öztürk, “Improving Text-to-Sql Conversion for Low-Resource Languages Using Large Language Models”, Bitlis Eren Üniversitesi Fen Bilimleri Dergisi, vol. 14, no. 1, pp. 163–178, 2025, doi: 10.17798/bitlisfen.1561298.

Bitlis Eren University
Journal of Science Editor
Bitlis Eren University Graduate Institute
Bes Minare Mah. Ahmet Eren Bulvari, Merkez Kampus, 13000 BITLIS