Promoter Classification in Human Genome via DNA2Vec and UNK-Aware Deep Neural Networks

Aleyna Mengen; Emre Delibaş

Research Article

Year 2025, Volume: 3 Issue: 1, 92 - 99, 30.06.2025

Abstract

This study proposes a hybrid classification approach for distinguishing promoter from non-promoter DNA sequences in the Homo sapiens genome. It combines DNA2Vec-based embedded representations with a deep neural network (DNN) architecture strengthened by UNK character awareness. The model represents unknown or low-confidence k-mers with a specially initialized UNK vector, which prevents loss of contextual information and improves generalization capacity. The dataset was built from equal numbers of promoter and non-promoter sequences, and stratified 5-fold cross-validation was applied in the evaluation. On the test set, the optimized model achieved 85.03% accuracy, 0.8786 precision, 0.8128 recall, a 0.8444 F1 score, and 0.9376 ROC-AUC, results that are better than or comparable to those of many more complex models in the literature on the human genome. The results indicate that the proposed architecture offers a strong, interpretable, and computationally efficient alternative and that, with its motif-independent learning capability, it can be used in practice in bioinformatics applications. Future work is recommended to investigate cross-species generalization and integration with attention-based models such as the Transformer.
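
The reported precision (0.8786), recall (0.8128), and F1 score (0.8444) can be cross-checked, since F1 is the harmonic mean of precision and recall. The short Python check below uses only the figures quoted in the abstract; the function name `f1_score` is illustrative and is not taken from the paper.

```python
# Test-set metrics as quoted in the abstract.
precision = 0.8786
recall = 0.8128
reported_f1 = 0.8444


def f1_score(p: float, r: float) -> float:
    """Return the harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


f1 = f1_score(precision, recall)
print(f"recomputed F1 = {f1:.4f}, reported F1 = {reported_f1:.4f}")
```

The recomputed value agrees with the reported F1 score to four decimal places, so the three figures are mutually consistent.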

Keywords

Promoter Classification, DNA2Vec, Deep Neural Network, UNK Vector, Bioinformatics

Abstract

This study proposes a new hybrid model that combines DNA2Vec-based embedded representations, extended with UNK character support, and a deep neural network (DNN) architecture to classify promoter and non-promoter DNA sequences from the Homo sapiens genome. Representing unknown or low-confidence k-mers with an UNK vector serves two aims: minimizing the loss of contextual information and improving generalization performance. The model, built with the GELU activation function and the AdamW optimization algorithm, achieved strong and balanced results when evaluated with stratified 5-fold cross-validation: 85.03% accuracy, 0.9376 ROC-AUC, and a 0.8444 F1 score. The findings indicate that the proposed structure is a simpler yet effective alternative to the more complex models reported in the literature. Furthermore, because it supports motif-independent learning, the architecture offers practical and interpretable solutions for bioinformatics applications. Future work should extend the model's generalization across species and evaluate its integration with Transformer-based models.
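
The idea of mapping unknown k-mers to a dedicated UNK vector can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the k-mer length `K = 3`, the toy embedding size `DIM = 4`, the token name `<UNK>`, the randomly generated stand-in table (real DNA2Vec vectors are pretrained), the small-Gaussian UNK initialization, and the mean pooling of k-mer vectors.

```python
import random

K = 3          # k-mer length (assumption; not stated in the abstract)
DIM = 4        # toy embedding size (assumption)
UNK = "<UNK>"  # hypothetical name for the unknown-k-mer token


def kmers(seq: str, k: int = K) -> list[str]:
    """Slide a window of width k over the sequence with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_toy_table(vocab: list[str], dim: int = DIM, seed: int = 0) -> dict[str, list[float]]:
    """Random stand-in for pretrained k-mer vectors, plus one UNK vector."""
    rng = random.Random(seed)
    table = {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for w in vocab}
    # One possible UNK initialization (assumption): small Gaussian noise.
    table[UNK] = [rng.gauss(0.0, 0.01) for _ in range(dim)]
    return table


def embed(seq: str, table: dict[str, list[float]]) -> tuple[list[float], int]:
    """Mean-pool k-mer vectors; k-mers missing from the table map to UNK."""
    tokens = kmers(seq)
    if not tokens:
        raise ValueError("sequence is shorter than k")
    vectors, n_unk = [], 0
    for token in tokens:
        if token in table:
            vectors.append(table[token])
        else:
            vectors.append(table[UNK])
            n_unk += 1
    dim = len(table[UNK])
    pooled = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    return pooled, n_unk


table = build_toy_table(["ACG", "CGT", "GTA", "TAC"])
vector, n_unk = embed("ACGTNACG", table)
print(len(vector), n_unk)  # 4 3
```

In this toy input the ambiguous base `N` makes three of the six 3-mers unknown, and each of them still contributes a vector instead of being dropped, so the pooled representation keeps a fixed length and every position in the sequence is accounted for.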

Keywords

Promoter Classification, DNA2Vec, Deep Neural Network, UNK Vector, Bioinformatics

Details

Primary Language	English
Subjects	Data Engineering and Data Science
Journal Section	Research Articles
Authors	Aleyna Mengen 0009-0008-7310-1394 Emre Delibaş 0000-0001-7564-5020
Publication Date	June 30, 2025
Submission Date	June 4, 2025
Acceptance Date	June 16, 2025
Published in Issue	Year 2025 Volume: 3 Issue: 1

Cite

IEEE	A. Mengen and E. Delibaş, “Promoter Classification in Human Genome via DNA2Vec and UNK-Aware Deep Neural Networks”, CÜMFAD, vol. 3, no. 1, pp. 92–99, 2025.
