Virus Host Prediction with Metagenomic Features using Support Vector Machine Algorithm and Grid Search Cross Validation Optimization
DOI:
https://doi.org/10.59247/jahir.v2i3.298
Keywords:
Virus, Host, Metagenomics, Support vector machine, Grid Search
Abstract
Viruses and bacteria continue to evolve alongside humans. Viruses spread rapidly, cause large loss of life worldwide, and remain dangerous pathogens behind many infectious diseases. Metagenomics is the application of large-scale sequencing technology to genetic material obtained directly from one or more environmental samples, yielding at least 50 Mb of randomly sampled sequence and multiple long sequences. Identifying the origin of a virus is important for preventing the spread of outbreaks, and understanding the biology of viruses and how they affect their ecosystems depends on knowing which hosts they infect. Metagenomic features derived from the viral genome can be used to determine the type of virus host. Predicting virus hosts has traditionally taken considerable time and effort, and machine learning is one technology that can help. We chose the support vector machine (SVM) algorithm to predict viral hosts from three metagenomic features, namely genome size, GC% and number of coding sequences (CDS), derived from 7326 viral genomes. The SVM model was further optimised with grid search (GS) and K-fold cross-validation (K-CV). This optimisation increased the accuracy of the model in predicting virus hosts from 80% to 84%.
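The workflow summarised in the abstract (an SVM on genome size, GC% and number of CDS, tuned by grid search with K-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the synthetic data, the two made-up host classes, the parameter grid, the 80/20 split and the choice of 5 folds are all assumptions introduced here, and the helper names `make_synthetic_genomes` and `tune_svm` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_synthetic_genomes(n_per_class=150, seed=0):
    """Hypothetical stand-in data: [genome size (bp), GC%, number of CDS]."""
    rng = np.random.default_rng(seed)
    # Two made-up host classes with different feature distributions.
    host_a = np.column_stack([
        rng.normal(45_000, 12_000, n_per_class),
        rng.normal(48, 6, n_per_class),
        rng.normal(65, 18, n_per_class),
    ])
    host_b = np.column_stack([
        rng.normal(12_000, 4_000, n_per_class),
        rng.normal(42, 5, n_per_class),
        rng.normal(9, 4, n_per_class),
    ])
    X = np.vstack([host_a, host_b])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


def tune_svm(X, y, n_splits=5, seed=0):
    """Grid-search SVM hyperparameters with stratified K-fold cross-validation."""
    # Hold out a test set that the grid search never sees.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Scale features first: genome size is orders of magnitude larger than GC%.
    model = make_pipeline(StandardScaler(), SVC())
    # Assumed grid; the values searched in the paper are not given in the abstract.
    param_grid = {
        "svc__C": [0.1, 1, 10],
        "svc__gamma": ["scale", 0.1, 1],
        "svc__kernel": ["rbf", "linear"],
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    # score() evaluates the refitted best estimator on the held-out data.
    return search, search.score(X_test, y_test)


X, y = make_synthetic_genomes()
search, test_accuracy = tune_svm(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(test_accuracy, 3))
```

Scaling is placed inside the pipeline so that each cross-validation fold is standardised using only its own training portion, which avoids leaking information from the validation fold into the hyperparameter choice.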
License
Copyright (c) 2024 Purwono Purwono, Annastasya Nabila Elsa Wulandari, Abdullah Cakan Indonesia

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All articles published in the JAHIR Journal are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This license grants the following permissions and obligations:
1. Permitted Uses:
- Sharing – You may copy and redistribute the material in any medium or format.
- Adaptation – You may remix, transform, and build upon the material for any purpose, including commercial use.
2. Conditions of Use:
- Attribution – You must give appropriate credit to the original author(s), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in a way that suggests the licensor endorses you or your use.
- ShareAlike – If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original (CC BY-SA 4.0).
- No Additional Restrictions – You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
3. Disclaimer:
- The JAHIR Journal and the authors are not responsible for any modifications, interpretations, or derivative works made by third parties using the published content.
- This license does not affect the ownership of copyrights, and authors retain full rights to their work.
For further details, please refer to the official Creative Commons Attribution-ShareAlike 4.0 International License.