Hybrid Ensemble Learning for Classifying Prescription vs. Over-the-Counter Medicines on Large-Scale Categorical and Textual Data

Reina Melani; Dina Febrina

doi:10.59247/jahir.v3i1.341

Authors

Reina Melani, Universitas Harapan Bangsa
Dina Febrina, Universitas Harapan Bangsa

Keywords:

Prescription (Rx) drug, Over-the-Counter (OTC) drug, Interpretable CART, TF-IDF, LightGBM ensemble

Abstract

The classification of drugs into Prescription (Rx) and Over-the-Counter (OTC) categories is an important aspect of pharmaceutical governance because it directly affects patient safety, drug access, and regulatory compliance. However, large-scale pharmaceutical data often consist of heterogeneous categorical variables and short texts, such as product names or indications, and suffer from duplication, inconsistencies, and potential class imbalance. These conditions demand a modeling approach that is not only accurate but also lightweight and explainable. This study proposes a hybrid ensemble model that combines three algorithms, CART, Random Forest, and LightGBM, through a weighted soft-voting mechanism, pairing the transparency of decision trees with the reliability of modern boosting techniques. The main contribution is to show that a low-complexity, domain-based pipeline can produce accurate, efficient, and easily auditable Rx/OTC classifications for both clinical and regulatory needs. The pre-processing pipeline applies TF-IDF to short text and One-Hot Encoding to categorical features, and adds simple dosage variables. All features were combined into a single dense matrix, and the ensemble was trained with voting weights [1, 1, 8]. Evaluation covers Accuracy, Precision, Recall, F1-score, ROC-AUC, Brier score, the confusion matrix, and the ROC curve. Results on a balanced dataset of 50,000 samples showed consistent in-sample performance: Accuracy = 0.742, Precision = 0.742, Recall = 0.742, F1 = 0.742, ROC-AUC = 0.819, and Brier score = 0.214. The model distinguishes the two classes stably, with a balance between False Positive and False Negative errors. In conclusion, this lightweight ensemble offers competitive predictive performance together with interpretability, giving it potential for use in pharmacovigilance and drug classification. Future work should add cross-validation, probability calibration, and robustness tests on out-of-distribution data to strengthen the reliability of the model.
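The weighted soft-voting step and the Brier score named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The probability values are invented for illustration, the function names are hypothetical, and the mapping of the weights [1, 1, 8] to CART, Random Forest, and LightGBM (in that order) is an assumption based on the order in which the abstract lists the models.

```python
from typing import List, Sequence


def weighted_soft_vote(probas: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Return the weighted average of per-model positive-class probabilities.

    probas[m][i] is model m's predicted probability that sample i is Rx.
    """
    if len(probas) != len(weights):
        raise ValueError("one weight per model is required")
    total = float(sum(weights))
    n_samples = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_samples)
    ]


def brier_score(y_true: Sequence[int], p_pred: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


# Hypothetical P(Rx) outputs of three already-fitted models on four samples.
cart_p = [0.0, 1.0, 1.0, 0.0]   # a single tree often outputs hard 0/1 values
rf_p = [0.4, 0.6, 0.7, 0.2]
lgbm_p = [0.2, 0.9, 0.4, 0.1]

# Assumed order: CART, Random Forest, LightGBM.
weights = [1, 1, 8]

ensemble_p = weighted_soft_vote([cart_p, rf_p, lgbm_p], weights)
labels = [1 if p >= 0.5 else 0 for p in ensemble_p]  # 1 = Rx, 0 = OTC

y_true = [0, 1, 1, 0]  # hypothetical ground truth
print("ensemble probabilities:", ensemble_p)
print("predicted labels:", labels)
print("Brier score:", brier_score(y_true, ensemble_p))
```

In this toy input, the third sample is rated Rx by two of the three models, yet the ensemble probability stays just under 0.5 because the heavily weighted third model disagrees. With weights of this shape the ensemble largely follows its dominant member, and the lighter models shift the result only near the decision threshold.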
