Hybrid Ensemble Learning for Classifying Prescription vs. Over-the-Counter Medicines on Large-Scale Categorical and Textual Data
DOI:
https://doi.org/10.59247/jahir.v3i1.341
Keywords:
Prescription (Rx) drug, Over-the-Counter (OTC) drug, Interpretable CART, TF-IDF, LightGBM ensemble
Abstract
Classifying drugs as Prescription (Rx) or Over-the-Counter (OTC) is an important aspect of pharmaceutical governance because it directly affects patient safety, drug access, and regulatory compliance. However, large-scale pharmaceutical data often consists of heterogeneous categorical variables and short texts, such as product names or indications, which brings challenges in the form of duplication, inconsistency, and potential class imbalance. These conditions call for a modeling approach that is not only accurate but also lightweight and explainable. This study proposes a hybrid ensemble model that combines three algorithms, namely CART, Random Forest, and LightGBM, through a weighted soft-voting mechanism. The approach pairs the transparency of decision trees with the reliability of modern boosting techniques. The main contribution of this study is to show that a low-complexity, domain-based pipeline can produce accurate, efficient, and easily auditable Rx and OTC classifications for both clinical and regulatory needs. The pre-processing pipeline applies TF-IDF to short text and One-Hot Encoding to categorical features, and adds simple dosage variables. All features were combined into a single feature matrix, and the ensemble was then trained with voting weights [1,1,8]. Evaluation covers Accuracy, Precision, Recall, F1-score, ROC-AUC, Brier score, the confusion matrix, and the ROC curve. Results on a balanced dataset of 50,000 samples showed consistent in-sample performance: Accuracy = 0.742; Precision = 0.742; Recall = 0.742; F1 = 0.742; ROC-AUC = 0.819; and Brier score = 0.214. The model distinguishes the two classes stably, with False Positive and False Negative errors in balance. In conclusion, this lightweight ensemble offers competitive predictive performance together with interpretability, so it has the potential to be applied to pharmacovigilance and drug classification.
Future work should add cross-validation, probability calibration, and robustness tests on out-of-distribution data to strengthen the reliability of the model.
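The weighted soft-voting step and the Brier score mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the probability values are made up for illustration, the function names are hypothetical, and the mapping of the weights [1,1,8] to CART, Random Forest, and LightGBM in that order is an assumption based on the order in which the abstract lists the algorithms.

```python
from typing import List, Sequence


def weighted_soft_vote(probas: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Return the weighted average of per-model positive-class probabilities."""
    if len(probas) != len(weights):
        raise ValueError("one weight is required per model")
    total = sum(weights)
    n_samples = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_samples)
    ]


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical P(Rx) outputs of the three base models for three products.
p_cart = [1.0, 0.0, 1.0]   # a single tree often yields hard 0/1 probabilities
p_rf = [0.8, 0.3, 0.4]
p_lgbm = [0.9, 0.2, 0.3]

# Assumed order: CART, Random Forest, LightGBM.
WEIGHTS = [1, 1, 8]

p_ens = weighted_soft_vote([p_cart, p_rf, p_lgbm], WEIGHTS)
labels = [int(p >= 0.5) for p in p_ens]   # 1 = Rx, 0 = OTC (assumed coding)
score = brier_score([1, 0, 1], p_ens)     # hypothetical true labels

print(p_ens, labels, score)
```

With a weight of 8 out of 10, the third model dominates the vote: in the third product above, CART alone predicts Rx, yet the ensemble probability falls below 0.5. The two lighter-weighted models act mainly as a small correction to the dominant model's probabilities.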
License
Copyright (c) 2025 Reina Melani, Dina Febrina

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All articles published in the JAHIR Journal are licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This license grants the following permissions and obligations:
1. Permitted Uses:
- Sharing – You may copy and redistribute the material in any medium or format.
- Adaptation – You may remix, transform, and build upon the material for any purpose, including commercial use.
2. Conditions of Use:
- Attribution – You must give appropriate credit to the original author(s), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in a way that suggests the licensor endorses you or your use.
- ShareAlike – If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original (CC BY-SA 4.0).
- No Additional Restrictions – You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
3. Disclaimer:
- The JAHIR Journal and the authors are not responsible for any modifications, interpretations, or derivative works made by third parties using the published content.
- This license does not affect the ownership of copyrights, and authors retain full rights to their work.
For further details, please refer to the official Creative Commons Attribution-ShareAlike 4.0 International License.



