A Model for Drug Discovery on Unstructured Text using Semi-Supervised Learning and Fuzzy Matching

Mulunda, Christine; Wagacha, Peter; Muchemi, Lawrence

Center for Open Access in Science (COAS)
OPEN JOURNAL FOR INFORMATION TECHNOLOGY (OJIT)
ISSN (Online) 2620-0627 * ojit@centerprode.com

2025 - Volume 8 - Number 1

Christine K. Mulunda * ORCID: 0000-0003-1914-0188
University of Nairobi, Faculty of Science and Technology, Department of Computing and Informatics, Nairobi, KENYA

Peter W. Wagacha * ORCID: 0000-0002-9597-1170
University of Nairobi, Faculty of Science and Technology, Department of Computing and Informatics, Nairobi, KENYA

Lawrence Muchemi * ORCID: 0000-0001-5911-5679
University of Nairobi, Faculty of Science and Technology, Department of Computing and Informatics, Nairobi, KENYA

Open Journal for Information Technology, 2025, 8(1), 9-20 * https://doi.org/10.32591/coas.ojit.0801.02009k
Received: 26 January 2025 ▪ Revised: 5 July 2025 ▪ Accepted: 22 July 2025

LICENCE: Creative Commons Attribution 4.0 International License.

ABSTRACT:
Health-related discoveries are mainly published in journals, and the rate of publication increases as new information and discoveries emerge. Discovering latent medical terms in a document corpus is challenging when the researcher is not a domain expert and a viable database of medicine-related words is not readily available. The objective of the study was to investigate the methodologies and best practices that enable discovery of latent drug terms in a corpus of health publications, for effective dissemination at county and national levels. Fuzzy matching was considered because it supports both exact and near matching. The DrugBank dataset was chosen as the reference for drug terms because its comprehensive list of drugs is frequently updated and freely accessible. Semi-supervised learning was applied to model multi-search medical terms on an hourly basis. Drug-name recognition, sentence categorization and information retrieval are among the features described in the presented model.
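
As a rough illustration of exact and near matching against a reference list of drug terms, the sketch below scores each token of a sentence against a small term list. It is not the model presented in the article: the four-item term list stands in for DrugBank, the function name and the 0.85 threshold are hypothetical, and Python's standard-library difflib is used in place of whatever matcher the study employed.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a DrugBank-derived list of drug names.
DRUG_TERMS = ["aspirin", "ibuprofen", "metformin", "tamoxifen"]


def match_drug_terms(sentence, terms=DRUG_TERMS, threshold=0.85):
    """Return (token, drug term, score) for exact and near matches.

    The threshold is an assumed value; a score of 1.0 is an exact match.
    """
    matches = []
    for raw in sentence.split():
        # Normalise the token: drop surrounding punctuation, lowercase.
        token = raw.strip(".,;:()").lower()
        if not token:
            continue
        for term in terms:
            # ratio() is a similarity in [0, 1] based on matching blocks.
            score = SequenceMatcher(None, token, term).ratio()
            if score >= threshold:
                matches.append((token, term, round(score, 2)))
    return matches


if __name__ == "__main__":
    for hit in match_drug_terms("Patients received asprin and Metformin daily."):
        print(hit)
```

With this threshold, the misspelling "asprin" is reported as a near match for "aspirin", while "Metformin" is an exact match after lowercasing. A lower threshold would recover more misspellings at the cost of more false positives.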

KEY WORDS: fuzzy matching, latent drug recognition, classification, information retrieval, dissemination.

CORRESPONDING AUTHOR:
Christine K. Mulunda, University of Nairobi, Faculty of Science and Technology, Department of Computing and Informatics, Nairobi, KENYA
