Disease diagnosis from unstructured medical texts using machine learning techniques
https://doi.org/10.47093/2713-069X.2025.6.4.55-63
Abstract
Modern machine learning methods open new opportunities for analyzing medical texts. The use of unstructured data enables improved clinical decision support and the development of personalized patient treatment approaches.
Aim. To develop an optimal algorithm for predicting diseases by multi-label classification of medical texts from selected patient treatment cases.
Materials and methods. The study used anonymized electronic medical records of 387,590 patients. Textual data were lemmatized and vectorized with a pretrained FastText model. A multi-label classification model was developed to predict 156 diagnostic categories grouped by major disease classes. Neural network architectures and decision tree ensembles were used to build the models.
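To illustrate what multi-label prediction means here, the following minimal sketch scores each diagnostic category independently and keeps every category whose score passes a threshold, so one text can receive several diagnoses at once. It is not the authors' model: the three label names, the 4-dimensional document vector, the weights, the biases, and the 0.5 threshold are all hypothetical, and a single linear layer with sigmoid outputs stands in for the neural networks and tree ensembles named above.

```python
import math

# Hypothetical label set; the study itself predicts 156 diagnostic categories.
LABELS = ["circulatory", "respiratory", "endocrine"]


def sigmoid(x: float) -> float:
    """Map a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(doc_vector, weights, biases, threshold=0.5):
    """Score every label independently and keep those at or above the threshold."""
    predicted = []
    for label, w, b in zip(LABELS, weights, biases):
        raw = sum(wi * xi for wi, xi in zip(w, doc_vector)) + b
        if sigmoid(raw) >= threshold:
            predicted.append(label)
    return predicted


# Toy document vector and hand-picked parameters (illustrative only).
doc_vector = [0.9, 0.1, 0.8, 0.0]
weights = [
    [2.0, 0.0, 1.0, 0.0],  # circulatory
    [0.0, 3.0, 0.0, 0.0],  # respiratory
    [1.0, 0.0, 2.0, 0.0],  # endocrine
]
biases = [-1.0, -1.0, -1.0]

print(predict_labels(doc_vector, weights, biases))
```

Because each label has its own score and threshold decision, the output can contain zero, one, or several categories, which is the property that distinguishes multi-label from ordinary multi-class classification.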
Results. The proposed models demonstrated high effectiveness. The use of various text vector aggregation methods improved prediction quality. The model showed stability and clinical interpretability, supporting its applicability in real-world medical practice.
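The abstract does not state which aggregation methods were compared. As an assumption, the sketch below shows three common ways to turn a variable number of token vectors into one fixed-length document vector: mean pooling, max pooling, and their concatenation. The function names and the toy 3-dimensional vectors are hypothetical; in practice the token vectors would come from the pretrained FastText model.

```python
def mean_pool(vectors):
    """Average each dimension over all token vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors):
    """Take the largest value of each dimension over all token vectors."""
    return [max(column) for column in zip(*vectors)]


def mean_max_pool(vectors):
    """Concatenate the mean-pooled and max-pooled vectors."""
    return mean_pool(vectors) + max_pool(vectors)


# Two toy token vectors standing in for FastText embeddings of two lemmas.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

print(mean_pool(token_vectors))
print(max_pool(token_vectors))
print(mean_max_pool(token_vectors))
```

Mean pooling summarizes the whole text, max pooling keeps the strongest signal per dimension, and concatenation doubles the vector length so a downstream classifier can use both views.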
Conclusion. The developed approach to analyzing unstructured medical texts using machine learning methods is a promising tool for disease diagnosis support. Further research will focus on improving model interpretability and adapting models to diverse clinical data sources.
About the Authors
A. D. Ermak
Russia
Andrey D. Ermak – Data Analyst, Artificial Intelligence Division
Varkaus Embankment, 17, Petrozavodsk, 185910
E. A. Makarova
Russia
Elena A. Makarova – Cand. of Sci. (Technical), Head of the Artificial Intelligence Division
Varkaus Embankment, 17, Petrozavodsk, 185910
A. N. Kaftanov
Russia
Aleksey N. Kaftanov – Cand. of Sci. (Medicine), Data Analyst, Artificial Intelligence Division
Varkaus Embankment, 17, Petrozavodsk, 185910
D. V. Gavrilov
Russia
Denis V. Gavrilov – Head of the Medical Division
Varkaus Embankment, 17, Petrozavodsk, 185910
R. E. Novitsky
Russia
Roman E. Novitsky – General Manager
Varkaus Embankment, 17, Petrozavodsk, 185910
A. V. Gusev
Russia
Alexandr V. Gusev – Cand. of Sci. (Technical), Senior Research Fellow, Department of Scientific Foundations of Health Care Organization
Dobrolyubova str., 11, Moscow, 127254
Supplementary files
1. Appendix 1. Diseases and their groups used as labels for training the machine learning models
Type: Research instruments (606 KB)
For citations:
Ermak A.D., Makarova E.A., Kaftanov A.N., Gavrilov D.V., Novitsky R.E., Gusev A.V. Disease diagnosis from unstructured medical texts using machine learning techniques. National Health Care (Russia). 2025;6(4):55-63. (In Russ.) https://doi.org/10.47093/2713-069X.2025.6.4.55-63