
National Health Care (Russia)


Disease diagnosis from unstructured medical texts using machine learning techniques

https://doi.org/10.47093/2713-069X.2025.6.4.55-63

Abstract

Modern machine learning methods open new opportunities for analyzing medical texts. The use of unstructured data enables improved clinical decision support and the development of personalized patient treatment approaches.

The aim of the study was to develop an optimal algorithm for multi-label disease prediction from medical texts of selected patient treatment cases.

Materials and methods. The study used anonymized electronic medical records of 387,590 patients. Textual data were processed using lemmatization and vectorization based on a pretrained FastText model. A multi-label classification model was developed to predict 156 diagnostic categories grouped by major disease classes. Neural network architectures and decision tree ensembles were applied for model building.
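As a rough illustration of what multi-label classification means here, the sketch below encodes the diagnoses of one treatment case as a multi-hot target vector and decodes per-label scores back into label names. The label names, the 0.5 threshold, and the helper functions are hypothetical and chosen only for illustration; the abstract does not describe the authors' actual label encoding or decision rule.

```python
from typing import Dict, List, Sequence


def build_label_index(label_names: Sequence[str]) -> Dict[str, int]:
    """Map each diagnostic category name to a fixed position in the target vector."""
    return {name: i for i, name in enumerate(label_names)}


def to_multi_hot(case_labels: Sequence[str], label_index: Dict[str, int]) -> List[int]:
    """Encode the set of diagnoses for one case as a 0/1 vector (several 1s allowed)."""
    target = [0] * len(label_index)
    for name in case_labels:
        target[label_index[name]] = 1
    return target


def decode(scores: Sequence[float], label_names: Sequence[str],
           threshold: float = 0.5) -> List[str]:
    """Return every label whose score reaches the threshold (assumed value, not from the paper)."""
    return [name for name, s in zip(label_names, scores) if s >= threshold]


# Hypothetical category names; the study itself uses 156 diagnostic categories.
LABELS = ["circulatory", "respiratory", "endocrine"]
INDEX = build_label_index(LABELS)

target = to_multi_hot(["circulatory", "endocrine"], INDEX)
predicted = decode([0.9, 0.2, 0.5], LABELS)
```

Unlike single-label classification, a case can carry several positive labels at once, which is why each category gets its own independent 0/1 slot.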

Results. The proposed models demonstrated high effectiveness. The use of various text vector aggregation methods improved prediction quality. The model showed stability and clinical interpretability, supporting its applicability in real-world medical practice.
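The abstract does not say which text vector aggregation methods were compared. As a minimal sketch under that caveat, the code below shows three common ways to turn per-token embedding vectors (such as those a FastText model produces) into one fixed-length document vector: mean pooling, max pooling, and their concatenation. The function names and the toy two-dimensional vectors are assumptions for illustration only.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average each embedding dimension over all tokens."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Take the largest value of each embedding dimension over all tokens."""
    return [max(column) for column in zip(*vectors)]


def mean_max_concat(vectors: Sequence[Sequence[float]]) -> Vector:
    """Concatenate mean- and max-pooled vectors, doubling the dimensionality."""
    return mean_pool(vectors) + max_pool(vectors)


# Toy 2-dimensional token vectors standing in for real word embeddings.
token_vectors = [[1.0, 2.0], [3.0, 6.0]]

doc_mean = mean_pool(token_vectors)
doc_max = max_pool(token_vectors)
doc_concat = mean_max_concat(token_vectors)
```

Mean pooling smooths over the whole text, while max pooling keeps the strongest signal per dimension; concatenating them lets a downstream classifier use both views.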

Conclusion. The developed approach to analyzing unstructured medical texts using machine learning methods is a promising tool for disease diagnosis support. Further research will focus on improving model interpretability and adapting models to diverse clinical data sources.

About the Authors

A. D. Ermak
K-SkAI LLC
Russia

Andrey D. Ermak – Data Analyst, Artificial Intelligence Division

Varkaus Embankment, 17, Petrozavodsk, 185910



E. A. Makarova
K-SkAI LLC
Russia

Elena A. Makarova – Cand. of Sci. (Technical), Head of the Artificial Intelligence Division

Varkaus Embankment, 17, Petrozavodsk, 185910



A. N. Kaftanov
K-SkAI LLC
Russia

Aleksey N. Kaftanov – Cand. of Sci. (Medicine), Data Analyst, Artificial Intelligence Division

Varkaus Embankment, 17, Petrozavodsk, 185910



D. V. Gavrilov
K-SkAI LLC
Russia

Denis V. Gavrilov – Head of the Medical Division

Varkaus Embankment, 17, Petrozavodsk, 185910



R. E. Novitsky
K-SkAI LLC
Russia

Roman E. Novitsky – General Manager

Varkaus Embankment, 17, Petrozavodsk, 185910



A. V. Gusev
Russian Research Institute of Health
Russia

Alexandr V. Gusev – Cand. of Sci. (Technical), Senior Research Fellow, Department of Scientific Foundations of Health Care Organization

Dobrolyubova str., 11, Moscow, 127254




Supplementary files

1. Appendix 1. Diseases and their groups used as labels for training the machine learning models


For citations:


Ermak A.D., Makarova E.A., Kaftanov A.N., Gavrilov D.V., Novitsky R.E., Gusev A.V. Disease diagnosis from unstructured medical texts using machine learning techniques. National Health Care (Russia). 2025;6(4):55-63. (In Russ.) https://doi.org/10.47093/2713-069X.2025.6.4.55-63




This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2713-069X (Print)
ISSN 2713-0703 (Online)