Advances and critical assessment of machine learning techniques for prediction of docking scores

Three machine learning (ML) approaches (TensorFlow, XGBoost, and SchNetPack) are used to predict inhibitory potential, expressed as a docking score, towards SARS-CoV-2. The ML training and test sets are drawn from the ZINC15 compound database. The proposed ML models are evaluated on their prediction accuracy, screening potential, and error estimation. Prediction errors are analyzed with respect to compound size, charge, and docking score, and possible improvements to the ML predictions are discussed.


Abstract

Here we present three distinct machine learning (ML) approaches (TensorFlow, XGBoost, and SchNetPack) for docking score prediction. AutoDock Vina is used to evaluate the inhibitory potential of the ZINC15 in-vivo and in-vitro-only sets towards the SARS-CoV-2 main protease. The in-vivo set (59 884 compounds) is used for ML training (max. 80%), validation (5%), and testing (15%). The in-vitro-only set (174 014 compounds) is used to evaluate the predictive capability of the trained ML models. Contributions to the prediction error are analyzed with respect to the compounds' charge, number of atoms, and expected inhibitory potential (docking score). Methods for estimating the prediction error for new compounds are considered but, after critical assessment, rejected. Weighting the ML input with respect to the desired property (i.e., a low docking score) appears to be a promising way to improve ML performance. The proposed models significantly reduce the number of compounds of interest that need to be investigated.
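As a rough illustration of the data handling described above, the following Python sketch splits a compound set into training, validation, and test subsets at the stated 80%/5%/15% proportions and assigns larger weights to compounds with lower docking scores. It is a minimal sketch under stated assumptions, not the procedure used in this work: the function names, the random shuffling, the exponential form of the weights, and the `temperature` parameter are all hypothetical, since the abstract does not specify how the weighting is carried out.

```python
import math
import random


def split_indices(n, train_frac=0.80, val_frac=0.05, seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    The remaining fraction (here 15%) goes to the test subset.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def score_weights(scores, temperature=1.0):
    """Assumed weighting scheme: lower docking score -> larger weight.

    The exponential form is an illustrative choice only; weights are
    normalized so that their mean equals 1.
    """
    best = min(scores)
    raw = [math.exp(-(s - best) / temperature) for s in scores]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


# Split the in-vivo set (59 884 compounds) into the three subsets.
train, val, test = split_indices(59_884)

# Toy docking scores (kcal/mol, invented for illustration only).
weights = score_weights([-9.0, -7.0, -5.0])
```

Per-sample weights of this kind could, for example, be passed as sample weights when fitting a model, so that errors on low-scoring (more promising) compounds contribute more to the loss.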