For a given protein structure, the MAPSHB model predicts the probability that each hydrogen bond is a SHB. Given a classification threshold, a hydrogen bond is categorized as a SHB if its probability is greater than or equal to the threshold, and as a NHB if its probability is below it. One can therefore adjust the classification threshold to tune the prediction performance of the MAPSHB and MAPSHB-Ligand models to suit a specific research need.
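The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_hydrogen_bonds`, the bond identifiers, and the dictionary layout are hypothetical and are not part of the MAPSHB software. The default of 0.870 is the threshold recommended in this section.

```python
from typing import Dict

# Default threshold recommended in this section for both models.
DEFAULT_THRESHOLD = 0.870


def classify_hydrogen_bonds(
    probabilities: Dict[str, float],
    threshold: float = DEFAULT_THRESHOLD,
) -> Dict[str, str]:
    """Label each hydrogen bond as 'SHB' or 'NHB'.

    `probabilities` maps a (hypothetical) bond identifier to the predicted
    probability of that bond being a SHB. A bond is labeled 'SHB' when its
    probability is greater than or equal to `threshold`, otherwise 'NHB'.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie between 0 and 1")
    return {
        bond_id: ("SHB" if prob >= threshold else "NHB")
        for bond_id, prob in probabilities.items()
    }


# Made-up probabilities, for illustration only.
example = {"bond_1": 0.95, "bond_2": 0.87, "bond_3": 0.50}
labels = classify_hydrogen_bonds(example)
```

Note that a bond whose probability equals the threshold exactly (`bond_2` here) is labeled a SHB, matching the "above or equal to" rule.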
Two metrics are used to assess the performance of the MAPSHB and MAPSHB-Ligand models. One is precision, which is the proportion of true SHBs among the predicted SHBs. The other is recall, which is the proportion of all SHBs in the dataset that are correctly predicted. While larger values of both metrics correspond to better model performance, one cannot increase them simultaneously by changing the probability threshold: raising the threshold generally increases precision at the cost of recall, and lowering it does the opposite.
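To make the two definitions concrete, here is a minimal sketch that computes precision and recall at a chosen threshold. The function name `precision_recall` and the toy data are invented for illustration and are not taken from the MAPSHB datasets; returning 0.0 when a metric is undefined is a convention chosen for this sketch.

```python
from typing import Sequence, Tuple


def precision_recall(
    probabilities: Sequence[float],
    is_shb: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (precision, recall) for SHB predictions at `threshold`.

    precision = TP / (TP + FP): fraction of predicted SHBs that are true SHBs.
    recall    = TP / (TP + FN): fraction of true SHBs that are predicted.
    """
    tp = fp = fn = 0
    for prob, truth in zip(probabilities, is_shb):
        predicted = prob >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
    # Assumed convention: report 0.0 when a metric is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data, for illustration only.
toy_probs = [0.99, 0.95, 0.90, 0.60, 0.40, 0.10]
toy_truth = [True, True, False, True, False, False]

strict = precision_recall(toy_probs, toy_truth, threshold=0.93)
lenient = precision_recall(toy_probs, toy_truth, threshold=0.50)
```

On this toy data, the stricter threshold gives the higher precision and the more lenient one gives the higher recall, which is the trade-off described above.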
Let's use the MAPSHB model as an example. According to the following table, if one requires highly precise predictions of SHBs in a protein structure, a stringent threshold of 0.996 yields a precision of 95%. However, the recall is then relatively low: only 20% of the SHBs in our test dataset are identified. If one instead wants to explore all the plausible SHBs in a protein, a small threshold such as 0.062 can be chosen, as it increases the recall to 94% (at a precision of 40%). One can therefore use the data in the following table as guidance to balance the precision and recall of the MAPSHB model for a given system. Our recommended (and default) probability threshold is 0.870 for both the MAPSHB and MAPSHB-Ligand models, at which the models achieve a high precision while maintaining a relatively high recall (80% precision and 75% recall for MAPSHB; 86% precision and 80% recall for MAPSHB-Ligand).
| Probability Threshold | MAPSHB Precision | MAPSHB Recall | MAPSHB-Ligand Precision | MAPSHB-Ligand Recall |
| --- | --- | --- | --- | --- |
| 0.996 | 95% | 20% | 98% | 56% |
| 0.979 | 90% | 50% | 93% | 68% |
| 0.943 | 85% | 66% | 89% | 75% |
| 0.870 | 80% | 75% | 86% | 80% |
| 0.807 | 75% | 80% | 85% | 81% |
| 0.740 | 70% | 83% | 84% | 83% |
| 0.656 | 65% | 85% | 82% | 85% |
| 0.555 | 60% | 87% | 79% | 87% |
| 0.448 | 55% | 89% | 77% | 92% |
| 0.304 | 50% | 92% | 74% | 94% |
| 0.159 | 45% | 93% | 70% | 95% |
| 0.062 | 40% | 94% | 67% | 96% |
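One way to use the table as guidance is to pick, among the tabulated thresholds, the one with the highest recall that still meets a required precision. The sketch below transcribes the table values; the names `OperatingPoint`, `MAPSHB_POINTS`, `MAPSHB_LIGAND_POINTS`, and `pick_threshold` are hypothetical helpers and are not part of the MAPSHB software.

```python
from typing import List, NamedTuple, Optional


class OperatingPoint(NamedTuple):
    threshold: float
    precision: float  # as a fraction, e.g. 0.95 for 95%
    recall: float


# Values transcribed from the table above.
MAPSHB_POINTS: List[OperatingPoint] = [
    OperatingPoint(0.996, 0.95, 0.20), OperatingPoint(0.979, 0.90, 0.50),
    OperatingPoint(0.943, 0.85, 0.66), OperatingPoint(0.870, 0.80, 0.75),
    OperatingPoint(0.807, 0.75, 0.80), OperatingPoint(0.740, 0.70, 0.83),
    OperatingPoint(0.656, 0.65, 0.85), OperatingPoint(0.555, 0.60, 0.87),
    OperatingPoint(0.448, 0.55, 0.89), OperatingPoint(0.304, 0.50, 0.92),
    OperatingPoint(0.159, 0.45, 0.93), OperatingPoint(0.062, 0.40, 0.94),
]

MAPSHB_LIGAND_POINTS: List[OperatingPoint] = [
    OperatingPoint(0.996, 0.98, 0.56), OperatingPoint(0.979, 0.93, 0.68),
    OperatingPoint(0.943, 0.89, 0.75), OperatingPoint(0.870, 0.86, 0.80),
    OperatingPoint(0.807, 0.85, 0.81), OperatingPoint(0.740, 0.84, 0.83),
    OperatingPoint(0.656, 0.82, 0.85), OperatingPoint(0.555, 0.79, 0.87),
    OperatingPoint(0.448, 0.77, 0.92), OperatingPoint(0.304, 0.74, 0.94),
    OperatingPoint(0.159, 0.70, 0.95), OperatingPoint(0.062, 0.67, 0.96),
]


def pick_threshold(
    points: List[OperatingPoint], min_precision: float
) -> Optional[OperatingPoint]:
    """Return the tabulated point with the highest recall whose precision
    is at least `min_precision`, or None if no tabulated point qualifies."""
    eligible = [p for p in points if p.precision >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda p: p.recall)


# Requiring at least 80% precision from MAPSHB selects the 0.870 row.
choice = pick_threshold(MAPSHB_POINTS, min_precision=0.80)
```

This lookup only considers the tabulated thresholds; it does not interpolate between rows, and the reported precision and recall values are those measured on the test dataset described above.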
Further discussion of the probability threshold is provided in the following papers:
S. Zhou, Y. Liu, S. Wang and L. Wang, "Effective prediction of short hydrogen bonds in proteins via machine learning method", Sci. Rep., 12, 469 (2022).
S. Zhou, Y. Liu, S. Wang and L. Wang, "Chemical Features and Machine Learning Assisted Predictions of Protein-Ligand Short Hydrogen Bonds", Sci. Rep., accepted.