A classical QSAR paper, with fewer than 50 compounds, hundreds of descriptors, and some machine learning. They cite the CDK as a free tool to calculate descriptors, but use something else. The article compares PLS, ANN, and SVM in the typical flawed way: it does not separate the effect of the kernel (RBF) from that of the regression method itself, making the comparison rather uninformative.
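To make that point concrete: an RBF kernel is not tied to the SVM. The same kernel can drive a plain ridge regression, so a fairer comparison would hold the kernel fixed and vary only the learner. Below is a minimal pure-Python sketch of kernel ridge regression with an RBF kernel, on synthetic 1D data with made-up gamma and lambda values; nothing here comes from the paper.

```python
import math

def rbf_kernel(x1, x2, gamma=1.0):
    # Gaussian (RBF) kernel; the same kernel an SVM with RBF would use.
    return math.exp(-gamma * (x1 - x2) ** 2)

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def kernel_ridge_fit(xs, ys, gamma=1.0, lam=1e-3):
    # Solve (K + lam*I) alpha = y for the dual coefficients.
    n = len(xs)
    K = [[rbf_kernel(xs[i], xs[j], gamma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, ys)

def kernel_ridge_predict(xs_train, alpha, x, gamma=1.0):
    return sum(a * rbf_kernel(xt, x, gamma) for a, xt in zip(alpha, xs_train))

# nonlinear toy endpoint: y = sin(x)
xs = [i * 0.3 for i in range(20)]
ys = [math.sin(x) for x in xs]
alpha = kernel_ridge_fit(xs, ys)
pred = kernel_ridge_predict(xs, alpha, 1.5)
```

If the RBF-kernelized ridge model performs about as well as the RBF SVM, the credit belongs to the kernel, not to the support vector machinery.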

If I scanned the paper correctly, they use a single external test set, with LOO cross-validation to estimate the modeling method parameters. The test set compounds are picked at the extremes of the endpoint range, and no information is given on the variance of the R^2 and Q^2 statistics. BTW, these two statistics are surprisingly close to each other (for each method separately). I wonder whether that holds for all possible test sets; some bootstrapping seems in order here.
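Such a bootstrap is cheap to do. A minimal sketch with entirely made-up observed and predicted values (hypothetical numbers, not taken from the paper): resample the test set with replacement and look at the spread of R^2.

```python
import random
import statistics

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

random.seed(42)
# hypothetical observed and predicted endpoint values for a small test set
y_obs  = [5.1, 6.3, 4.8, 7.2, 5.9, 6.8, 4.2, 7.5]
y_pred = [5.3, 6.0, 5.0, 7.0, 6.2, 6.5, 4.5, 7.1]

n = len(y_obs)
boot_r2 = []
for _ in range(1000):
    idx = [random.randrange(n) for _ in range(n)]
    obs = [y_obs[i] for i in idx]
    if max(obs) == min(obs):
        continue  # skip degenerate resamples with zero variance
    boot_r2.append(r_squared(obs, [y_pred[i] for i in idx]))

mean_r2 = statistics.mean(boot_r2)
sd_r2 = statistics.stdev(boot_r2)
```

With only eight test compounds the standard deviation of R^2 is not negligible, which is exactly the kind of number missing from the paper.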

Also, stepwise MLR was used for descriptor selection, thus prior to the statistical modeling, and it seems to me PLS, ANN, and SVR were performed on this descriptor subset! Well, that makes the comparison even less relevant, as PLS does not require such prior selection. Moreover, it is known that stepwise MLR easily gets stuck in local minima, rather than finding the optimal combination of descriptors.
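The local-minimum problem of greedy selection is easy to demonstrate. In the synthetic sketch below (toy descriptors x1, x2, x3; nothing from the paper), the pair (x1, x2) explains the endpoint perfectly, but forward selection grabs the individually best descriptor x3 first and never recovers.

```python
import itertools
import random

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols_r2(X_cols, y):
    # Ordinary least squares with an intercept; returns R^2 on the fit.
    n = len(y)
    cols = [[1.0] * n] + X_cols
    k = len(cols)
    A = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)]
         for i in range(k)]
    b = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    beta = solve(A, b)
    pred = [sum(beta[j] * cols[j][t] for j in range(k)) for t in range(n)]
    mean = sum(y) / n
    ss_res = sum((y[t] - pred[t]) ** 2 for t in range(n))
    ss_tot = sum((y[t] - mean) ** 2 for t in range(n))
    return 1.0 - ss_res / ss_tot

random.seed(1)
n = 200
u = [random.gauss(0, 5) for _ in range(n)]   # large nuisance variation
v = [random.gauss(0, 1) for _ in range(n)]   # the true signal
y = v[:]
x1 = [u[i] + v[i] for i in range(n)]         # alone: weak
x2 = [u[i] - v[i] for i in range(n)]         # alone: weak; with x1: perfect
x3 = [v[i] + random.gauss(0, 1) for i in range(n)]  # alone: decent, but a dead end
feats = {"x1": x1, "x2": x2, "x3": x3}

# greedy forward selection of two descriptors, stepwise-MLR style
selected = []
for _ in range(2):
    best = max((f for f in feats if f not in selected),
               key=lambda f: ols_r2([feats[g] for g in selected + [f]], y))
    selected.append(best)
greedy_r2 = ols_r2([feats[f] for f in selected], y)

# exhaustive search over all descriptor pairs
best_pair = max(itertools.combinations(feats, 2),
                key=lambda p: ols_r2([feats[f] for f in p], y))
exhaustive_r2 = ols_r2([feats[f] for f in best_pair], y)
```

Here y equals (x1 - x2)/2 exactly, so the exhaustive pair reaches R^2 of about 1 while the greedy path stalls around 0.5: a local minimum, found greedily.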