This release contains numerous improvements, bug fixes, and new methods. The main highlights are a standalone DD-SIMCA implementation, overhauled preprocessing pipelines (with some breaking changes), and JSON/CSV interoperability with the mda.tools web-applications. The tutorial has been updated and improved accordingly.
In previous versions, DD-SIMCA was implemented via the more versatile method simca, which also supports other SIMCA implementations. While versatility is generally good, it limited what could be done with DD-SIMCA, so it was decided to implement it separately.
Method ddsimca can now be used for training, testing and applying Data Driven SIMCA models. It matches the functionality of the corresponding web-application, including all plots and figures of merit (for example estimation of beta, selectivity, etc.). It also lets you change decision boundary parameters without rebuilding the main model.
See all details in the tutorial.
The original method simca is still available (and always will be) for compatibility.
There are several new methods for preprocessing, including:
prep.spikes() for cosmic spikes removal from Raman spectra.
prep.center() for centering of data columns.
prep.scale() for scaling of data columns.
prep.emsc() for extended multiplicative scatter correction.
The following methods are considered deprecated. You can still use them (they will be kept for compatibility), but for new code it is recommended to use the alternatives:
prep.autoscale(): use prep.center() and prep.scale() instead.
prep.snv(): use prep.norm() with parameter type = "snv" as an alternative.
prep.msc(): use prep.emsc() with parameter degree = 0.
employ.prep(): use prep.apply() instead.
In addition to that, the possibility to combine the preprocessing methods together into a preprocessing chain (we will call it a preprocessing model) has been improved. However, these improvements cause breaking changes, so if you used this feature before, check the text below and the updated user guides very carefully.
First of all, the syntax for creating preprocessing items for combining them to the chain has changed — parameters are now passed as named arguments instead of a list:
Old syntax: prep("savgol", list(width = 7, porder = 2, dorder = 2))
New syntax: prep("savgol", width = 7, porder = 2, dorder = 2)
Also, the option to add user-defined preprocessing methods into a preprocessing model has been removed, as it caused issues and unstable behavior in some cases. From this version, only selected methods can be combined together into a preprocessing model. You can see a full list of the currently supported methods by running prep.list(). This list will be extended eventually. You can still use user-defined methods, and methods which are not in the list, separately.
Second, a preprocessing model can now be integrated directly into pca, pls, simca, ddsimca, and plsda via the prep parameter — the model will train the preprocessing pipeline and apply it automatically during calibration and prediction.
Finally, preprocessing methods can also be trained independently using prep.fit(), which pre-computes parameters that depend on the training set (e.g. values for centering, scaling, or the reference spectrum for EMSC). Applying the trained preprocessing model to new data is done with prep.apply().
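The idea of training a preprocessing step on the calibration set and reusing the stored parameters for new data can be sketched in base R. The helpers fit_center_scale() and apply_center_scale() below are hypothetical names introduced only for illustration. They are not prep.fit() or prep.apply(), whose actual arguments are described in the package documentation.

```r
# Hypothetical helpers (not part of mdatools): learn centering and scaling
# values from a training set, then reuse them for any new data.
fit_center_scale <- function(x) {
  list(center = colMeans(x), scale = apply(x, 2, sd))
}

apply_center_scale <- function(params, x) {
  # subtract training means, then divide by training standard deviations
  x <- sweep(x, 2, params$center, "-")
  sweep(x, 2, params$scale, "/")
}

set.seed(42)
x_train <- matrix(rnorm(50, mean = 10, sd = 2), nrow = 10, ncol = 5)
x_test <- matrix(rnorm(20, mean = 10, sd = 2), nrow = 4, ncol = 5)

params <- fit_center_scale(x_train)
x_train_pp <- apply_center_scale(params, x_train)
x_test_pp <- apply_center_scale(params, x_test)

# training columns have zero mean and unit standard deviation by construction;
# test columns generally do not, because training values were used for them
print(round(colMeans(x_train_pp), 10))
print(round(apply(x_train_pp, 2, sd), 10))
```

The point of the split into two steps is that the test set never influences the stored parameters.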
Please check the updated documentation for all details and examples.
Models created with pca, pls, and ddsimca can now be exported to JSON using writeJSON() and imported from JSON using readJSON() method. This enables round-trip interoperability with the corresponding mda.tools web-applications — you can build a model in R, upload it to a browser for interactive use, or develop a model in a web-app and load it into R for predictions.
When a model includes a preprocessing pipeline (via the prep parameter), the pipeline is saved as part of the JSON file and applied automatically on import.
Result objects from pca, pls, and ddsimca now also have a writeCSV() method that exports main outcomes in a format identical to the one produced by the web-applications.
method prep.alsbasecorr() is now several times faster, thanks to spam package (which replaced the previously used Matrix).
method plotResiduals() for all models and results objects where it was available now is named plotDistances(). The old name, however, will work as well to ensure compatibility with old code.
colorbar legend for continuous color grouping now uses pretty breakpoints with consistent decimal formatting; very large or very small values are shown with a compact multiplier (e.g. ×10³).
new method plotEigenvalues() for PCA models shows eigenvalues vs. number of components with optional transform parameter ("none", "log", "sqrt").
pls, which could give wrong y-scores when data contains outliers.
prep.norm() with type = "pqn") not storing reference spectrum when used in a preprocessing pipeline: test data was normalized against its own mean instead of the training mean.
type = "sd".
prep.savgol() now requires polynomial degree between 1 and 4 (was 0–4).
"e" in mdaplotg().
plotBars() being proportional to the x-position instead of constant.
lty parameter being ignored for plot type "b" (scatter-line) in mdaplot().
plotErrorbars() ignoring the col parameter for error bar segments.
Added cv.scope parameter for PLS, PLS-DA and iPLS methods. The parameter sets the scope for center/scale operations inside the cross-validation loop. With "global", centering and scaling are done using globally computed means and standard deviations; with "local", they are done using means and standard deviations computed for each local calibration set. In other words, with the global scope all local cross-validation models share the center of the global model, while with the local scope each local model has its own center in the variable space. The default value is "local", as it was before, so this change will not break your previous code.
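The difference between the two scopes can be illustrated with a small base R sketch. It only mimics the centering step for each cross-validation segment and is not the package's internal code. All variable names are made up for the example.

```r
# Base R illustration only: how centering differs between the "global"
# and "local" scopes described for cv.scope.
set.seed(1)
x <- matrix(rnorm(12 * 3, mean = 5), nrow = 12, ncol = 3)
cv_ind <- rep_len(1:4, nrow(x))      # four venetian-blinds segments

global_center <- colMeans(x)         # computed once from all rows

for (k in 1:4) {
  x_cal <- x[cv_ind != k, , drop = FALSE]   # local calibration set
  x_val <- x[cv_ind == k, , drop = FALSE]   # held-out segment

  # "global": every local model uses the same center
  x_val_global <- sweep(x_val, 2, global_center, "-")

  # "local": the center is recomputed from the local calibration set
  local_center <- colMeans(x_cal)
  x_val_local <- sweep(x_val, 2, local_center, "-")

  cat(sprintf("segment %d: max shift between centers = %.4f\n",
              k, max(abs(global_center - local_center))))
}
```

In the package itself the choice is made with the cv.scope parameter described above.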
Fixed several minor bugs (#111, #112, #114) and added small updates and improvements to documentation.
The changes are relatively small, but some of them can be potentially breaking, hence the version is bumped up to 0.14.0.
Procrustes cross-validation method, pcv(), has been recently improved and extended. It was decided to move it to a separate dedicated R package, pcv. Check GitHub repo for details. The documentation chapter has been updated accordingly.
Fixed a bug related to generating segment indices for Venetian blinds cross-validation for regression. In case of regression, the indices are generated by taking into account the order of the response values. There was a small bug in this implementation, which is now fixed. Remember that you can always provide a manually generated vector of segment indices as the value of the cv argument.
Made small changes in prep.alsbasecorr() to meet new requirements of the Matrix package. So if you saw a warning message from this package in the last couple of months, this update will fix it.
fixed bug #109
small improvements in documentation.
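For the Venetian blinds item above, the following base R sketch shows one way to build segment indices that follow the order of the response values. It is an illustration under that assumption, not the package's exact implementation.

```r
# Sketch of venetian-blinds segment indices that follow the order of the
# response values (illustration only).
y <- c(5.1, 0.3, 2.7, 9.8, 1.2, 7.4, 3.3, 6.0)
nseg <- 4

cv_ind <- integer(length(y))
# walk through the rows from the smallest to the largest response and
# assign segment numbers 1, 2, ..., nseg, 1, 2, ... along the way
cv_ind[order(y)] <- rep_len(seq_len(nseg), length(y))

print(cv_ind)
# each segment receives both lower and higher response values
print(tapply(y, cv_ind, range))
```

A vector like cv_ind is the kind of manually generated vector of segment indices that can be passed as the value of the cv argument.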
fixed a bug in method getRegcoeffs(), which did not work correctly with regression models created without scaling or centering.
ipls() got a new logical parameter, full. If full = TRUE the procedure will continue even if no improvement is observed, until the maximum number of iterations is reached. Use it with caution, check the tutorial.
Small fixes and improvements.
This release brings an updated implementation of the PLS algorithm (SIMPLS), which is more numerically stable and gives significantly fewer warnings about using too many components when you work with small y-values. The speed of the pls() method in general has also been improved.
Another important change is that cross-validation of regression and classification models has been rewritten in a simpler way, and you can now also use your own custom splits by providing a vector with segment indices associated with each measurement. For example, if you run PLS with parameter cv = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2), it is assumed that you want to use a venetian blinds split with four segments and that your dataset has 10 measurements. See more details in the tutorial, where the description of the cross-validation procedure has been moved to a separate section.
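The base R snippet below shows how the vector from the example maps measurements to segments. It does not call the package and only makes the meaning of the indices explicit.

```r
# The vector from the text: 10 measurements, 4 venetian-blinds segments
cv_ind <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2)

# the same pattern can be generated for any number of rows and segments
cv_generated <- rep_len(1:4, 10)
print(all(cv_generated == cv_ind))

# rows that are held out when each segment is used for validation
held_out <- split(seq_along(cv_ind), cv_ind)
print(held_out)
```

Any other assignment of rows to segments (for example, one that keeps replicates of the same sample together) can be written the same way, as a vector with one segment index per measurement.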
Other changes and improvements:
Refactoring and improvements of prep.savgol() code made the method much faster (up to 50-60 times faster for datasets with many measurements).
Refactoring and improvements of prep.alsbasecorr() code made the method 2-3 times faster especially for large datasets.
added new plotting method plotRMSERatio() for regression models (inspired by this post by Barry M. Wise)
added PQN normalization method to prep.norm() function.
fixed a bug in vipscores() which could lead to a bit higher values for PLS2 models.
fixes to several small bugs and general improvements.
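As a rough illustration of the PQN normalization mentioned in the list above, here is a simplified base R sketch. The function names pqn_fit() and pqn_apply() are hypothetical, the reference spectrum is assumed to be the mean of the training spectra, and the preliminary normalization step that is often part of PQN is left out.

```r
# Simplified sketch of probabilistic quotient normalization (PQN);
# an illustration, not the package's code. The reference spectrum is
# computed from the training spectra and reused for new data.
pqn_fit <- function(x_train) colMeans(x_train)

pqn_apply <- function(x, ref) {
  quotients <- sweep(x, 2, ref, "/")        # spectrum / reference
  dilution <- apply(quotients, 1, median)   # most probable quotient per row
  x / dilution                              # row-wise rescaling
}

base <- c(1, 4, 2, 8, 3)
x_train <- rbind(base, 2 * base, 3 * base)  # same shape, different dilution
x_test <- rbind(0.5 * base)

ref <- pqn_fit(x_train)
print(pqn_apply(x_train, ref))
print(pqn_apply(x_test, ref))
```

Because the rows differ only by a multiplicative factor, all of them end up identical to the reference after normalization.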
This release is mostly about preprocessing: some new methods have been added, the existing ones have been improved, and it is now possible to combine preprocessing methods together (including parameter values) and apply them all in the correct sequence. See the preprocessing section in the tutorials for details.
method prep.norm() for normalization of spectra (or any other signals) is more versatile now and supports normalization to unit sum, length, area, to height or area under internal standard peak, and SNV. SNV via prep.snv() is still supported for compatibility.
prep.savgol() has been rewritten to fix a minor bug when first derivative was inverted, but also to make the handling of the edge points better. See details in help text for the function and in the tutorial.
added a new method prep.transform() which can be used for transformation of values of e.g. response variable to handle non-linearity.
added a new method prep.varsel() which makes it possible to select particular variables as a part of the preprocessing framework. For example, you can apply baseline correction, normalization and noise suppression to the whole spectra and after that select only a particular part for modelling.
added new method prep() which lets you combine several preprocessing methods and their parameters into a list and use it, e.g., as a part of a model.
fixed a bug in mcrals() which in rare occasions could lead to a wrong error message.
fixed a bug when attribute yaxis.value was used as ylab when creating line and bar plots.
fixed an earlier reported issue with plotXYResiduals (#100)
function employ() which was used to employ constraints in MCR-ALS has been renamed to employ.constraint(). The function is for internal use and this change should not give any issues in your code.
the user guides have been revised and improved.
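Some of the normalization types listed for prep.norm() above can be written out in a few lines of base R. This is an illustration of the underlying arithmetic only. How the package handles details such as negative values is not shown here.

```r
# Row-wise normalization sketches in base R (illustration only).
# The two rows differ only by a multiplicative factor of 2.
x <- rbind(c(1, 2, 3, 4), c(2, 4, 6, 8))

norm_sum <- x / rowSums(x)                    # unit sum
norm_length <- x / sqrt(rowSums(x^2))         # unit Euclidean length
snv <- (x - rowMeans(x)) / apply(x, 1, sd)    # standard normal variate

print(rowSums(norm_sum))
print(sqrt(rowSums(norm_length^2)))
print(snv)
```

After any of the three normalizations the two rows become identical, which is the effect normalization is meant to have on spectra that differ only in overall intensity.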
Machine$longdouble.eps), which led to an error when the package is tested on Apple M1.
added possibility for providing partially known contributions (parameter cont.forced) or spectral values (parameter spec.forced) to mcrals(). See more in help text and user guide for the package.
added possibility to run iPLS using test set (parameters x.test and y.test) instead of cross-validation.
added a possibility to provide user defined indices of the purest variables in mcrpure() instead of detecting them automatically.
fixed bug #98, which caused a drop of row names when data frame was used as a data source for PCA/SIMCA.
fixed bug #99, which did not allow to use user defined indices of pure variables in mcrpure().
added Procrustes Cross-Validation method, pcv() (it is also available as a separate project).
added Kubelka-Munk transformation for diffuse reflectance spectra (prep.ref2km()).
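For reference, the Kubelka-Munk function for a diffuse reflectance value R is (1 - R)^2 / (2R). The base R sketch below applies it to a few values. The helper name ref2km_sketch() is hypothetical and is used here only to avoid confusion with prep.ref2km().

```r
# Kubelka-Munk function for diffuse reflectance values R in (0, 1]
# (hypothetical helper name, shown for illustration)
ref2km_sketch <- function(r) {
  stopifnot(all(r > 0 & r <= 1))
  (1 - r)^2 / (2 * r)
}

r <- matrix(c(0.2, 0.5, 0.8, 1.0), nrow = 1)
km <- ref2km_sketch(r)
print(km)
```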
fixed bug #94 which caused wrong limits in PCA distance plot when outliers are present but excluded.
fixed bug #95 which led to issues when PLS regression methods (e.g. plotRMSE()) are used for PLS-DA model object.
added additional check that parameter cgroup for plotting functions is provided as a vector or as a factor to avoid confusion.
added link to YouTube channel with Chemometric course based on mdatools package.
fixed an issue which led to a bug in simcam.getPerformanceStats, returning implausible and asymmetrical results (thanks to @svonallmen).
fixed a small issue sometimes giving warning when running tests on CRAN (did not influence the user experience though).
mcrpure() method has been modified to avoid potential issues with original patented version.
added new method, mcrals(), implementing multivariate curve resolution based on the alternating least squares. The method uses one of the three solvers (OLS, NNLS, FC-NNLS) together with several basic constraints (non-negativity, normalization, closure, etc.). It is also possible to create and use user-defined constraints as well as combine them with the implemented ones.
added new method, mcrpure(), implementing multivariate curve resolution based on the purity approach (also known as SIMPLISMA).
added a new preprocessing method, prep.alsbasecorr(), implementing baseline correction with asymmetric least squares. It preserves all important data arguments similar to other preprocessing methods.
added a new dataset, carbs, with Raman spectra of ribose, glucose and fructose and simulated spectra of their mixtures. The dataset aims at testing and trying the curve resolution methods.
fixed bug #88, which appears when the initial number of components in a PLS model is too large. From v. 0.10.3, in this case the algorithm warns the user and reduces the maximum number of components automatically. But if cross-validation is used, sometimes this number should be even smaller for a local cross-validation model (because the local calibration subset has fewer observations). In this case the pls() method will raise an error and ask the user to limit the maximum number of components and run the model again.
main model methods (pls(), pca(), etc.) now perform an additional check for the consistency of the provided datasets.
opacity option in plots.
Fixed bug #85 when using y-values as data frame gave an error in PLS regression
Fixed bug #86 and changed the way PLS limits maximum number of components to avoid problems with singular matrices. Now if PLS algorithm finds during calculations that provided number of components is too large, it gives a warning and reduces this number.
Code refactoring and tests for preprocessing methods
categorize.pls() method, which could give wrong results for test set measurements (see issue #82).
Small improvements to plotExtreme.pca() so user can specify additional parameters, such as, for example cex. If plot is made for several components, you can now specify just one value for all points (e.g. color of points or marker symbol).
Parameter show.limits in methods plotResiduals.pca(), plotXResiduals.pls(), plotXYResiduals.pls() can now take two logical values — first for extreme limit and second for outlier limit. So, you can show only one of the two limits on the plot. If one value is specified it will be taken for both limits.
New function plotHotellingEllipse() adds Hotelling T² ellipse to any scatter plot (of course it is made first of all for PCA and PLS scores plots). The function works similar to plotConvexHull() and plotConfidenceEllipse(), see help for examples.
Fixed a bug in summary() method for PLS, which worked incorrectly in case of several response variables (PLS2).
Many changes have been made in this version, but most of them are under the hood. Code has been refactored significantly in order to improve its efficiency and make future support easier. Some functionality has been rewritten from scratch. Most of the code is backward compatible, which means your old scripts should run with this version without problems. However, some changes are incompatible and this can lead to occasional errors and warning messages. All details are shown below; pay special attention to the breaking changes part.
Another important thing is the way cross-validation works starting from this version. It was decided to use cross-validation only for computing performance statistics, e.g. error of predictions in PLS or classification error in SIMCA or PLS-DA. Decomposition results, such as explained variance or residual distances are not computed for cross-validation anymore. It was a bad idea from the beginning, as the way it has been implemented is not fully correct — distances and variances measured for different local models should not be compared directly. After a long consideration it was decided to implement this part in a more correct and conservative way.
Finally, all model results (calibration, cross-validation and test set validation) are now combined into a single list, model$res. This makes a lot of things easier. However, the old way of accessing the result objects (e.g. model$calres or model$cvres) still works, you can access e.g. calibration results both using model$res$cal and model$calres, so this change will not break the compatibility.
Below is more detailed list of changes. The tutorial has been updated accordingly.
Here are changes which can potentially lead to error messages in previously written code.
Cross-validation results are no longer available for PCA (as mentioned above), so any use of the model$cvres object for a PCA model will lead to an error. For the same reason pca() does not take the cv parameter anymore.
Method plotModellingPower() is no longer available (was used for SIMCA models).
Method plotResiduals() is no longer available for SIMCAM models (multiclass SIMCA), use the corresponding method for individual models instead.
Selectivity ratio and VIP scores are not a part of the PLS model anymore. This is done to make the calibration of models faster. Use selratio() and vipscores() to compute them. Functions plotSelectivityRatio() and plotVIPScores() are still available, but they both compute the values first, which may take a bit of time on large datasets. This change makes parameter light superfluous and it is no longer supported in pls().
The other two parameters which are no longer needed when you use pls() are coeffs.ci and coeffs.alpha. Jack-Knifing based confidence intervals for regression coefficients are now computed automatically every time you use cross-validation. You can specify the significance level for the intervals when you either visualize them using plot.regcoeffs() or plotRegcoeffs() for a PLS model, or when you get the values by using getRegcoeffs().
When you make a prediction plot for any classification model, you should specify the name of the result object to show the predictions for. In old versions the names of the results were "calres", "cvres", "testres". From this version they have been changed to "cal", "cv" and "test" correspondingly.
In PLS-DA there was a possibility to show predictions not for classification results but for regression model the PLS-DA is built upon using the following code: plotPredictions(structure(model, class = "pls")). From this version you should use plotPredictions(structure(model, class = "regmodel")) instead, as the plotPredictions() function for regression has been moved from pls class to its parent, more general class, regmodel.
In methods plotCorr() and plotHist() for randomization test, parameter comp has been renamed to ncomp. Parameter comp assumes a possibility to specify several values as a vector, while ncomp assumes only one value, which is the case for these two plots.
In regression coefficients plot logical parameter show.line has been replaced with more general show.lines from mdaplot().
plotPredictions() method for models and results is now based on mdaplot (not mdaplotg() as before) and does not support arguments for e.g. legend position, etc.
build:passed on badge in GitHub
mdaplot() now returns object with plot data (plotseries class), which can be used for extra options (e.g. adding convex hull).
colmap="old" if you don't like it).
plotConvexHull() adds convex hull for groups of points on any scatter plot.
plotConfidenceEllipse() adds confidence ellipse for groups of points on any scatter plot.
opacity can now be used with mdaplotg() plots and be different for each group.
mdaplot() and mdaplotg() based plots now can take parameters grid.col and grid.lwd for tuning the grid look.
pch=21...25 using col and bg parameters.
type="d") is now based on hexagonal binning - very fast for large data (>100 000 rows).
mdaplotyy() to create a line plot for two line series with separate y-axis for each.
As mentioned above, the biggest change which can potentially lead to some issues with your old code is that cross-validation is no longer available for PCA models.
Other changes:
lim.type parameter is "ddmoments" (before it was "jm"). This changes default method for computing critical limits for orthogonal and score distances.
setResLimits() is renamed to setDistanceLimits() and has an extra parameter, lim.type, which allows to change the method for critical limits calculation without rebuilding the PCA model itself.
summary() of PCA model including DoF for distances (Nh and Nq).
plotExtreme() is now also available for PCA model (was used only for SIMCA models before).
categorize() allowing to categorize data rows as "regular", "extreme" or "outliers" based on residual distances and corresponding critical limits.
plotResiduals.simcam() and plotResiduals.simcamres() are not available anymore (both were a shortcut for plotResiduals.simca() which was superfluous).
confint() which returns confidence interval (if corresponding statistics are available).
show.line is replaced with show.lines from mdaplot()).
As mentioned above, the PLS calibration has been simplified, thus selectivity ratio and VIP scores are not computed automatically when a PLS model is created. This makes the calibration faster and makes parameter light unnecessary (removed). Also, Jack-Knifing is used every time you apply cross-validation, so there is no need to specify parameters coeffs.alpha and coeffs.ci anymore (both parameters have been removed). It does not lead to any additional computational time and therefore it was decided to do it automatically.
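To make the Jack-Knifing idea more concrete, here is a base R sketch that estimates confidence intervals for regression coefficients from the local models of a cross-validation loop. It uses ordinary least squares instead of PLS and one common formulation of the jack-knife standard error, so it should be read as an illustration of the principle rather than as the package's implementation.

```r
# Jack-knife confidence intervals from cross-validation local models
# (OLS is used instead of PLS to keep the sketch short).
set.seed(3)
n <- 40
x <- cbind(rnorm(n), rnorm(n))
y <- 2 * x[, 1] - 1 * x[, 2] + rnorm(n, sd = 0.1)
cv_ind <- rep_len(1:8, n)
nseg <- max(cv_ind)

# coefficients of each local model (one segment left out at a time)
b_local <- sapply(seq_len(nseg), function(k) {
  keep <- cv_ind != k
  coef(lm(y[keep] ~ x[keep, ] - 1))
})

b_mean <- rowMeans(b_local)
# jack-knife standard error of the coefficients
se <- sqrt((nseg - 1) / nseg * rowSums((b_local - b_mean)^2))

alpha <- 0.05   # significance level for the intervals
t_crit <- qt(1 - alpha / 2, df = nseg - 1)
ci <- cbind(lower = b_mean - t_crit * se, upper = b_mean + t_crit * se)
print(ci)
```

Since the local coefficients are a by-product of cross-validation, computing the intervals adds almost no extra cost, which matches the reasoning given above.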
Other changes are listed below:
summary() output has been slightly improved.
plotWeights() for creating plot with PLS weights.
selratio().
getSelectivityRatio() is deprecated and shows warning (use selratio() instead).
plotSelectivityRatio() computes the ratio values first, which makes it a bit slower.
vipscores().
getVIPScores() is deprecated and shows warning (use vipscores() instead).
plotVIPScores() computes the score values first, which makes it a bit slower.
"ven") now takes into account the order of response values, so there is no need to order data rows in advance.
lim.type parameter (default value "ddsimca"). X-residuals plot show the limits.
plotXYResiduals() showing distance/residuals plot for both X (full distance) and Y.
categorize() allowing to categorize data rows based on PLS results and critical limits computed for X- and Y-distance.
regres methods
regres methods
cv)
Y cumexpvar
cex parameter for group plots (can be specified differently for each group)
cex is specified it will be also applied for legend items
max.cov in prep.autoscale() (#59)
ipls() method plus fixed a bug preventing breaking the selection loop (#56)
selectCompNum() related to use of Wold criterion (#57)
max.cov parameter in prep.autoscale() (#58)
max.cov value in prep.autoscale() is set to 0 (to avoid scaling only of constant variables)
prep.autoscale()
opacity parameter for semi-transparent colors
plotExtreme() method for SIMCA models
setResLimits() method for PCA/SIMCA models
plotProbabilities() method for SIMCA results
getConfusionMatrix() method for classification results
plotPrediction() for PLS results
plotPrediction() for PLS results
pls.getRegCoeffs() now also returns standard error and confidence intervals calculated for unstandardized variables
summary() for object with regression coefficients (regcoeffs)
mdaplot for data frame with one or more factor columns, the factors are now transformed to dummy variables (before it led to an error)
mdaplots when using factor with more than 8 levels for color grouping led to an error
pca with wrong calculation of eigenvalues in NIPALS algorithm
lab.cex and lab.col now are also applied to colorbar labels
docs folder
mdaplot() and mdaplotg() were rewritten completely and now are easier to use (check tutorial)
'd') for density scatter plot
xlas and ylas in plots to rotate axis ticks
plotBiplot())
cgroup) if there is no test set
prep.autoscale() now does not scale columns with coefficient of variation below given threshold
prep.norm)
getRegcoeffs was added to PLS model
cgroup for plots now can work with factors correctly (including ones with text levels)
lab.col and lab.cex for changing color and font size for data point labels
?randtest
?crossval
roxygen2 package
classres class for representation and visualisation of classification results
xticklabels and yticklabels to mdaplot and mdaplotg functions
simca and simcares classes for one-class SIMCA model and results
simcam and simcamres classes for multiclass SIMCA model and results
plsda and plsdares classes for PLS-DA model and results
selectNumComp(model, ncomp) instead of pls.selectncomp(model, ncomp), test.x and test.y instead of Xt and yt, finally separate logical arguments center and scale are used instead of previously used autoscale. By default scale = F and center = T.
?pls)
mdaplot or mdaplotg functions, which extend basic functionality of R plots. For example, they allow to make color groups and colorbar legend, calculate limits automatically depending on elements on a plot, make automatic legend and many other things.