We give a demonstration of the capabilities of liquidSVM from an R viewpoint. More detailed information can be found in the documentation and in the help (e.g. `?svm`

).

Disclaimer:liquidSVM and the R-bindings are in general quite stable and well tested by several people. However, use in production is at your own risk.If you run into problems please check first the documentation for more details, or report the bug to the maintainer.

To install and load the library, start an R session and type

```
install.packages("liquidSVM",repos="http://www.isa.uni-stuttgart.de/software/R")
library(liquidSVM)
```

**LS-Regression:**

```
# Load test and training data
reg <- liquidData('reg-1d')
```

Now `reg$train`

contains the training data and `reg$test`

the testing data. Both have the labels in its first column `Y`

and the feature is in column `X1`

. To train on the data and select the best hyperparameters do (`display=1`

gives some information as the training progresses)

`model <- svm(Y~., reg$train)`

Now you can test with any test set:

```
result <- test(model, reg$test)
errors(result)
#> 0.00541
```

We also can plot the regression:

```
plot(reg$train$X1, reg$train$Y,pch='.', ylim=c(-.2,.8), ylab='', xlab='', axes=F)
curve(predict(model, x),add=T,col='red')
```

As a convenience, since `reg`

already contains `$train`

and `$test`

you can do the whole experiment in one line. Then the result is stored in `model$last_result`

:

```
model <- svm(Y~., reg, display=1)
errors(model$last_result)[1]
#> 0.00541
```

**Multi-class:**

```
banana <- liquidData('banana-mc')
banana
#> Smldata "banana-mc" with 4000 train samples and 4000 test samples
#> having 3 columns named Y,X1,X2
#> target "Y" factor with 4 levels: 1 (1200 samples) 2 (1200 samples) 3 (800 samples) ...
```

Since `banana$train$Y`

is a factor the following performs multi-class classification

```
model <- svm(Y~., banana$train)
plot(banana$train$X1, banana$train$X2,pch='o', col=banana$train$Y, ylab='', xlab='', axes=F)
x <- seq(-1,1,.05)
z <- matrix(predict(model,expand.grid(x,x)),length(x))
contour(x,x,z, add=T, levels=1:4,col=1,lwd=4)
```

In this case `errors(...)`

shows both the global miss-classification error as well the errors of the underlying binary tasks, for more details see Multiclass classification:

```
errors(test(model,banana$test))
#> result 1vs2 1vs3 1vs4 2vs3 2vs4 3vs4
#> 0.218000 0.141250 0.112500 0.095500 0.075500 0.075000 0.000625
```

**Cells:** if data gets too big for the memory on your machine:

```
covtype <- liquidData('covtype.5000')
model <- svm(Y~., covtype, display=1, useCells=TRUE)
errors(model$last_result)
#> 0.192
```

Download and installation can be done from a running R session by

`install.packages("liquidSVM", repos="http://www.isa.uni-stuttgart.de/software/R")`

If you have CUDA installed e.g. on path `/usr/local/cuda`

you can activate GPU support at installation by

```
install.packages('liquidSVM',configure.args="native /usr/local/cuda",
repos="http://www.isa.uni-stuttgart.de/software/R")
```

At the moment this is not tested on Windows!

Once the package is installed you can open the very vignette you are reading by issuing:

`vignette('demo',package='liquidSVM')`

After Installation you have to load the package

`library(liquidSVM)`

To get started we consider two problems using R-builtin datasets:

Regression with the

`trees`

data set (31 samples, 3 variables) and we use the formulaThis means that`Height ~ Girth + Volume`

`Height`

of trees should be explained by both their`Girth`

and`Volume`

.Multi-classification with the

`iris`

data set (150 samples, 5 variables):`Species ~ .`

The factor

`Species`

has 3 levels`setosa, versicolor`

, and`virginica`

. Since we are explaining a factor this yields by default a multi-classification problem. Note that the dot on the right hand side means that all the other variables of`iris`

are to be used to explain. Hence this is equivalent to:`Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width`

For the most simple out of the box training use `svm(...)`

. If nothing else is specified then the explanatory variable determines what kind of problem is getting addressed:

```
modelTrees <- svm(Height ~ Girth + Volume, trees) # least squares
modelIris <- svm(Species ~ ., iris) # multiclass classification
#> Warning in (function (model, command.args = NULL, ..., d = NULL) : Solution
#> may not be optimal: try training again using max_gamma=25
```

This trains a model and can take some time for bigger data sets. It issues 5-fold cross-validation on a grid of 10 bandwidth parameters gamma and 10 cost parameters lambda. Also the best parameters are selected by default.

Hence after this one can calculate predictions for the model as usual in `R`

:

```
predict(modelTrees, trees[21:31, ])
#> [1] 78.8 77.4 76.1 74.3 76.3 80.9 81.1 81.2 79.7 79.4 86.3
predict(modelIris, iris[3:8, ])
#> [1] setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
```

The actual values that liquidSVM produces can vary on several accounts: There are random components in the solving process and also different architectures yield different optimization strategies and hence different near-to-optimal solutions. This holds for

all the outputsin this document.

The training on the full iris data gives a perfect fit:

```
all.equal( predict(modelIris, iris[3:8, ]) , iris$Species[3:8] )
#> [1] TRUE
```

The better approach to machine learning is to split the samples into a training set and a test set. Here we split earthquakes on fiji (1000 seismic events, 5 numerical values) in a training set of size

`qu <- ttsplit(quakes)`

Then we learn on the training set:

`model <- svm(mag ~ ., qu$train)`

and test on the testing set:

```
result <- test(model, qu$test)
errors(result)
#> val_error
#> 0.03
```

Since `qu`

already has both a training and a testing data set, we can do train/select and test in one invocation (and give more information using `display=1`

)

```
model <- lsSVM(mag ~ . , qu, display=1)
#> [...]
#> Warning: The best gamma was 0 times at the lower boundary and 5 times at the
#> upper boundary of your gamma grid. 5 times a gamma value was selected.
#> [...]
errors(model$last_result)
#> val_error
#> 0.109
```

This error feels to big, but there is some information that the gamma-grid is to small. Hence we use bigger and bigger grids:

```
errors(lsSVM(mag ~ . , qu, max_gamma=100)$last_result)
#> val_error
#> 0.0367
```

and this is much better.

Solutions can be saved as in the following examples (taken from the examples of `read.liquidSVM`

):

```
banana <- liquidData('banana-bc')
modelOrig <- mcSVM(Y~., banana$train)
write.liquidSVM(modelOrig, "banana-bc.fsol")
write.liquidSVM(modelOrig, "banana-bc.sol")
clean(modelOrig) # delete the SVM object
# now we read it back from the file
modelRead <- read.liquidSVM("banana-bc.fsol")
# No need to train/select the data!
errors(test(modelRead, banana$test))
# to read the model where no data was saved we have to make sure, we get the same training data:
banana <- liquidData('banana-bc')
# then we can read it
modelDataExternal <- read.liquidSVM("banana-bc.sol", Y~., banana$train)
result <- test(modelDataExternal, banana$test)
# to serialize an object use:
banana <- liquidData('banana-bc')
modelOrig <- mcSVM(Y~., banana$train)
# we serialize it into a raw vector
obj <- serialize.liquidSVM(modelOrig)
clean(modelOrig) # delete the SVM object
# now we unserialize it from that raw vector
modelUnserialized <- unserialize.liquidSVM(obj)
errors(test(modelUnserialized, banana$test))
```

A major issue with SVMs is that for larger sample sizes the kernel matrix does not fit into the memory any more. Classically this gives an upper limit for the class of problems that traditional SVMs can handle without significant runtime increase. The concept of cells makes it possible to circumvent these issues.

If you specify `useCells=TRUE`

then the sample space \(X\) gets partitioned into a number of cells. The training is done first for cell 1 then for cell 2 and so on. Now, to predict the label for a value \(x\in X\) liquidSVM first finds out to which cell this \(x\) belongs and then uses the SVM of that cell to predict a label for it:

```
banana <- liquidData('banana-bc')
model <- mcSVM(Y~.,banana$train, voronoi=c(4,500))
centers <- getCover(model)$samples
plot(banana$train[,2:3],col=banana$train$Y)
points(centers,pch='x',cex=2,col=3)
if(require(deldir)){
voronoi <- deldir::deldir(centers$X1,centers$X2,rw=c(range(banana$train$X1),range(banana$train$X2)))
plot(voronoi,wlines="tess",add=TRUE, lty=1)
text(centers$X1,centers$X2,1:nrow(centers),pos=1)
}
```

We first consider a medium size sample of the `covtype`

data set. `liquidData`

will download this from http://www.isa.uni-stuttgart.de/liquidData/:

```
co <- liquidData('covtype.10000')
system.time(svm(Y~., co$train, threads=3))
#> user system elapsed
#> 28.208 0.124 11.191
```

The user time is about three times the elapsed time since we are using 3 threads.

By using the partitioning facility of liquidSVM you can even bigger problems:

```
co <- liquidData('covtype.50000')
system.time(svm(Y~.,co$train,useCells=TRUE,threads=3))
#> user system elapsed
#> 252.395 1.076 98.119
```

Note that with this data set `useCells=F`

here only works if your system has enough free memory (~26GB).

Even the full `covtype`

data set with over 460’000 rows (about 110’000 samples retained for testing) is now treatable in under 7 minutes from within R:

```
co <- liquidData('covtype-full')
system.time(svm(Y~.,co$train,useCells=TRUE,threads=3))
#> user system elapsed
#> 1383.535 4.752 397.559
```

If you have less than 10GB of RAM use `store_solutions_internally=FALSE`

for the latter.

If you run into memory issues turn cells on:

`useCells=TRUE`

liquidSVM also is able to calculate the kernel on a GPU if it is compiled with CUDA-support. Since there is a big overhead in moving the kernel matrix from the GPU memory, this is most useful for problems with many dimensions. We take here the Gisette data set which takes the digits `4`

and `9`

from the standard MNIST data of handwritten digits and adds some new attributes to obtain 6000 samples and 5000 attributes.

First we load the data into a liquidSVM-model:

```
gi <- liquidData('gisette')
model <- init.liquidSVM(Y~.,gi$train)
```

Now we train the model with and without GPU to compare:

```
system.time(trainSVMs(model,d=1,gpus=1,threads=1))
#> user system elapsed
#> 57 10 67
system.time(trainSVMs(model,d=1,gpus=0,threads=4))
#> user system elapsed
#> 392 1 110
```

Note that liquidSVM uses only as much threads as GPUs if GPUs are used at all, hence there is no elapsed time gain in comparison to 4 threads:

```
system.time(trainSVMs(model,d=1,gpus=1,threads=4))
#> user system elapsed
#> 94 42 67
system.time(trainSVMs(model,d=1,gpus=0,threads=1))
#> user system elapsed
#> 327 1 329
```

`libsvm`

We use `e1071::svm`

, which is a binding for `libsvm`

. You can install it from CRAN by using

`install.packages("e1071")`

liquidSVM’s out of the box behavior is to use cross-validation:

```
folds <- 5
co <- liquidData('covtype.1000')
system.time(ours <- svm(Y~., co$train, folds=folds, threads=2))
#> user system elapsed
#> 1.525 0.016 0.958
```

The user-time is about twice the elapsed since we use 2 threads here.

How can the same be achieved in `e1071::svm`

? First the parameter-search grid has to be converted:

```
GAMMA <- 1/(ours$gammas)^2
COST <- 1/(2 * (folds-1)/folds * nrow(co$train) * ours$lambdas)
```

And then we use `e1071::tune.svm`

to perform the cross-validation

```
system.time(e1071::tune.svm(Y~., data=co$train, gamma=GAMMA,cost=COST, scale=F, e1071::tune.control(cross=folds)))
#> user system elapsed
#> 382.364 0.832 385.521
```

Now, for bigger datasets this gives a consistent picture:

```
co <- liquidData('covtype.5000') # ca. 5000 rows
system.time(ours <- svm(Y~., co$train, folds=folds, threads=2))
#> user system elapsed
#> 30.237 0.120 15.676
system.time(e1071::tune.svm(Y~., data=co$train, gamma=GAMMA,cost=COST, scale=F, e1071::tune.control(cross=folds)))
#> user system elapsed
#> 11199.732 4.324 11238.407
```

liquidSVM is better also if training is done only on one grid:

```
co <- liquidData('covtype.10000') # ca. 10000 rows
gamma <- 3.1114822
cost <- 0.01654752
system.time(ours <- svm(Y ~ ., co$train, g=c("[",gamma,"]"), l=c("[",cost,"]",1),folds=1,threads=4,d=1))
#> user system elapsed
#> 4.836 0.356 2.134
system.time(theirs <- e1071::svm(Y~., co$train, gamma=1/gamma^2,cost=cost, scale=F))
#> user system elapsed
#> 26.502 0.032 26.618
```

This 10-times speed-up is possible on one hand due to more efficient computations of the kernel matrix and on the other hand due to liquidSVM’s faster optimization algorithm.

Again for bigger problems this holds true:

```
co <- liquidData('covtype.35000') # ca. 35000 rows
system.time(ours <- svm(Y ~ ., co$train, g=c("[",gamma,"]"), l=c("[",cost,"]",1),folds=1,threads=4,d=1))
#> user system elapsed
#> 99.830 4.544 36.949
system.time(theirs <- e1071::svm(Y~., co$train, gamma=1/gamma^2,cost=cost, scale=F))
#> user system elapsed
#> 330.557 0.176 331.834
```

To compare to the same big data set using full 10x10 grid search but just with one fold is still quicker:

```
system.time(ours <- svm(Y ~ ., co$train,folds=1,threads=4,d=0))
#> user system elapsed
#> 816.475 5.164 225.934
```

Basically in the time libsvm does one solution you get the whole grid search for free if you use liquidSVM.

liquidSVM organizes its work into tasks: E.g. in multiclass classification the problem has to be reduced into several binary classification problems. Or in Quantile regression, the SVM is learned simultaneously for different weights and then the selection of hyperparameters produces different tasks.

Behind the scenes `svm(formula, data, ...)`

does the following:

```
model <- init.liquidSVM(formula, data)
trainSVMs(model, ...)
selectSVMs(model)
```

The following learning scenarios hide these in higher level functions.

Multiclass classification has to be reduced to binary classification There are two strategies for this:

- all-vs-all: for every pairing of classes a binary SVM is trained
- one-vs-all: for every class a binary SVM is trained with that class as one label and all other classes are clumped together to another label

Then for any point in the test set, the winning label is chosen. A second choice to make is whether the hinge or the least-square loss should be used for the binary classification problems.

Let us look at the example dataset `banana-mc`

which has 4 labels:

Since there are 6 pairings, `AvA`

trains 6 tasks, whereas `OvA`

trains 4 tasks:

```
banana <- liquidData('banana-mc')
model <- mcSVM(Y~., banana, mc_type="AvA_hinge")
errors(model$last_result)
#> result 1vs2 1vs3 1vs4 2vs3 2vs4 3vs4
#> 0.217250 0.142083 0.111500 0.092500 0.073500 0.073500 0.000625
model$last_result[1:3,]
#> result 1vs2 1vs3 1vs4 2vs3 2vs4 3vs4
#> 1 1 1 1 1 2 4 4
#> 2 4 1 1 4 2 4 4
#> 3 4 1 1 4 2 4 4
model <- mcSVM(Y~., banana, mc_type="OvA_ls")
errors(model$last_result)
#> result 1vsOthers 2vsOthers 3vsOthers 4vsOthers
#> 0.2147 0.1545 0.1227 0.0777 0.0737
model$last_result[1:3,]
#> result 1vsOthers 2vsOthers 3vsOthers 4vsOthers
#> 1 1 0.99149 -0.964 -0.924 -0.928
#> 2 4 -0.45494 -1.000 -0.994 0.387
#> 3 1 -0.00657 -0.991 -0.993 -0.111
model <- mcSVM(Y~., banana, mc_type="AvA_ls")
errors(model$last_result)
#> result 1vs2 1vs3 1vs4 2vs3 2vs4 3vs4
#> 0.212500 0.140000 0.107000 0.089500 0.074000 0.074000 0.000625
model$last_result[1:3,]
#> result 1vs2 1vs3 1vs4 2vs3 2vs4 3vs4
#> 1 1 -0.963 -0.979 -0.9966 -0.605 0.894 0.995
#> 2 4 -0.753 -0.998 0.5268 -0.953 1.000 1.000
#> 3 1 -0.996 -1.000 -0.0506 -0.894 1.000 1.000
# For completeness the following is also possible even though you should not use it:
model <- mcSVM(Y~., banana, mc_type="OvA_hinge")
errors(model$last_result)
#> result 1vsOthers 2vsOthers 3vsOthers 4vsOthers
#> 0.2235 0.1555 0.1275 0.0795 0.0750
model$last_result[1:3,]
#> result 1vsOthers 2vsOthers 3vsOthers 4vsOthers
#> 1 1 1.000 -0.829 -0.720 -0.981
#> 2 4 -0.876 -0.995 -0.740 0.923
#> 3 4 -0.202 -0.987 -0.729 0.198
```

The first element in the errors gives the overall test error. The other errors correspond to the tasks. Also the result displays in the first column the final decision for a test sample, and in the other columns the results of the binary classifications. One can see nicely how the final prediction vote for any sample is based on the 4 or 6 binary tasks.

NOTE`AvA`

is usually faster, since every binary SVM just trains on the data belonging to only two labels. On the other hand`OvA_ls`

can give better results at the cost of longer training time.OvA_hinge should not be used as it is not universally consistent.

This uses the quantile solver with pinball loss and performs selection for every quantile provided.

```
reg <- liquidData('reg-1d')
quantiles_list <- c(0.05, 0.1, 0.5, 0.9, 0.95)
model <- qtSVM(Y ~ ., reg$train, weights=quantiles_list)
result_qt <- test(model,reg$test)
errors(result_qt)
#> [1] 0.00714 0.01192 0.02682 0.01251 0.00734
```

Now we plot this:

```
I <- order(reg$test$X1)
par(mar=rep(.1,4))
plot(Y~X1, reg$test[I,],pch='.', ylim=c(-.2,.8), ylab='', xlab='', axes=F)
for(i in 1:length(quantiles_list))
lines(reg$test$X1[I], result_qt[I,i], col=i+1)
```

In this plot you see estimations for two lower and upper quantiles as well as the median of the distribution of the label \(y\) given \(x\).

This uses the expectile solver with weighted least squares loss and performs selection for every weight. The 0.5-expectile in fact is just the ordinary least squares regression and hence estimates the mean of \(y\) given \(x\). And in the same way as quantiles generalize the median, expectiles generalize the mean.

```
reg <- liquidData('reg-1d')
expectiles_list <- c(0.05, 0.1, 0.5, 0.9, 0.95)
model <- exSVM(Y ~ ., reg$train, weights=expectiles_list)
result_ex <- test(model, reg$test)
errors(result_ex)
```

Now we plot this:

```
I <- order(reg$test$X1)
par(mar=rep(.1,4))
plot(Y~X1, reg$test[I,],pch='.', ylim=c(-.2,.8), ylab='', xlab='', axes=F)
for(i in 1:length(expectiles_list))
lines(reg$test$X1[I], result_ex[I,i], col=i+1)
legend('bottomright', col=6:2, lwd=1, legend=expectiles_list[5:1])
```

Neyman-Pearson-Learning attempts classification under the constraint that the probability of false positives (Type-I error) is bound by a significance level alpha, which is called here the NPL-constraint.

```
banana <- liquidData('banana-bc')
npl_constraints <- c(0.025,0.033,0.05,0.075,0.1)
# class=-1 specifies the normal class
model <- nplSVM(Y ~ ., banana, class=-1, constraint.factor=npl_constraints,threads=0,display=1)
result_npl <- model$last_result
errors(result_npl)
#> [1] 0.437 0.437 0.322 0.308 0.230
```

Now how did we do?

```
false_alarm_rate <- apply(result_npl[banana$test$Y==-1,]==1,2,mean)
detection_rate <- apply(result_npl[banana$test$Y==1,]==1,2,mean)
rbind(npl_constraints,false_alarm_rate,detection_rate)
#> 1 2 3 4 5
#> npl_constraints 0.025 0.0333 0.050 0.075 0.100
#> false_alarm_rate 0.037 0.0390 0.053 0.058 0.086
#> detection_rate 0.563 0.5630 0.678 0.692 0.770
```

You can see that the false alarm rate in the test set meet the NPL-constraints quite nicely, and on the other hand the the detection rate is increasing.

Receiver Operating Characteristic curve (ROC curve) plots trade-off between the false alarm rate and the detection rate for different weights (default is 9 weigts).

```
banana <- liquidData('banana-bc')
model <- rocSVM(Y ~ ., banana$train, threads=0,display=1)
result_roc <- test(model, banana$test)
```

Now you could quickly plot this curve using `plotROC(model, banana$test)`

but let’s do it by hand:

```
false_positive_rate <- apply(result_roc[banana$test$Y==-1,]==1,2,mean)
detection_rate <- apply(result_roc[banana$test$Y==1,]==1,2,mean)
plot(false_positive_rate, detection_rate, xlim=0:1,ylim=0:1,asp=1, type='b', pch='x')
abline(0,1,lty=2)
```

This shows nice learning, since the ROC curve is near the north-west corner.

A quicker way to calculate an ROC curve is to use least-squares regression which estimates the conditional probability at any point. For this we use the same method `plotROC`

:

```
model.ls <- lsSVM(Y~.,banana$train)
plotROC(model.ls, banana$test, xlim=0:1,ylim=0:1,asp=1, type='l')
points(false_positive_rate, detection_rate, pch='x', col='red')
```

And we see that both methods give almost the same results.

We provide an interface to calculating the kernel matrix of features:

```
covtype <- liquidData("covtype.1000")$train[1:100,-1]
a <- kern(covtype)
a[1:4,1:4]
#> [,1] [,2] [,3] [,4]
#> [1,] 1.0000 0.0159 0.10593 0.01112
#> [2,] 0.0159 1.0000 0.01296 0.06971
#> [3,] 0.1059 0.0130 1.00000 0.00755
#> [4,] 0.0111 0.0697 0.00755 1.00000
image(liquidSVM::kern(covtype, gamma=1.1, type="gauss"))
```

`image(liquidSVM::kern(covtype, gamma=1.1, type="poisson"))`

As a convenience we provide several datasets prepared for training and testing.

They can be imported by name e.g. using:

`liquidData('reg-1d')`

This loads both `reg-1d.train.csv`

as well as `reg-1d.test.csv`

into `reg$train`

and `reg$test`

respectively.

liquidData sets have a strict format, they are comma-separated values and no header. The first column is the label. It gets the variable name `Y`

. The other columns are the features and get variable names `X1, X2,`

…

You can also load samples as in the following examples

```
# take 10% of training and testing data
liquidData('reg-1d', prob=0.1)
# a sample of 400 train samples and the same relative size of test samples
liquidData('reg-1d', trainSize=400)
# a sample of 400 train samples and all test samples
liquidData('reg-1d', trainSize=400, testSize=Inf)
```

The sampling is done stratified by default if the target `Y`

is a factor.

Before getting these data sets from our website, `liquidData`

first tries some directories in the filesystem (configured by the character vector of locations in parameter `loc=`

):

- the working directory
`getwd()`

- in your home directory
`"~/liquidData"`

. In Windows,`~`

typically is`C:\Users\username\Documents`

- (some directories which make only sense if you are working at ISA)
- The webpage
`http://www.isa.uni-stuttgart.de/liquidData`

The data sets can be gzip-ped, which is recognized by the additional extension `.gz`

, e.g. `reg-1d.train.csv.gz`

and `reg-1d.test.csv.gz`

If you want to split any `data.frame`

into train/test and have it in the same format as above, use `ttsplit(...)`

. You can write such a model to your filesystem by use of `write.liquidData`

.