LogitBoost.Rd
Trains a LogitBoost classification algorithm using decision stumps (one-node decision trees) as weak learners.
LogitBoost(xlearn, ylearn, nIter=ncol(xlearn))
xlearn: A matrix or data frame with training data. Rows contain samples and columns contain features.
ylearn: Class labels for the training data samples; a response vector with one label for each row of xlearn. Can be a factor, character, or numeric vector.
nIter: An integer giving the number of iterations for which boosting should be run, which is also the number of decision stumps that will be used.
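For illustration, a minimal call on the iris data (also used in the Examples below); when nIter is omitted it defaults to ncol(xlearn):

library(caTools)
data(iris)
# fit with the default number of iterations, nIter = ncol(xlearn) = 4 here
model = LogitBoost(iris[, -5], iris[, 5])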
The function was adapted from the logitboost.R function written by Marcel
Dettling; see References and the "See Also" section. The code was modified to
make it much faster for very large data sets. The speed-up was achieved by
implementing an internal version of the decision stump classifier instead of
calling rpart, so that some of the most time-consuming operations are
precomputed once instead of being repeated at each iteration. Another
difference is that the training and testing phases of the classification
process are split into separate functions.
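As a rough sketch only (not the package's internal code; all names below are hypothetical), a single decision stump classifies samples by comparing one feature against a threshold:

# illustrative one-node decision stump; not the internal caTools implementation
stump_predict = function(x, column, threshold, direction, lablist) {
  # direction =  1: values above the threshold are labeled lablist[1]
  # direction = -1: values above the threshold are labeled lablist[2]
  above = x[, column] > threshold
  if (direction == 1) ifelse(above, lablist[1], lablist[2])
  else                ifelse(above, lablist[2], lablist[1])
}
# e.g. split iris on Petal.Length (column 3) at 2.5
stump_predict(iris[, -5], column = 3, threshold = 2.5,
              direction = -1, lablist = c("setosa", "not setosa"))[1:5]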
An object of class "LogitBoost" with the following components:
Stump: the decision stumps (one-node decision trees) used, described by three columns:
column 1: the feature number each stump operates on, i.e. which column of xlearn it splits
column 2: the threshold used for that column
column 3: the direction of the split: 1 means that samples with values above the threshold are labeled as lablist[1]; -1 means the opposite
If there are more than two classes, several such stump matrices are cbind'ed together.
lablist: the names of the classes
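For example, the fitted stumps and class names can be inspected directly (the component names below follow the description above and are assumed, not guaranteed by the package):

library(caTools)
model = LogitBoost(iris[, -5], iris[, 5], nIter = 20)
head(model$Stump)   # assumed name: feature, threshold, direction per stump
model$lablist       # assumed name: class labels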
Dettling and Buhlmann (2002), Boosting for Tumor Classification of Gene Expression Data.
predict.LogitBoost implements the prediction half of the LogitBoost code.
The logitboost function from the logitboost package (not on CRAN or
Bioconductor) is very similar, but much slower on very large datasets. It also
performs optional cross-validation.
library(caTools)
data(iris)
Data = iris[,-5]
Label = iris[, 5]
# basic interface
model = LogitBoost(Data, Label, nIter=20)
Lab = predict(model, Data)
Prob = predict(model, Data, type="raw")
res = cbind(Lab, Prob)
res[1:10, ]
#> Lab setosa versicolor virginica
#> [1,] 1 1 0.017986210 1.522998e-08
#> [2,] 1 1 0.002472623 3.353501e-04
#> [3,] 1 1 0.017986210 8.315280e-07
#> [4,] 1 1 0.002472623 4.539787e-05
#> [5,] 1 1 0.017986210 1.522998e-08
#> [6,] 1 1 0.017986210 1.522998e-08
#> [7,] 1 1 0.017986210 8.315280e-07
#> [8,] 1 1 0.017986210 1.522998e-08
#> [9,] 1 1 0.002472623 3.353501e-04
#> [10,] 1 1 0.002472623 4.539787e-05
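# illustrative check (not part of the package API): for most rows the
# predicted class is simply the column with the largest raw probability
head(colnames(Prob)[apply(Prob, 1, which.max)], 10)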
# two alternative ways to call the prediction function
p = predict(model, Data)
q = predict.LogitBoost(model, Data)
pp = p[!is.na(p)]; qq = q[!is.na(q)]
stopifnot(pp == qq)
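# p can contain NA for samples the model could not classify unambiguously
# (hence the is.na() filtering above); count the unclassified samples
sum(is.na(p))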
# accuracy increases with nIter (at least for the training set)
table(predict(model, Data, nIter= 2), Label)
#> Label
#> setosa versicolor virginica
#> setosa 48 0 0
#> versicolor 0 45 1
#> virginica 0 3 45
table(predict(model, Data, nIter=10), Label)
#> Label
#> setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 47 0
#> virginica 0 1 47
table(predict(model, Data), Label)
#> Label
#> setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 49 0
#> virginica 0 0 48
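# illustrative sketch: training accuracy as a function of nIter, measured
# over the samples that received a (non-NA) prediction
sapply(c(2, 10, 20), function(k)
  mean(as.character(predict(model, Data, nIter = k)) == as.character(Label),
       na.rm = TRUE))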
# example of splitting the data into train and test sets
mask = sample.split(Label)
model = LogitBoost(Data[mask,], Label[mask], nIter=10)
table(predict(model, Data[!mask,], nIter=2), Label[!mask])
#>
#> setosa versicolor virginica
#> setosa 17 0 0
#> versicolor 0 17 1
#> virginica 0 0 12
table(predict(model, Data[!mask,]), Label[!mask])
#>
#> setosa versicolor virginica
#> setosa 17 0 0
#> versicolor 0 17 2
#> virginica 0 0 15
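# rough test-set accuracy over the samples that received a (non-NA) prediction
pred = predict(model, Data[!mask,])
mean(as.character(pred) == as.character(Label[!mask]), na.rm = TRUE)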