Key Words/Concepts of Week 3

This week focuses primarily on the prediction of categorical variables.

Decision Tree

Key ideas

  • iteratively split the data into groups based on the values of the predictor variables
  • evaluate the “homogeneity” within each group
  • split again if necessary

features

  1. classification trees tend to perform better than linear models in non-linear settings and are easy to interpret
  2. data transformation is less important: a monotonic transformation of a predictor will not change the fitted tree
  3. results may vary from sample to sample, and uncertainty is harder to estimate
  4. without pruning or cross-validation, a tree may overfit the training data

basic algorithm

  1. Start with all variables in one group.
  2. Find the variable/split that best separates the outcomes.
  3. Divide the data into two groups (“leaves”) on that split (“node”).
  4. Within each leaf, find the best variable/split that separates the outcomes (again considering all the variables).
  5. Continue until the groups are too small or sufficiently “pure” (see the rpart.control sketch below).
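
As a minimal sketch of how the stopping rule in step 5 is usually controlled in practice (assuming the rpart package; minsplit and cp are illustrative values, here set to rpart's defaults):

library(rpart)
data(iris)
#minsplit: smallest group that rpart will still attempt to split
#cp: minimum improvement in fit required for a split to be kept
fit<-rpart(Species~., data=iris, control=rpart.control(minsplit=20, cp=0.01))
print(fit)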

measures of homogeneity (or more precisely, impurity); a short sketch of all three follows the list

  1. misclassification error (0 is perfect purity and 0.5 is no purity, for two classes)
  2. Gini index (0 is perfect purity and 0.5 is no purity, for two classes)
  3. deviance/information gain (or “entropy”) - 0 is perfect purity and 1 is no purity (for two classes)
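
A quick sketch of the three measures for a two-class node with class proportions p and 1 - p (the plot is only for illustration):

#impurity measures for a two-class node as a function of the class proportion p
p<-seq(0.01, 0.99, by=0.01)
misclass<-pmin(p, 1-p)                            #misclassification error
gini<-2*p*(1-p)                                   #Gini index
entropy<- -(p*log2(p)+(1-p)*log2(1-p))            #entropy (base 2)
plot(p, entropy, type="l", col="blue", ylab="impurity")
lines(p, gini, col="red"); lines(p, misclass, col="darkgreen")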

Example of Decision Tree Model

data(iris); library(ggplot2); library(caret)
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
table(iris$Species) #try to predict Species
## 
##     setosa versicolor  virginica 
##         50         50         50
inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]
testing<-iris[-inTrain, ]
dim(training); dim(testing)
## [1] 105   5
## [1] 45  5
# three clearly separated groups, which may be hard to fit with a linear model
qplot(Petal.Width, Sepal.Width, colour=Species, data=training)

library(caret)
modFit<-train(Species~., method="rpart", data=training) #"rpart" fits a recursive partitioning (decision) tree
print(modFit$finalModel)
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 70 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 35  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 70 35 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Length< 4.85 34  2 versicolor (0.00000000 0.94117647 0.05882353) *
##     7) Petal.Length>=4.85 36  3 virginica (0.00000000 0.08333333 0.91666667) *
#an alternative way to view the model - a tree plot
plot(modFit$finalModel, uniform=TRUE, main="Classification Tree") # dendrogram
text(modFit$finalModel, use.n=TRUE, all=TRUE, cex=0.8)

# a prettier plot; requires the "rattle" package
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(modFit$finalModel)

# make prediction using the decision tree model
table(predict(modFit, newdata=testing), testing$Species)
##             
##              setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         1
##   virginica       0          1        14

Bagging (bootstrap aggregating)

Key ideas

  1. Resample the observations (bootstrap) and refit a model of the same type (e.g. a decision tree) on each resample.
  2. Average the resulting predictions (or take a majority vote for a categorical outcome) to form the final prediction.

The advantage of bagging is that the bias is similar to that of a single model, but the variance of the prediction is reduced. Bagging has been shown to be most useful for non-linear models.

Example of Bagging

library(ElemStatLearn); data(ozone, package="ElemStatLearn")
ozone<-ozone[order(ozone$ozone), ] #try to predict temperature as a function of ozone
str(ozone)
## 'data.frame':    111 obs. of  4 variables:
##  $ ozone      : num  1 4 6 7 7 8 9 9 9 10 ...
##  $ radiation  : int  8 25 78 48 49 19 24 36 24 264 ...
##  $ temperature: int  59 61 57 80 69 61 81 72 71 73 ...
##  $ wind       : num  9.7 9.7 18.4 14.3 10.3 20.1 13.8 14.3 10.9 14.3 ...
#take 10 bootstrap samples (sampled with replacement) from the dataset
#the ozone variable ranges from 1 to 168; we predict temperature at ozone values 1 to 155
ll<-matrix(NA, nrow=10, ncol=155) 

for (i in 1:10){
    ss<-sample(1:dim(ozone)[1], replace=T)
    ozone0<-ozone[ss, ]; ozone0<-ozone0[order(ozone0$ozone), ]
    #similar to spline, loess produces a smooth curve
    loess0<-loess(temperature~ozone, data=ozone0, span=0.2) #span option controls how smooth the loess curve is 
    ll[i, ]<-predict(loess0, newdata=data.frame(ozone=1:155)) 
}

#data points 
plot(ozone$ozone, ozone$temperature, pch=19, cex=0.5)
#the ten estimation lines from bootstrap samples
for (i in 1:10) {lines(1:155, ll[i, ], col="grey", lwd=2)} 
#the average estimation (bagging) line
lines(1:155, apply(ll, 2, mean), col="red", lwd=2)

In the train function, several bagged models are available out of the box; use the method option to choose bagEarth, treebag, bagFDA, etc.
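
For example, a minimal sketch of bagged regression trees on the same ozone data (method="treebag" relies on the ipred package being installed):

library(caret)
modBag<-train(temperature~ozone, data=ozone, method="treebag") #bagged trees
predict(modBag, newdata=data.frame(ozone=c(25, 50, 100)))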

Alternatively, you can build your own bagging procedure using caret's bag function.

library(party)
predictors=data.frame(ozone=ozone$ozone)
temperature=ozone$temperature
treebag<-bag(predictors, temperature, B=10, bagControl=bagControl(fit=ctreeBag$fit, predict=ctreeBag$pred, aggregate=ctreeBag$aggregate))

#data points
plot(ozone$ozone, temperature, col="lightgrey", pch=19)
#first tree estimate
points(ozone$ozone, predict(treebag$fits[[1]]$fit, predictors), pch=19, col="red")
#combined 10 tree estimates
points(ozone$ozone, predict(treebag, predictors), pch=19, col="blue")

Random Forest

Bagging is often used with trees. An extension is the random forest, which is highly accurate and popular in many data science competitions.

Key Idea

  1. Similar to bagging, each tree is grown on a bootstrap sample of the data.
  2. The difference: at each split, only a random subset of the variables is considered as split candidates, which produces a diverse set of trees.
  3. Across these diverse trees, take a vote (or a weighted average) to get the final prediction.

The drawbacks: the method can be slow to run, although parallel computing helps (see the sketch below); it may overfit, and when it does it is hard to tell which trees cause the problem; and because the result is a combination of many tree models, the decision rule is harder to interpret.
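
A minimal sketch of parallel training with caret (assuming the doParallel and randomForest packages are installed; mtry is the number of variables sampled at each split and the value 2 is only illustrative):

library(caret); library(doParallel)
data(iris)
cl<-makePSOCKcluster(2)            #start 2 worker R processes
registerDoParallel(cl)
modRF<-train(Species~., data=iris, method="rf",
             tuneGrid=data.frame(mtry=2),                #variables tried at each split
             trControl=trainControl(allowParallel=TRUE)) #let caret use the workers
stopCluster(cl)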

Example of Random Forest

data(iris); library(ggplot2)
inTrain <- createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training <- iris[inTrain,]
testing <- iris[-inTrain,]

library(caret); library(randomForest)
modFit <- train(Species~ .,data=training,method="rf",prox=TRUE) #"rf" is the random forest method; prox=TRUE produces extra info

getTree(modFit$finalModel,k=2) #look at a specific tree
##    left daughter right daughter split var split point status prediction
## 1              2              3         4        0.80      1          0
## 2              0              0         0        0.00     -1          1
## 3              4              5         3        4.85      1          0
## 4              6              7         1        4.95      1          0
## 5              8              9         4        1.75      1          0
## 6              0              0         0        0.00     -1          3
## 7              0              0         0        0.00     -1          2
## 8             10             11         4        1.60      1          0
## 9              0              0         0        0.00     -1          3
## 10             0              0         0        0.00     -1          3
## 11             0              0         0        0.00     -1          2
irisP <- classCenter(training[,c(3,4)], training$Species, modFit$finalModel$prox) #take only `Petal.Length` and `Petal.Width` in irisP
irisP <- as.data.frame(irisP); irisP$Species <- rownames(irisP)
p <- qplot(Petal.Width, Petal.Length, col=Species,data=training)
p + geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)

pred <- predict(modFit,testing) 
testing$predRight <- pred==testing$Species
table(pred,testing$Species)
##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         12         0
##   virginica       0          3        15
qplot(Petal.Width,Petal.Length,colour=predRight,data=testing,main="newdata Predictions")

Boosting

Along with random forests, boosting also produces very accurate predictions.

Key idea

  1. Take lots of (possibly) weak models
     • take k classifiers, typically from the same family (e.g. all possible trees, regression models, cutoff points)
  2. Weight the models to take advantage of their strengths and add them up
  3. You end up with a stronger model

The final prediction function is a weighted sum of the individual classifiers.

The goal is to minimize error on the training set.

The algorithm works iteratively: at each step, select one classifier, calculate its weight based on its error, upweight the misclassified observations, and then select the next classifier.
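
The sketch below works through this loop in miniature (AdaBoost-style weighting with one-split rpart "stumps" on simulated two-class data; the data, the number of rounds M, and the stump learner are illustrative choices, not part of the original notes):

#toy AdaBoost-style loop: rpart stumps on simulated data (illustration only)
library(rpart)
set.seed(1)
n<-200
dat<-data.frame(x1=rnorm(n), x2=rnorm(n))
y<-ifelse(dat$x1+dat$x2>0, 1, -1)                  #true class, coded -1/+1
dat$y<-factor(y)

M<-20                                              #number of boosting rounds
w<-rep(1/n, n)                                     #observation weights
alphas<-numeric(M); stumps<-vector("list", M)

for (m in 1:M){
    dat$w<-w
    #fit a one-split tree ("stump") using the current observation weights
    stumps[[m]]<-rpart(y~x1+x2, data=dat, weights=w, control=rpart.control(maxdepth=1))
    pred<-as.numeric(as.character(predict(stumps[[m]], dat, type="class")))
    err<-max(sum(w*(pred!=y))/sum(w), 1e-10)       #weighted error (kept away from 0)
    alphas[m]<-0.5*log((1-err)/err)                #weight given to this classifier
    w<-w*exp(-alphas[m]*y*pred)                    #upweight the misclassified points
    w<-w/sum(w)
}

#final prediction: sign of the weighted vote over all stumps
score<-rowSums(sapply(1:M, function(m)
    alphas[m]*as.numeric(as.character(predict(stumps[[m]], dat, type="class")))))
mean(sign(score)==y)                               #training accuracy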

One large subclass of boosting is gradient boosting. R packages include:

  • gbm - boosting with trees
  • mboost - model-based boosting
  • ada - statistical boosting based on additive logistic regression

AdaBoost is perhaps the most famous boosting method.
AdaBoost on Wikipedia

Helpful boosting tutorial

Example of Boosting

library(ISLR); data(Wage); library(ggplot2); library(caret); 
Wage<-subset(Wage, select=-c(logwage))
inTrain<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training<-Wage[inTrain, ]; testing<-Wage[-inTrain, ]

#"gbm" is boosting with trees; verbose=FALSE gets ride of many intermediate output
modFit<-train(wage~., method="gbm", data=training, verbose=FALSE)
#print(modFit) 

qplot(predict(modFit, testing), wage, data=testing)

Parametric Model (Bayesian update) Prediction

Key idea

  1. Assume the data follow a probabilistic model.
    For example, if the outcome variable is count data, a Poisson distribution is appropriate.

  2. Use Bayes’ theorem to identify optimal classifiers under the assumed model: the probability that y is in a particular class, given that the predictors take specific values, is a ratio whose numerator is the probability of the features given that y is in the class times the probability of y being in the class, and whose denominator is the overall probability of the features.

\[Pr(Y=k|X=x) = \frac{Pr(X=x|Y=k)Pr(Y=k)}{\sum_{\ell=1}^K Pr(X=x|Y=\ell)Pr(Y=\ell)}\]

Using \(f_k(x)\) for the density of the features within class \(k\) and \(\pi_k\) for the prior probability of class \(k\), this is written as:
\[Pr(Y=k|X=x) = \frac{f_k(x) \pi_k}{\sum_{\ell = 1}^K f_{\ell}(x) \pi_{\ell}}\] Let’s write out the multivariate model (many regressors):
\[P(Y = k | X_1,\ldots,X_m) = \frac{\pi_k P(X_1,\ldots,X_m| Y=k)}{\sum_{\ell = 1}^K P(X_1,\ldots,X_m | Y=\ell) \pi_{\ell}}\]

The prior probabilities \(\pi_k\) are specified in advance, and the class-conditional densities \(f_k(x)\) are given a parametric form. A common choice is Gaussian (two parameters, mean and variance; for simplicity, assume the off-diagonal covariance matrix elements are zero). Estimate the parameters from the data, then predict y from the calculated probabilities \(Pr(Y=k|X=x)\) by choosing the category with the highest probability.

Notice that, for a given observation, the denominator is the same for every class \(k\). Hence, maximizing the numerator over \(k\) is equivalent to maximizing the posterior probability.

\[Pr(Y=k|X=x) \propto \pi_k P(X_1,\ldots,X_m| Y=k)\]
Applying the definition of conditional probability repeatedly (the chain rule) to \(X_1, X_2, \ldots, X_m\) gives:
\[P(X_1,\ldots,X_m, Y=k) = \pi_k P(X_1 | Y = k)P(X_2,\ldots,X_m | X_1,Y=k)\]
\[= \pi_k P(X_1 | Y = k) P(X_2 | X_1, Y=k) P(X_3,\ldots,X_m | X_1,X_2, Y=k)\]
\[= \pi_k P(X_1 | Y = k) P(X_2 | X_1, Y=k)\ldots P(X_m|X_1,\ldots,X_{m-1},Y=k)\]

Several popular models:

  1. linear discriminant analysis: multivariate Gaussian with the same covariance matrix in every class
  2. quadratic discriminant analysis: multivariate Gaussian with a different covariance matrix in each class
  3. naive Bayes: assumes the predictors (features) are independent given the class

Naive Bayes assumes independence across all the Xs, which lets us write the probability as a product of single-feature terms, each conditional on the class.

\[ \approx \pi_k P(X_1 | Y = k) P(X_2 | Y = k)\ldots P(X_m | Y=k)\]
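
A minimal sketch of fitting two of these model-based classifiers with caret on the iris data (the split mirrors the earlier examples; method="lda" relies on the MASS package and method="nb" on the klaR package):

library(caret); data(iris)
inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]; testing<-iris[-inTrain, ]

modLDA<-train(Species~., data=training, method="lda")    #linear discriminant analysis
modNB<-train(Species~., data=training, method="nb")      #naive Bayes
table(predict(modLDA, testing), predict(modNB, testing)) #how often the two methods agree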

Copyright © 2019 Cathy Gao at cathygao.2019@outlook.com.