Week3quiz

Question 1

Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:

library(AppliedPredictiveModeling)
data(segmentationOriginal)
library(caret)

Subset the data to a training set and testing set based on the Case variable in the dataset.
Set the seed to 125 and fit a CART model to predict Class with the rpart method using all predictor variables and default caret settings.
In the final model, what would be the prediction for cases with the following variable values?

TotalIntenCh2=23,000; FiberWidthCh1=10; PerimStatusCh1=2
TotalIntenCh2=50,000; FiberWidthCh1=10; PerimStatusCh1=100
TotalIntenCh2=57,000; FiberWidthCh1=8; PerimStatusCh1=100
TotalIntenCh2=8; FiberWidthCh1=100; PerimStatusCh1=2

Tip: plot the resulting tree and use the plot to answer the question.

set.seed(125)
#inTrain<-createDataPartition(y=segmentationOriginal$Class, p=0.7, list=FALSE)
inTrain<-segmentationOriginal$Case
training<-segmentationOriginal[inTrain=="Train", ]
testing<-segmentationOriginal[inTrain=="Test", ]
dim(training); dim(testing)

modFit<-train(Class~., method="rpart", data=training)
print(modFit$finalModel)

plot(modFit$finalModel, uniform=TRUE, main="Classification Tree") # dendrogram
text(modFit$finalModel, use.n=TRUE, all=TRUE, cex=0.8)

library(rattle)
fancyRpartPlot(modFit$finalModel)
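
Reading a prediction off the plot means following the split rules from the root down to a leaf. As a minimal sketch of that idea, the block below fits an rpart tree to the built-in iris data (a stand-in, not the segmentation data) and compares the printed rules with what predict() returns. The data set, the object names fit and new_case, and the flower measurements are all illustrative assumptions.

```r
library(rpart)

# Stand-in data: iris ships with R; the quiz itself uses segmentationOriginal.
fit <- rpart(Species ~ ., data = iris)

# The printed tree lists each split rule and the class assigned at each node.
print(fit)

# A hypothetical flower; every predictor in the formula is supplied.
new_case <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
                       Petal.Length = 1.4, Petal.Width = 0.2)

# Tracing new_case through the printed rules by hand should agree with predict().
pred <- predict(fit, newdata = new_case, type = "class")
print(pred)
```

For the quiz cases only three variables are given, so tracing the plotted tree by hand, as the tip suggests, is the more direct route.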

Question 3

Load the olive oil data using the commands:

library(pgmm)
data(olive)
olive = olive[,-1]

These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of area for the following data frame using the tree command with all defaults.

modFit<-train(Area~., method="rpart", data=olive)
print(modFit$finalModel)

newdata = as.data.frame(t(colMeans(olive)))
predict(modFit, newdata=newdata)

What is the resulting prediction? Is the resulting prediction strange? Why or why not?

Answer: 2.783. It is strange because Area should be a qualitative variable - but tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata.

Note: we can change the Area variable to a factor variable to avoid the above issue.
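
The effect of the outcome type can be sketched on a small built-in data set. Here mtcars stands in for olive and cyl stands in for Area (both are assumptions made only for illustration): with a numeric outcome rpart grows a regression tree and predicts a leaf mean, while with a factor outcome it grows a classification tree and predicts a class label.

```r
library(rpart)

# Stand-in data: mtcars ships with R; cyl (4, 6, 8) plays the role of Area.
d <- mtcars[, c("cyl", "mpg", "hp", "wt")]

# Numeric outcome: rpart fits a regression tree, so leaves hold mean values.
fit_num <- rpart(cyl ~ ., data = d)

# Factor outcome: rpart fits a classification tree, so leaves hold class labels.
d_fac <- transform(d, cyl = factor(cyl))
fit_fac <- rpart(cyl ~ ., data = d_fac)

# One "average" case, built the same way as newdata in the quiz.
new_case <- as.data.frame(t(colMeans(d[, c("mpg", "hp", "wt")])))

pred_num <- predict(fit_num, newdata = new_case)                  # a number
pred_fac <- predict(fit_fac, newdata = new_case, type = "class")  # a level of cyl
print(pred_num)
print(pred_fac)
```

The same conversion applied to the quiz data would be olive$Area <- factor(olive$Area) before fitting.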

Question 4

Load the South Africa Heart Disease Data and create training and test sets with the following code:

library(ElemStatLearn)
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
trainSA = SAheart[train,]
testSA = SAheart[-train,]

Then set the seed to 13234 and fit a logistic regression model (method="glm", be sure to specify family="binomial") with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tobacco, type-A behavior, and low density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the "response" scale:

set.seed(13234)

modFit<-glm(as.factor(chd)~age+alcohol+obesity+tobacco+typea+ldl, family=binomial, data=trainSA)
summary(modFit)

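# type="response" returns probabilities; the default is the log-odds (link) scale
predict1<-predict(modFit, newdata=testSA, type="response")
predict2<-predict(modFit, newdata=trainSA, type="response")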

missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}

missClass(testSA$chd, predict1)
missClass(trainSA$chd, predict2)

What is the misclassification rate on the training set? What is the misclassification rate on the test set?

Answer: Test set misclassification: 0.31; training set misclassification: 0.27.
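
The 0.5 cutoff in missClass only makes sense for probabilities, which is why the scale of the prediction matters. A minimal sketch on built-in data (mtcars standing in for SAheart, with am as the 0/1 outcome; both are assumptions for illustration) shows how the two scales relate:

```r
# Stand-in data: mtcars ships with R; am (0/1) plays the role of chd.
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Default predictions are log-odds (link scale); type = "response" gives probabilities.
log_odds <- predict(fit, newdata = mtcars)
prob     <- predict(fit, newdata = mtcars, type = "response")

# The inverse logit maps one scale onto the other.
same_scale <- isTRUE(all.equal(unname(plogis(log_odds)), unname(prob)))
print(same_scale)

missClass <- function(values, prediction) {
  sum(((prediction > 0.5) * 1) != values) / length(values)
}

# Intended use: cutoff of 0.5 applied to probabilities.
rate_prob <- missClass(mtcars$am, prob)

# Applied to log-odds, the same function silently uses a probability cutoff of plogis(0.5).
rate_link <- missClass(mtcars$am, log_odds)
print(c(response = rate_prob, link = rate_link))
print(plogis(0.5))
```

A log-odds value of 0.5 corresponds to a probability of roughly 0.62, so thresholding the link scale at 0.5 can report a different error rate than the one the question asks for.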

Question 5

Load the vowel.train and vowel.test data sets:

library(ElemStatLearn)
data(vowel.train)
data(vowel.test)

Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables.

Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr

By default, the caret package uses the Gini importance.

Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?

[NOTE: Use randomForest() specifically, not caret, as there have been some issues reported with that approach. 11/6/2016]

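library(randomForest)

# y must be a factor so that a classification forest is fit
vowel.train$y<-as.factor(vowel.train$y)
vowel.test$y<-as.factor(vowel.test$y)

set.seed(33833)

# caret package
modFit<-train(y~., data=vowel.train, method="rf", prox=TRUE)
modFit$finalModel$importance

# randomForest function (seed reset so this fit is reproducible on its own)
set.seed(33833)
modFit2<-randomForest(y~., data=vowel.train, importance=TRUE)
importance(modFit2)
varImp(modFit2)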

Answer: the order of the variables is: x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10
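
An ordering like the one above can be read off by sorting the importance matrix rather than scanning it by eye. The sketch below assumes the randomForest package is installed and uses the built-in iris data as a stand-in for vowel.train; the seed and the names fit, gini and ranked are arbitrary choices for illustration.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only to make this sketch repeatable
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 2 selects the mean decrease in node impurity (Gini) measure.
gini <- importance(fit, type = 2)

# Sort predictor names from most to least important.
ranked <- rownames(gini)[order(gini[, 1], decreasing = TRUE)]
print(ranked)
```

Applied to the quiz model, the same two lines on importance(modFit2, type = 2) would list the x.* variables in decreasing order of Gini importance.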
