Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants.
The goal of your project is to predict the manner in which they did the exercise.
This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
There are two datasets: the train set and the test set which are directly downloadable online.
The train dataset has 19622 observations which is the dataset we use to build our prediction model. We will divide the train dataset into a training set and a test set in the analysis to assess the performance of the models before selecting the best model.
The test dataset has 20 observations to test our selected model, which will not be used until the end of the class project.
setwd("C:/Users/risin/PracticalMachineLearning") #getwd()
rm(list=ls())
URL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
if (!file.exists("train.csv")) {
download.file(URL, "./train.csv")
}
if (!file.exists("test.csv")) {
download.file(url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile="./test.csv")
}
download.time<-Sys.time()
training.Pre<-read.csv("./train.csv") # use na.strings argument to be interpreted as NAs
prediction.set<-read.csv("./test.csv")
dim(training.Pre); dim(prediction.set)
## [1] 19622 160
## [1] 20 160
The variable we try to predict “classe”, the last variable in the train dataset, is a factor varible that has five levels. It has no missing values.
After looking at the structure of the train dataset, we see that the first 5 variables (which are X, user_names and timestamp) are unrelated in prediction. So, we do not include them in the final dataset.
Then, we randomly split the dataset into training and testing sets. The typically picked proportion is 70% used for the training set and 30% for the testing set.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
str(training.Pre)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_arm : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_arm : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_arm : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_arm : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_arm : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_dumbbell : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
summary(training.Pre$classe)
## A B C D E
## 5580 3797 3422 3216 3607
training.Pre<-training.Pre[, 6:ncol(training.Pre)]
set.seed(12345)
inTrain<-createDataPartition(training.Pre$classe, p=0.7, list=FALSE)
training<-training.Pre[inTrain, ]
testing<-training.Pre[-inTrain, ]
Out of 160 variables in the training set, 62 variables have all NAs and the remaining 88 have no missing values. The case is clear cut and we will remove the variables that have no information for prediction. The number of predictors is now reduced to 88 variables in the training set.
table(colSums(is.na(training)))
##
## 0 13454
## 88 67
training<-training[, colSums(is.na(training))==0]
When a predictor has only one or few unique values relative to the number of observations in the dataset, it does not have enough variation to explain the outcome variable. When our purpose is to make accurate prediction, we can remove those variables from the dataset. The number of variables is now down to 54 which includes the outcome variable “classe”.
nsv<-nearZeroVar(training, saveMetrics=TRUE)
keep<-nsv$nzv==FALSE
training<-training[, row.names(nsv)[keep]]
We use k-fold cross validation to evaluate the performance of the chosen models. The choice of k is a tradeoff between bias and variance. Typically, k is set to be 5 or 10 since these values have been shown empirically to yield test error rate that is neither excessively biased nor highly varied. For faster running time, we pick k to be 5.
control<-trainControl(method="repeatedcv", number=5, repeats=3)
metric<-"Accuracy"
Our goal is to predict the outcome variable “classe”. It has 5 levels. Classification trees method is a natural choice under such a nonlinear setting and it is shown to be highly accurate. The model yields an accuracy rate on the testing set 0.999 and it took 30 minutes to run on the computer.
set.seed(12345)
Model_Fit<-train(classe~., data=training, method="rf", trControl=control, metric=metric)
Model_Fit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.18%
## Confusion matrix:
## A B C D E class.error
## A 3905 1 0 0 0 0.0002560164
## B 6 2650 2 0 0 0.0030097818
## C 0 3 2392 1 0 0.0016694491
## D 0 0 9 2243 0 0.0039964476
## E 0 0 1 2 2522 0.0011881188
modelrf<-confusionMatrix(testing$classe, predict(Model_Fit, testing))
The final model of gbm algorithm has an accuracy rate of 0.986 on the testing set, which is still very high, but not as high as the random forest model. The time to run is shorter, about 15 minutes.
set.seed(12345)
Model_alt_Fit<-train(classe~., data=training, method="gbm", trControl=control, metric=metric, verbose=FALSE)
Model_alt_Fit$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
modelgbm<-confusionMatrix(testing$classe, predict(Model_alt_Fit, testing))
We fit the two models with the same version of the training set. Through resampling the training set, we compare the predictive accuracy rates of the two algorithems. From the table with different quantiles of the estimates and the graph, it is very clear that the random forest model performs better in all aspects.
results<-resamples(list(rf=Model_Fit, gbm=Model_alt_Fit))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: rf, gbm
## Number of resamples: 15
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.9952693 0.9965410 0.9967249 0.9969668 0.9976346 0.9985444 0
## gbm 0.9810703 0.9830693 0.9843466 0.9845672 0.9868948 0.9876274 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf 0.9940156 0.9956243 0.9958580 0.9961632 0.9970084 0.9981589 0
## gbm 0.9760537 0.9785813 0.9802013 0.9804779 0.9834219 0.9843482 0
dotplot(results)
We examine the two methods on the testing set. The prediction tables and accuracy rates show the same thing: random forest provides more accurate results.
modelrf$table; modelgbm$table
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 1 1138 0 0 0
## C 0 2 1024 0 0
## D 0 0 2 962 0
## E 0 0 0 1 1081
## Reference
## Prediction A B C D E
## A 1666 7 0 1 0
## B 13 1112 14 0 0
## C 0 11 1013 2 0
## D 0 5 20 939 0
## E 0 4 1 6 1071
modelrf$overall[1:2]; modelgbm$overall[1:2]
## Accuracy Kappa
## 0.9989805 0.9987104
## Accuracy Kappa
## 0.9857264 0.9819434
Given that the two models have very high accuracy rates, the combined model shall yield even better predictive power. We then stack the two models together.
First, we want to check the correlation between the two methods. A lower correlation is preferred since there is additional independent information in each of the models. The correlation is calculated at 0.428 which is well below 0.75 (the commonly recognized threshold of “a high correlation”). It implies that the two models caputures different variations from the same dataset.
modelCor(results)
## rf gbm
## rf 1.0000000 0.4284244
## gbm 0.4284244 1.0000000
splom(results)
Then, we prepare and stack the two models together. The combined model does improve on the random forest method. However, the improvement is only marginal, from 0.999 to 0.9992 which is almost negligible.
We will use the combined model and the random forest model to predict the 20 cases in the test set. The difference is expected to be minimal.
pred_rf<-predict(Model_Fit, testing)
pred_gbm<-predict(Model_alt_Fit, testing)
predDF<-data.frame(pred_rf, pred_gbm, classe=testing$classe)
set.seed(12345)
combModFit<-train(classe~., method="rf", data=predDF, trControl=control, metric=metric)
confusionMatrix(testing$classe, predict(combModFit, predDF))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 1 1138 0 0 0
## C 0 2 1024 0 0
## D 0 0 2 962 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.9992
## 95% CI : (0.998, 0.9997)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9989
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9982 0.9981 1.0000 1.0000
## Specificity 1.0000 0.9998 0.9996 0.9996 1.0000
## Pos Pred Value 1.0000 0.9991 0.9981 0.9979 1.0000
## Neg Pred Value 0.9998 0.9996 0.9996 1.0000 1.0000
## Prevalence 0.2846 0.1937 0.1743 0.1635 0.1839
## Detection Rate 0.2845 0.1934 0.1740 0.1635 0.1839
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9997 0.9990 0.9988 0.9998 1.0000
As expected, the random forest model and the combined model offer the identical predictions on the 20 cases.
pred1<-predict(Model_Fit, prediction.set)
pred2<-predict(Model_alt_Fit, prediction.set)
predVDF<-data.frame(pred_rf=pred1, pred_gbm=pred2)
combpred<-predict(combModFit, predVDF)
#compare the prediction results of random forest and the stack model
sum(predVDF$pred_rf %in% combpred)
## [1] 20
In order to predict the 20 cases correctly and assuming independence of the cases, the method shall have an accuracy rate of 0.995 so that the probability that all 20 cases are correct to be 90%. We are aiming at an accuracy rate of at least 0.99.
Given that we have 53 predictors after removing the near zero variance variables, we try to further reduce the predictor set to imrpove on machine time without much loss of accuracy.
We examine the correlation matrix of the remaining predictors. There are 15 pairs that are highly correlated out of the 1378 (53*52/2) pairs in total. It is not a significant proportion (only about 1%). We still decide to use the principal component analysis to reduce the number of predictors. The upside is the model run time will be reduced with a smaller number of variables that are as independent as possible.
M<-abs(cor(training[, -54]))
diag(M)<-0
highcorr<-which(M>0.8, arr.ind=TRUE)
nrow(highcorr)
From PCA analysis, we find that 27 components are needed to capture 95% of the variance. The number of variables is further reduced from 53 to 27. The model building will be based on the PCA results.
preProc<-preProcess(training[, -54], method="pca") #can set argument "pcaComp=30"
trainPC<-predict(preProc, training[, -54])
testPC<-predict(preProc, testing[, -155]) #classe is the last vbl
Based on the 27 principal components, we run the random forest model. The accuracy of the final model is 0.977. It takes about 15 minutes to train the model.
When the number of PCA is increased to 30 (which captures 97.5% of the variation), the model accuracy doesn’t improve much, 0.978 after retraining the model.
set.seed(12345)
modFit<-train(x=trainPC, y=training$classe, method="rf", trControl=control, metric=metric)
modFit$finalModel
confusionMatrix(testing$classe, predict(modFit, testPC))
We try the gbm as well. The accuracy of the final model is 0.814.
sed.seed(12345)
modFit_alt<-train(x=trainPC, y=training$classe, method="gbm", trControl=control, metric=metric, verbose=FALSE)
modFit_alt$finalModel
confusionMatrix(testing$classe, predict(modFit_alt, testPC))
Here is the support vector machine method. It takes only 5 minutes to run the model and has an accuracy rate of 0.947 in the test set. It is inferior to the random forest and the gradient boosting methods. Hence, it is not used.
We need to make changes to the test set since the svm
function requires the training and the testing set to be of the same structure.
set.seed(12345)
library(e1071)
testingsvm<-testing[, row.names(nsv)[keep]]
dim(training); dim(testingsvm)
svmfit<-svm(classe~., data=training)
svmpred<-predict(svmfit, testing[, -54])
confusionMatrix(testing$classe, svmpred)