Background of Class Project

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants.

Data Source and Documentation

Training Data - csv file: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Test Data - csv file: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Data Resource (website)

Goal

The goal of your project is to predict the manner in which the participants did the exercise.

This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Part I. Load Data

There are two datasets, the train set and the test set, both directly downloadable online.

The train dataset has 19622 observations and is the dataset we use to build our prediction model. In the analysis we split it further into a training set and a testing set so we can assess the performance of the candidate models before selecting the best one.

The test dataset has 20 observations for testing the selected model; it is not touched until the end of the class project.

setwd("C:/Users/risin/PracticalMachineLearning") #getwd()
rm(list=ls())

URL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
if (!file.exists("train.csv")) {
    download.file(URL, "./train.csv")
}

if (!file.exists("test.csv")) {
    download.file(url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile="./test.csv")
}

download.time<-Sys.time() # record when the files were downloaded

training.Pre<-read.csv("./train.csv") # the na.strings argument could be used here to read blanks and "#DIV/0!" entries as NAs
prediction.set<-read.csv("./test.csv")

dim(training.Pre); dim(prediction.set)
## [1] 19622   160
## [1]  20 160
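
Note: the comment on the read.csv call above refers to the many summary columns in this file that contain blank strings and “#DIV/0!” entries (visible in str() in Part II). As a hedged alternative, not used in the analysis below, those entries could be converted to NAs at load time (training.na is a hypothetical object name):

training.na<-read.csv("./train.csv", na.strings=c("NA", "", "#DIV/0!"))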

Part II. Exploratory Analysis

The variable we try to predict, “classe”, is the last variable in the train dataset. It is a factor variable with five levels and has no missing values.

After looking at the structure of the train dataset, we see that the first five variables (the row index X, user_name, and the three timestamp variables) carry no useful information for prediction, so we do not include them in the final dataset.

Then we randomly split the dataset into training and testing sets, using the typical proportion of 70% for the training set and 30% for the testing set.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
str(training.Pre)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_belt     : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt      : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt.1    : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_arm      : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_arm        : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_arm       : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_arm      : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_arm        : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_dumbbell  : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]
summary(training.Pre$classe)
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
training.Pre<-training.Pre[, 6:ncol(training.Pre)] # drop the first five columns (row index, user name, timestamps)

set.seed(12345)

inTrain<-createDataPartition(training.Pre$classe, p=0.7, list=FALSE)

training<-training.Pre[inTrain, ]

testing<-training.Pre[-inTrain, ]
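
A quick sanity check on the split (the 13737 and 5885 row counts follow from the 70/30 partition of the 19622 observations; 155 columns remain after dropping the first five):

dim(training); dim(testing) # expect 13737 x 155 and 5885 x 155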

A. Missing Values

Of the 155 variables left in the training set, 67 are missing in almost every row (13454 NAs each) and the remaining 88 have no missing values at all. The case is clear cut, and we remove the variables that carry essentially no information for prediction. The training set is now reduced to 88 variables.

table(colSums(is.na(training)))
## 
##     0 13454 
##    88    67
training<-training[, colSums(is.na(training))==0]

B. Remove Near Zero Variance Predictors

When a predictor has only one or a few unique values relative to the number of observations in the dataset, it does not have enough variation to help explain the outcome variable. Since our purpose is accurate prediction, we can remove those variables from the dataset. The number of variables is now down to 54, including the outcome variable “classe”.

nsv<-nearZeroVar(training, saveMetrics=TRUE)
keep<-nsv$nzv==FALSE

training<-training[, row.names(nsv)[keep]]
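
A quick check that the reduction worked as described (54 columns, including classe, should remain):

dim(training) # expect 13737 rows and 54 columns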

Part III. Model Building

We use k-fold cross validation to evaluate the performance of the chosen models. The choice of k is a tradeoff between bias and variance. Typically, k is set to 5 or 10, since these values have been shown empirically to yield test error estimates that are neither excessively biased nor highly variable. For faster running time, we pick k = 5 and repeat the cross validation 3 times (15 resamples in total).

control<-trainControl(method="repeatedcv", number=5, repeats=3)
metric<-"Accuracy"

Model 1: Random Forest

Our goal is to predict the outcome variable “classe”, which has 5 levels. A tree-based classification method is a natural choice in such a nonlinear setting, and random forests are known to be highly accurate. The model yields an accuracy rate of 0.999 on the testing set (an expected out-of-sample error of about 0.1%) and took about 30 minutes to run.

set.seed(12345)
Model_Fit<-train(classe~., data=training, method="rf", trControl=control, metric=metric)
Model_Fit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.18%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    1    0    0    0 0.0002560164
## B    6 2650    2    0    0 0.0030097818
## C    0    3 2392    1    0 0.0016694491
## D    0    0    9 2243    0 0.0039964476
## E    0    0    1    2 2522 0.0011881188
modelrf<-confusionMatrix(testing$classe, predict(Model_Fit, testing))

Model 2: Gradient Boosting

The final model from the gbm algorithm has an accuracy rate of 0.986 on the testing set, which is still very high but not as high as the random forest model. Its running time is shorter, about 15 minutes.

set.seed(12345)
Model_alt_Fit<-train(classe~., data=training, method="gbm", trControl=control, metric=metric, verbose=FALSE)
Model_alt_Fit$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
modelgbm<-confusionMatrix(testing$classe, predict(Model_alt_Fit, testing))

Compare the Two Models

We fit the two models on the same version of the training set. Using the cross-validation resamples, we compare the predictive accuracy of the two algorithms. From the table of quantiles of the estimates and from the plot, it is clear that the random forest model performs better in every respect.

results<-resamples(list(rf=Model_Fit, gbm=Model_alt_Fit))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: rf, gbm 
## Number of resamples: 15 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf  0.9952693 0.9965410 0.9967249 0.9969668 0.9976346 0.9985444    0
## gbm 0.9810703 0.9830693 0.9843466 0.9845672 0.9868948 0.9876274    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## rf  0.9940156 0.9956243 0.9958580 0.9961632 0.9970084 0.9981589    0
## gbm 0.9760537 0.9785813 0.9802013 0.9804779 0.9834219 0.9843482    0
dotplot(results)

We examine the two methods on the testing set. The prediction tables and accuracy rates show the same thing: random forest provides more accurate results.

modelrf$table; modelgbm$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    1 1138    0    0    0
##          C    0    2 1024    0    0
##          D    0    0    2  962    0
##          E    0    0    0    1 1081
##           Reference
## Prediction    A    B    C    D    E
##          A 1666    7    0    1    0
##          B   13 1112   14    0    0
##          C    0   11 1013    2    0
##          D    0    5   20  939    0
##          E    0    4    1    6 1071
modelrf$overall[1:2]; modelgbm$overall[1:2]
##  Accuracy     Kappa 
## 0.9989805 0.9987104
##  Accuracy     Kappa 
## 0.9857264 0.9819434
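
The expected out-of-sample error follows directly from the hold-out accuracy; a quick calculation with the objects above:

1-modelrf$overall["Accuracy"]  # about 0.001 for random forest
1-modelgbm$overall["Accuracy"] # about 0.014 for gbm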

Stacking Models

Given that both models already have very high accuracy rates, a combined model may yield even better predictive power, so we stack the two models together.

First, we check the correlation between the predictions of the two methods. A lower correlation is preferred, since it means each model contributes additional independent information. The correlation is 0.428, well below 0.75 (a commonly used threshold for “high correlation”), which implies that the two models capture different variation in the same dataset.

modelCor(results)
##            rf       gbm
## rf  1.0000000 0.4284244
## gbm 0.4284244 1.0000000
splom(results)

Then we prepare and stack the two models together. The combined model does improve on the random forest, but only marginally, from 0.999 to 0.9992, which is almost negligible.

We will use the combined model and the random forest model to predict the 20 cases in the test set. The difference is expected to be minimal.

pred_rf<-predict(Model_Fit, testing)
pred_gbm<-predict(Model_alt_Fit, testing)

predDF<-data.frame(pred_rf, pred_gbm, classe=testing$classe)

set.seed(12345)
combModFit<-train(classe~., method="rf", data=predDF, trControl=control, metric=metric)

confusionMatrix(testing$classe, predict(combModFit, predDF))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    1 1138    0    0    0
##          C    0    2 1024    0    0
##          D    0    0    2  962    0
##          E    0    0    0    0 1082
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9992         
##                  95% CI : (0.998, 0.9997)
##     No Information Rate : 0.2846         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9989         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9982   0.9981   1.0000   1.0000
## Specificity            1.0000   0.9998   0.9996   0.9996   1.0000
## Pos Pred Value         1.0000   0.9991   0.9981   0.9979   1.0000
## Neg Pred Value         0.9998   0.9996   0.9996   1.0000   1.0000
## Prevalence             0.2846   0.1937   0.1743   0.1635   0.1839
## Detection Rate         0.2845   0.1934   0.1740   0.1635   0.1839
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9997   0.9990   0.9988   0.9998   1.0000

Part IV. Predict the test set (20 cases)

As expected, the random forest model and the combined model give identical predictions for the 20 cases.

pred1<-predict(Model_Fit, prediction.set)

pred2<-predict(Model_alt_Fit, prediction.set)

predVDF<-data.frame(pred_rf=pred1, pred_gbm=pred2)
combpred<-predict(combModFit, predVDF)

#compare the prediction results of the random forest and the stacked model
sum(predVDF$pred_rf==combpred) # number of the 20 cases on which the two models agree
## [1] 20
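
The 20 predictions can also be laid out against their case numbers for the quiz submission (a small sketch; the case numbering 1 to 20 simply follows the row order of the test set):

data.frame(case=1:20, prediction=combpred)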

Appendix

PCA doesn’t help with model accuracy

Assuming the 20 cases are independent, a method needs an accuracy rate of about 0.995 for the probability that all 20 predictions are correct to reach 90%. We are aiming for an accuracy rate of at least 0.99.
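
The arithmetic behind that threshold, assuming independent cases:

0.995^20 # probability that all 20 predictions are correct at 99.5% accuracy, about 0.90
0.99^20  # at 99% accuracy the probability drops to about 0.82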

Given that we still have 53 predictors after removing the near zero variance variables, we try to further reduce the predictor set to improve running time without much loss of accuracy.

We examine the correlation matrix of the remaining predictors. There are 15 pairs with absolute correlation above 0.8 out of the 1378 (53*52/2) pairs in total (nrow(highcorr) returns 30, since each pair is counted twice), which is not a large proportion (only about 1%). We nevertheless apply principal component analysis to reduce the number of predictors. The upside is that the model run time shrinks when there are fewer variables, and the components are as close to independent as possible.

M<-abs(cor(training[, -54]))
diag(M)<-0
highcorr<-which(M>0.8, arr.ind=TRUE)
nrow(highcorr)

From the PCA, we find that 27 components are needed to capture 95% of the variance, so the number of variables is further reduced from 53 to 27. The model building below is based on these principal components.

preProc<-preProcess(training[, -54], method="pca") #can set argument "pcaComp=30"

trainPC<-predict(preProc, training[, -54])

testPC<-predict(preProc, testing[, -155]) #drop classe, the last variable in testing
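
Printing the preProcess object reports how many components were retained (27 under the default 95% variance threshold):

preProc # the printed summary shows the number of components needed to capture 95% of the variance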

Based on the 27 principal components, we run the random forest model. The accuracy of the final model is 0.977. It takes about 15 minutes to train the model.

When the number of principal components is increased to 30 (which captures 97.5% of the variance), the model accuracy barely improves, to 0.978 after retraining; a sketch of the 30-component preprocessing follows the model code below.

set.seed(12345)

modFit<-train(x=trainPC, y=training$classe, method="rf", trControl=control, metric=metric)

modFit$finalModel

confusionMatrix(testing$classe, predict(modFit, testPC))
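
To retrain with 30 components, as mentioned above, the preprocessing step can be repeated with pcaComp set explicitly. A sketch mirroring the calls above (preProc30, trainPC30 and testPC30 are illustrative names; the train() call is then rerun unchanged on trainPC30):

preProc30<-preProcess(training[, -54], method="pca", pcaComp=30)
trainPC30<-predict(preProc30, training[, -54])
testPC30<-predict(preProc30, testing[, -155])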

We try the gbm as well. The accuracy of the final model is 0.814.

set.seed(12345)
modFit_alt<-train(x=trainPC, y=training$classe, method="gbm", trControl=control, metric=metric, verbose=FALSE)

modFit_alt$finalModel

confusionMatrix(testing$classe, predict(modFit_alt, testPC))

SVM Model

Finally, we try the support vector machine method. It takes only 5 minutes to run and achieves an accuracy rate of 0.947 on the testing set, which is inferior to both the random forest and the gradient boosting models, so it is not used.

We first subset the testing set to the same columns as the training set, since the svm function requires the training and testing sets to have the same structure.

set.seed(12345)

library(e1071)

testingsvm<-testing[, row.names(nsv)[keep]]
dim(training); dim(testingsvm)

svmfit<-svm(classe~., data=training)
svmpred<-predict(svmfit, testingsvm[, -54]) # drop classe, the last column of testingsvm

confusionMatrix(testing$classe, svmpred)