library(caret)
train.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

OVERVIEW

This HAR analysis aims to predict the type of activity performed by subjects wearing a wearable computing device. The activities fall into 5 classes: A (the exercise performed exactly according to the specification), B (throwing the elbows to the front), C (lifting the dumbbell only halfway), D (lowering the dumbbell only halfway), and E (throwing the hips to the front).

The dataset provided by this project will be used to train a few classification models that attempt to predict the activity class. The results of each model will be evaluated against a separate cross-validation (testing) dataset that has not been included in training the models.

DATA PREVIEW

We begin by reading in the raw data provided by the project. The dataset has 159 potential predictors for the outcome variable “classe”. The “test” set here is the target set, where the outcome is unknown and we want to use our best model to predict it. The same variables eliminated from the raw dataset will need to be eliminated from this set as well.

raw <- read.csv(train.url, header = T, na.strings = c(""," ","#DIV/0!","NA"))
test <- read.csv(test.url, header = T, na.strings = c(""," ","#DIV/0!","NA"))
dim(raw)
## [1] 19622   160

We now explore these potential predictors to evaluate their suitability for inclusion in the classification exercise. We see that the first 5 columns consist of row and subject identifiers, names, and activity timestamps. We eliminate these because they aren’t likely contributors to the models. Further, since we want to train our models on as complete a dataset as possible, we drop any feature in which more than 75% of the values are missing.

raw <- raw[,-c(1:5)]
test <- test[,-c(1:5)]
# drop features that are mostly missing in the training data, and drop the
# same columns from the target set so both keep identical features
na.frac <- colSums(is.na(raw))/nrow(raw)
raw <- raw[, na.frac <= 0.75]
test <- test[, na.frac <= 0.75]

dim(raw)
## [1] 19622    55

Quite a few variables (100 of the 155 remaining) were eliminated because of extremely sparse data.

Data Partitioning

We’ll split the raw dataset into a training set and a hold-out test (cross-validation, “cv”) set. This ensures that the same features are used in training as well as during validation, while the model is trained only on the training set.

set.seed(31415)
trainInds <- createDataPartition(raw$classe, p=.7, list = F)
train <- raw[trainInds, ] 
cv <- raw[-trainInds, ]

To get an idea of how the dumbbell and other sensor features relate to the outcome, let’s look at a scatterplot of the total measurements against the classe outcome.

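The original figure is not reproduced here; a minimal sketch of how such a plot could be generated with caret, assuming the four total_accel_* columns are the “total measurements” in question:

total.cols <- grep("^total_accel", names(train), value = TRUE)  # total_accel_belt, _arm, etc.
featurePlot(x = train[, total.cols], y = train$classe, plot = "pairs")
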
We see that each of the total measurements has only a small correlation with the classe outcome. For example, while the total_accel_belt measurement shows many observations for class A, it doesn’t separate the other classes nearly as well.

PRE-PROCESSING, BUILDING AND TUNING MODELS

We now apply various classification algorithms to our training dataset. Each model will be trained in a parallelized fashion, with tuning parameters that enable repeated k-fold cross-validation. This will help determine the accuracy of the final model.

Pre-Processing Note: I pre-process the training set by centering and scaling before training each model, to account for the differences in scale between some features in the dataset. Moreover, features, especially discrete types, that exhibit near-zero variability will not inform the analysis. It’s therefore prudent to remove these features from the analysis, in addition to centering and scaling the training sets.
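
As a quick sanity check (separate from the model pipeline below), caret can report which features its ‘nzv’ pre-processing step would flag:

nzv.metrics <- nearZeroVar(train, saveMetrics = TRUE)  # per-feature variance diagnostics
nzv.metrics[nzv.metrics$nzv, ]                         # rows flagged as near-zero-variance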

Model 1: Recursive Partitioning

We’ll start by using a classification tree to try and find the decision points in the dataset. The aim here is to find the splits that best partition the data until we arrive at homogeneous activity groups. We use repeated cross-validation as our resampling scheme, so the reported accuracy is an average over multiple runs.

# enable parallelization 
library(doMC)
registerDoMC(cores=8)
system.time(
        rpart.fit <- train(classe~., data=train, preProcess=c('center','scale','nzv'), 
                           method="rpart",
                           trControl = trainControl(
                                   method='repeatedcv',
                                   number = 10,
                                   repeats = 3,
                                   allowParallel = TRUE
                           ))
)
##    user  system elapsed 
##  66.159   4.155  15.827
rpart.fit
## CART 
## 
## 13737 samples
##    54 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (53), scaled (53), remove (1) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 12363, 12364, 12363, 12363, 12363, 12364, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa       Accuracy SD  Kappa SD  
##   0.03143119  0.5876567  0.47416089  0.03490112   0.04524788
##   0.04919811  0.4487083  0.26122078  0.08233405   0.13435496
##   0.11463737  0.3180258  0.05167899  0.03925844   0.06015374
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03143119.

We see Recursive Partitioning has a very poor resampled classification accuracy of about 58.8% on the training set. Let’s now attempt a more time- and process-intensive ensemble approach.

Model 2: Random Forest

We’ll next run a random forest classifier to try and improve on the accuracy of the partitioning algorithm. Randomizing the bootstrap samples and the features considered at each split, across many trees, should offer better insight into the partitioning.

library(doMC)
registerDoMC(cores = 8)
system.time(
        rf.fit <- train(classe~., data=train, preProcess=c('center','scale','nzv'),
                        method="rf", 
                        trControl = trainControl(
                                method = "repeatedcv",
                                number = 5,
                                repeats = 5,
                                allowParallel = TRUE
                        )
        )
)
##     user   system  elapsed 
## 4395.296   38.526  730.785
rf.fit
## Random Forest 
## 
## 13737 samples
##    54 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (53), scaled (53), remove (1) 
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 10988, 10988, 10991, 10990, 10991, 10990, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9941328  0.9925783  0.001555464  0.001967966
##   28    0.9967825  0.9959302  0.001205562  0.001524911
##   54    0.9953410  0.9941068  0.001405872  0.001778254
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 28.

The random forest’s accuracy is a big improvement over the partitioning algorithm run earlier. The accuracy behaved as follows under repeated cross-validation:
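
The original resampling-profile figure is not reproduced here; it can be regenerated from the fitted object, e.g.:

plot(rf.fit)  # resampled accuracy against mtry, the number of randomly selected predictors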

From this, taking the average across all tuning runs, we get an expected accuracy of about 99.54% from the random forest algorithm. This gives us an expected out-of-sample error rate of about 0.46%.
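
A sketch of where those figures come from (the expected error is simply 1 minus the expected accuracy):

mean(rf.fit$results$Accuracy)      # ~0.9954, averaged over the mtry values tried
1 - mean(rf.fit$results$Accuracy)  # ~0.0046 expected out-of-sample error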

Model 3: Support Vector Machine

Now, since we have a good chunk of data with dozens of features, a Support Vector Machine classifier might help in determining a good classification boundary between the various outcome classes. The support vector machine goes beyond logistic regression by using the Gaussian (radial basis function) kernel; it constructs a set of hyperplanes in a high-dimensional space, which can be used for classification. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class. The “tuneLength” parameter expands the grid of regularization (cost) values that the SVM is run against (here, C = 0.25 up to 4, doubling at each step).

library(doMC)
registerDoMC(cores=8)
system.time(
        svm.fit <- train(classe~., data=train,
                         preProcess=c('center','scale','nzv'),
                         method='svmRadial',
                         tuneLength=5,
                         trControl = trainControl(
                                method = "repeatedcv",
                                number = 10,
                                repeats = 5,
                                allowParallel = TRUE
                                )
                         )
        
)
##     user   system  elapsed 
## 8410.957  618.308 1300.648
svm.fit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 13737 samples
##    54 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (53), scaled (53), remove (1) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 12363, 12364, 12363, 12364, 12365, 12364, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa      Accuracy SD  Kappa SD   
##   0.25  0.8659822  0.8304843  0.010039178  0.012691482
##   0.50  0.8994834  0.8727011  0.008348098  0.010581001
##   1.00  0.9265193  0.9069354  0.007312224  0.009268575
##   2.00  0.9496681  0.9362569  0.006584509  0.008345246
##   4.00  0.9683191  0.9598865  0.004588395  0.005813084
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01352209
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.01352209 and C = 4.

Finally, we see how the SVM behaves under repeated cross-validation as a function of its regularization parameter, the cost:
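
Again, the original figure is not reproduced here; it can be regenerated from the fitted object, e.g.:

plot(svm.fit)  # resampled accuracy against the cost parameter C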

At the optimal regularization parameter, the SVM achieves an expected accuracy of about 96.83%. This gives us an expected out-of-sample error rate of about 3.17%.
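
As before, the expected error is 1 minus the best resampled accuracy:

max(svm.fit$results$Accuracy)      # ~0.9683, at C = 4
1 - max(svm.fit$results$Accuracy)  # ~0.0317 expected out-of-sample error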

PREDICTING AND VALIDATING RESULTS

Confusion Matrices - Out of Sample Performance

We’ll now run each of the fitted models against the untouched test (cross-validation) set to generate confusion matrices and compute the true out-of-sample error for each algorithm.

pred.rpart <- predict(rpart.fit, newdata=cv)
pred.rf <- predict(rf.fit, newdata=cv)
pred.svm <- predict(svm.fit, newdata=cv)

confusionMatrix(pred.rpart, cv$classe) # CART, recursive partitioning
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1345  227  120  221   39
##          B   70  516   75  225  198
##          C  239  396  831  471  150
##          D    0    0    0    0    0
##          E   20    0    0   47  695
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5755          
##                  95% CI : (0.5628, 0.5882)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4588          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8035  0.45303   0.8099   0.0000   0.6423
## Specificity            0.8559  0.88032   0.7415   1.0000   0.9861
## Pos Pred Value         0.6890  0.47601   0.3982      NaN   0.9121
## Neg Pred Value         0.9163  0.87024   0.9487   0.8362   0.9245
## Prevalence             0.2845  0.19354   0.1743   0.1638   0.1839
## Detection Rate         0.2285  0.08768   0.1412   0.0000   0.1181
## Detection Prevalence   0.3317  0.18420   0.3546   0.0000   0.1295
## Balanced Accuracy      0.8297  0.66667   0.7757   0.5000   0.8142
confusionMatrix(pred.rf, cv$classe) # Random Forest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    4    0    0    0
##          B    0 1131    2    0    0
##          C    0    4 1024    2    0
##          D    0    0    0  962    0
##          E    0    0    0    0 1082
## 
## Overall Statistics
##                                           
##                Accuracy : 0.998           
##                  95% CI : (0.9964, 0.9989)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9974          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9930   0.9981   0.9979   1.0000
## Specificity            0.9991   0.9996   0.9988   1.0000   1.0000
## Pos Pred Value         0.9976   0.9982   0.9942   1.0000   1.0000
## Neg Pred Value         1.0000   0.9983   0.9996   0.9996   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1922   0.1740   0.1635   0.1839
## Detection Prevalence   0.2851   0.1925   0.1750   0.1635   0.1839
## Balanced Accuracy      0.9995   0.9963   0.9984   0.9990   1.0000
confusionMatrix(pred.svm, cv$classe) # Support Vector Machine
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673   52    0    1    0
##          B    0 1074   10    0    0
##          C    1    7 1009   67    6
##          D    0    0    5  896   10
##          E    0    6    2    0 1066
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9716          
##                  95% CI : (0.9671, 0.9757)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9641          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9429   0.9834   0.9295   0.9852
## Specificity            0.9874   0.9979   0.9833   0.9970   0.9983
## Pos Pred Value         0.9693   0.9908   0.9257   0.9835   0.9926
## Neg Pred Value         0.9998   0.9865   0.9965   0.9863   0.9967
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1825   0.1715   0.1523   0.1811
## Detection Prevalence   0.2933   0.1842   0.1852   0.1548   0.1825
## Balanced Accuracy      0.9934   0.9704   0.9834   0.9632   0.9918

We see that the Random Forest algorithm, with an accuracy of about 99.80% on the test set, causes the least out-of-sample error (about 0.20%), when compared to SVM and Recursive Partitioning. This is within (indeed better than) our previously anticipated out-of-sample error. We therefore select the Random Forest solution (model #2) as the final model.
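
The quoted figures can be computed directly from the hold-out predictions:

mean(pred.rf == cv$classe)      # hold-out accuracy, ~0.998
1 - mean(pred.rf == cv$classe)  # hold-out (out-of-sample) error, ~0.002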

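The output below appears to come from printing the fitted forest itself; a line such as the following (not shown in the original) would produce it, including the out-of-bag (OOB) error estimate:

rf.fit$finalModel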
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 28
## 
##         OOB estimate of  error rate: 0.23%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3902    3    0    0    1 0.001024066
## B    6 2648    3    1    0 0.003762227
## C    0    3 2393    0    0 0.001252087
## D    0    0   10 2241    1 0.004884547
## E    0    1    0    3 2521 0.001584158

End.