library(caret)
train.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
This Human Activity Recognition (HAR) analysis aims to predict the type of activity performed by subjects wearing a wearable computing device. The activities fall into 5 classes, labeled A through E in the data.
The dataset provided by this project will be used to train a few classification models that predict the activity class. Each model will be evaluated against a separate cross validation (testing) dataset that was not used in training the models.
We begin by reading in the raw data provided by the project. The dataset has 160 columns: 159 potential predictors plus the outcome variable “classe”. The “test” set here is the target set, where the outcome is unknown and on which we want to predict using our best solution. The same variables must be removed from this set as from the raw dataset.
# Treat blanks, spreadsheet division errors and literal "NA" as missing values
raw <- read.csv(train.url, header = TRUE, na.strings = c(""," ","#DIV/0!","NA"))
test <- read.csv(test.url, header = TRUE, na.strings = c(""," ","#DIV/0!","NA"))
dim(raw)
## [1] 19622 160
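To illustrate what the `na.strings` argument does during the read, here is a minimal sketch on a tiny inline CSV. The column names and values are hypothetical and only stand in for the real file.

```r
# Toy CSV text standing in for the real file (hypothetical columns and values)
csv_text <- "reading,label\n1.5,up\n#DIV/0!,\n,down\nNA,up\n"

toy <- read.csv(text = csv_text, header = TRUE,
                na.strings = c("", " ", "#DIV/0!", "NA"))

str(toy)             # 'reading' is parsed as numeric despite the '#DIV/0!' entry
colSums(is.na(toy))  # number of missing values per column
```

Without the extra `na.strings` entries, the `#DIV/0!` marker would force the whole column to be read as text.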
We now explore these potential predictors to evaluate their suitability for inclusion in the classification exercise. The first 5 columns consist of subject ids, names, and activity timestamps. We eliminate these because they aren’t likely to contribute to the models. Further, since we want to train our models on as complete a dataset as possible, we drop every feature for which more than 75% of the values are missing.
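# Drop the first 5 columns (row ids, subject names, timestamps)
raw <- raw[, -c(1:5)]
test <- test[, -c(1:5)]
# Keep columns where at most 75% of the values are missing; logical indexing
# stays correct even when no column exceeds the threshold
raw <- raw[, colMeans(is.na(raw)) <= 0.75]
test <- test[, colMeans(is.na(test)) <= 0.75]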
dim(raw)
## [1] 19622 55
Of the 155 columns left after dropping the identifiers, 100 were eliminated because of extremely sparse data.
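The sparsity filter can be sketched on a toy data frame as follows. The column names are hypothetical; the 75% threshold is the one used in this analysis.

```r
# Toy data frame with hypothetical columns of varying sparsity
toy <- data.frame(
  dense  = c(1, 2, 3, 4, 5),
  sparse = c(NA, NA, NA, NA, 5),   # 80% missing
  half   = c(1, NA, 3, NA, 5)      # 40% missing
)

na_frac  <- colMeans(is.na(toy))   # fraction of missing values per column
keep     <- na_frac <= 0.75        # logical mask of columns to retain
filtered <- toy[, keep, drop = FALSE]

na_frac
names(filtered)
```

A logical mask is used here rather than `-which(...)` because negating an empty index vector selects zero columns, which would silently empty the data frame if no column crossed the threshold.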
We’ll split the raw dataset into a training set and a test (cross validation, “cv”) set. Because both are drawn from the same filtered data, the same features are used in training and in validation, while the model is fitted only on the training set.
set.seed(31415)
trainInds <- createDataPartition(raw$classe, p = .7, list = FALSE)
train <- raw[trainInds, ]
cv <- raw[-trainInds, ]
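The idea behind a class-stratified 70/30 split can be sketched in base R on toy labels. This is an illustration of the concept only, not caret's implementation, and the labels below are made up.

```r
set.seed(31415)

# Hypothetical outcome labels with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class separately
train_idx <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

prop.table(table(labels))             # class proportions in the full data
prop.table(table(labels[train_idx]))  # class proportions in the training split
```

Sampling within each class keeps the class proportions of the training split close to those of the full dataset, which matters when some classes are rarer than others.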
To get an idea of how the dumbbell orientation features relate to the outcome, let’s look at a scatterplot of the total measurements against the classe outcome.
We see that each of the total measurements has only a small correlation with the classe outcome. For example, while the total_accel_belt measurement has a lot of data indicating classe A, it is not so well correlated with the other classes.
We now apply various classification algorithms to our training dataset. Each model will be trained in parallel, with training controls that enable repeated k-fold cross validation. This will help estimate the accuracy of the final model.
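As a rough sketch of what repeated k-fold cross validation means, the following base R code assigns each observation to a held-out fold, once per repeat. The sizes are hypothetical, and caret's own fold construction may differ in detail (for instance in how it balances classes across folds).

```r
set.seed(31415)
n <- 20        # number of observations (hypothetical)
k <- 5         # number of folds
repeats <- 3   # number of times the whole k-fold procedure is repeated

# One column per repeat; each entry is the fold in which that row is held out
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

dim(fold_ids)          # n rows, one column per repeat
table(fold_ids[, 1])   # every fold holds out n / k rows

# Rows used for fitting when fold 1 of repeat 1 is held out
train_rows <- which(fold_ids[, 1] != 1)
length(train_rows)
```

Each repeat reshuffles the fold assignment, so the model is fitted and scored k times per repeat and the reported accuracy is the average over all of those fits.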
Pre-Processing Note: We pre-process the training set by centering and scaling before training each model. This accounts for the differences in scale between some features in the dataset. Moreover, features that exhibit near zero variability, especially discrete ones, will not inform the analysis. It’s therefore prudent to remove these features, in addition to centering and scaling the training sets.
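The two pre-processing ideas can be sketched in base R on toy data. The helper `is_near_zero_var` is hypothetical, and its thresholds are assumptions chosen to resemble caret's documented defaults. This is not the code that caret runs internally.

```r
set.seed(1)
n <- 100

# Hypothetical features: two continuous ones and one that is almost constant
toy <- data.frame(
  accel  = rnorm(n, mean = 10, sd = 2),
  gyro   = rnorm(n, mean = 0.2, sd = 0.05),
  window = c(rep(0, n - 1), 1)
)

# Flag a column when its most common value dominates and it has few distinct values
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)            # constant column
  freq_ratio <- counts[[1]] / counts[[2]]         # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)  # percentage of distinct values
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

nzv    <- vapply(toy, is_near_zero_var, logical(1))
kept   <- toy[, !nzv, drop = FALSE]
scaled <- scale(kept)   # subtract each column mean, divide by each column sd

names(kept)
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

After scaling, every retained column has mean 0 and standard deviation 1, so no feature dominates a distance-based method such as the SVM purely because of its units.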
We’ll start by using a classification tree to try to find the decision points in the dataset. The aim is to find the splits that best partition the data until we arrive at homogeneous activities. We use repeated cross validation as our resampling scheme, so the reported accuracy is an average over multiple runs.
# enable parallelization
library(doMC)
registerDoMC(cores=8)
system.time(
  rpart.fit <- train(classe~., data=train, preProcess=c('center','scale','nzv'),
                     method="rpart",
                     trControl = trainControl(
                       method='repeatedcv',
                       number = 10,
                       repeats = 3,
                       allowParallel = TRUE
                     ))
)
##    user  system elapsed
##  66.159   4.155  15.827
rpart.fit
## CART
##
## 13737 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (53), scaled (53), remove (1)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 12363, 12364, 12363, 12363, 12363, 12364, ...
## Resampling results across tuning parameters:
##
##   cp          Accuracy   Kappa       Accuracy SD  Kappa SD
##   0.03143119  0.5876567  0.47416089  0.03490112   0.04524788
##   0.04919811  0.4487083  0.26122078  0.08233405   0.13435496
##   0.11463737  0.3180258  0.05167899  0.03925844   0.06015374
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03143119.
We see that recursive partitioning has a very poor cross-validated accuracy of 58.77% on the training set. Let’s now attempt a more time- and compute-intensive ensemble approach.
We’ll next run a random forest classifier to try to improve on the accuracy of the partitioning algorithm. Perhaps growing many trees, each on a bootstrap sample of the data with a random subset of features considered at each split, will offer better insight into the partitioning.
library(doMC)
registerDoMC(cores = 8)
system.time(
  rf.fit <- train(classe~., data=train, preProcess=c('center','scale','nzv'),
                  method="rf",
                  trControl = trainControl(
                    method = "repeatedcv",
                    number = 5,
                    repeats = 5,
                    allowParallel = TRUE
                  )
  )
)
##     user   system  elapsed
## 4395.296   38.526  730.785
rf.fit
## Random Forest
##
## 13737 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (53), scaled (53), remove (1)
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 10988, 10988, 10991, 10990, 10991, 10990, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
##    2    0.9941328  0.9925783  0.001555464  0.001967966
##   28    0.9967825  0.9959302  0.001205562  0.001524911
##   54    0.9953410  0.9941068  0.001405872  0.001778254
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 28.
The random forest’s accuracy is a big improvement over the partitioning algorithm run earlier. Under repeated cross validation, the accuracy behaved as follows:
From this, taking the average across all runs, we get an expected accuracy of 99.5419% from the random forest algorithm. This gives us an expected out-of-sample error rate of 0.4581%.
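The averaging above can be reproduced from the three cross-validated accuracies in the tuning table. Note that one minus the mean accuracy is a proportion (about 0.0046), so it has to be multiplied by 100 to be quoted as a percentage.

```r
# Cross-validated accuracies for mtry = 2, 28 and 54, copied from the table above
acc <- c(0.9941328, 0.9967825, 0.9953410)

mean_acc  <- mean(acc)             # average accuracy as a proportion
error_pct <- 100 * (1 - mean_acc)  # expected out-of-sample error as a percentage

round(100 * mean_acc, 4)   # 99.5419
round(error_pct, 4)        # 0.4581
```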
Now, since we have a good chunk of data with more than 10 to 20 features, a Support Vector Machine classifier might help in determining a good classification boundary between the various outcome classes. The support vector machine improves on logistic regression by using the Gaussian kernel; it constructs a set of hyperplanes in a high dimensional space, which can be used for classification. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class. The “tuneLength” parameter sets how many values of the regularization (cost) parameter C are tried; with tuneLength = 5, the values evaluated here are 0.25, 0.5, 1, 2 and 4.
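For intuition, the Gaussian (radial basis function) kernel can be written in a few lines. This sketch assumes the parameterization k(x, y) = exp(-sigma * ||x - y||^2), and the three points are hypothetical. The sigma value is the one reported in the model output for this analysis.

```r
# Gaussian (RBF) kernel: similarity decays with squared Euclidean distance
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

sigma <- 0.01352209   # value reported for the fitted SVM

p <- c(1, 0, 2)
q <- c(1, 0, 2)       # identical to p
r <- c(4, 4, 2)       # squared distance from p is 9 + 16 = 25

rbf_kernel(p, q, sigma)   # identical points give a similarity of exactly 1
rbf_kernel(p, r, sigma)   # distant points give a similarity below 1
```

The kernel value acts as a similarity score between two observations, and a larger sigma makes that similarity fall off faster with distance.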
library(doMC)
registerDoMC(cores=8)
system.time(
  svm.fit <- train(classe~., data=train,
                   preProcess=c('center','scale','nzv'),
                   method='svmRadial',
                   tuneLength=5,
                   trControl = trainControl(
                     method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     allowParallel = TRUE
                   )
  )
)
##     user   system  elapsed
## 8410.957  618.308 1300.648
svm.fit
## Support Vector Machines with Radial Basis Function Kernel
##
## 13737 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (53), scaled (53), remove (1)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 12363, 12364, 12363, 12364, 12365, 12364, ...
## Resampling results across tuning parameters:
##
##   C     Accuracy   Kappa      Accuracy SD  Kappa SD
##   0.25  0.8659822  0.8304843  0.010039178  0.012691482
##   0.50  0.8994834  0.8727011  0.008348098  0.010581001
##   1.00  0.9265193  0.9069354  0.007312224  0.009268575
##   2.00  0.9496681  0.9362569  0.006584509  0.008345246
##   4.00  0.9683191  0.9598865  0.004588395  0.005813084
##
## Tuning parameter 'sigma' was held constant at a value of 0.01352209
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01352209 and C = 4.
Finally, we see how the SVM’s accuracy under repeated cross validation varies with its regularization parameter, the cost:
We see that an expected accuracy of 96.83% from the SVM is reasonable at the best regularization parameter tried (C = 4). This gives us an expected out-of-sample error rate of 3.17%.
We’ll now run each of the solutions against the untouched test (cross validation) set to generate confusion matrices and obtain true out-of-sample errors for each algorithm.
pred.rpart <- predict(rpart.fit, newdata=cv)
pred.rf <- predict(rf.fit, newdata=cv)
pred.svm <- predict(svm.fit, newdata=cv)
confusionMatrix(pred.rpart, cv$classe) # CART, recursive partitioning
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1345  227  120  221   39
##          B   70  516   75  225  198
##          C  239  396  831  471  150
##          D    0    0    0    0    0
##          E   20    0    0   47  695
##
## Overall Statistics
##
## Accuracy : 0.5755
## 95% CI : (0.5628, 0.5882)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4588
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8035  0.45303   0.8099   0.0000   0.6423
## Specificity            0.8559  0.88032   0.7415   1.0000   0.9861
## Pos Pred Value         0.6890  0.47601   0.3982      NaN   0.9121
## Neg Pred Value         0.9163  0.87024   0.9487   0.8362   0.9245
## Prevalence             0.2845  0.19354   0.1743   0.1638   0.1839
## Detection Rate         0.2285  0.08768   0.1412   0.0000   0.1181
## Detection Prevalence   0.3317  0.18420   0.3546   0.0000   0.1295
## Balanced Accuracy      0.8297  0.66667   0.7757   0.5000   0.8142
confusionMatrix(pred.rf, cv$classe) # Random Forest
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    4    0    0    0
##          B    0 1131    2    0    0
##          C    0    4 1024    2    0
##          D    0    0    0  962    0
##          E    0    0    0    0 1082
##
## Overall Statistics
##
## Accuracy : 0.998
## 95% CI : (0.9964, 0.9989)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9974
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9930   0.9981   0.9979   1.0000
## Specificity            0.9991   0.9996   0.9988   1.0000   1.0000
## Pos Pred Value         0.9976   0.9982   0.9942   1.0000   1.0000
## Neg Pred Value         1.0000   0.9983   0.9996   0.9996   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1922   0.1740   0.1635   0.1839
## Detection Prevalence   0.2851   0.1925   0.1750   0.1635   0.1839
## Balanced Accuracy      0.9995   0.9963   0.9984   0.9990   1.0000
confusionMatrix(pred.svm, cv$classe) # Support Vector Machine
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1673   52    0    1    0
##          B    0 1074   10    0    0
##          C    1    7 1009   67    6
##          D    0    0    5  896   10
##          E    0    6    2    0 1066
##
## Overall Statistics
##
## Accuracy : 0.9716
## 95% CI : (0.9671, 0.9757)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9641
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9429   0.9834   0.9295   0.9852
## Specificity            0.9874   0.9979   0.9833   0.9970   0.9983
## Pos Pred Value         0.9693   0.9908   0.9257   0.9835   0.9926
## Neg Pred Value         0.9998   0.9865   0.9965   0.9863   0.9967
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1825   0.1715   0.1523   0.1811
## Detection Prevalence   0.2933   0.1842   0.1852   0.1548   0.1825
## Balanced Accuracy      0.9934   0.9704   0.9834   0.9632   0.9918
We see that the Random Forest algorithm, with 99.80% accuracy on the test (cross validation) set, has the lowest out-of-sample error (0.20%) when compared to SVM (2.84%) and Recursive Partitioning (42.45%). This is within our previously anticipated out-of-sample error. We therefore select the Random Forest solution (model #2) as the final model.
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 28
##
## OOB estimate of error rate: 0.23%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3902    3    0    0    1 0.001024066
## B    6 2648    3    1    0 0.003762227
## C    0    3 2393    0    0 0.001252087
## D    0    0   10 2241    1 0.004884547
## E    0    1    0    3 2521 0.001584158
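As a cross-check, the random forest’s out-of-sample accuracy and error on the cv set can be recomputed by hand from its confusion matrix. The counts below are copied from the `confusionMatrix(pred.rf, cv$classe)` output. The variable names are only illustrative.

```r
# Random forest confusion matrix on the cv set (rows: prediction, columns: reference)
cm <- matrix(c(1674,    4,    0,   0,    0,
                  0, 1131,    2,   0,    0,
                  0,    4, 1024,   2,    0,
                  0,    0,    0, 962,    0,
                  0,    0,    0,   0, 1082),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions over all predictions
error_pct   <- 100 * (1 - accuracy)      # out-of-sample error as a percentage
sensitivity <- diag(cm) / colSums(cm)    # per-class recall

round(accuracy, 4)       # 0.998
round(error_pct, 4)      # 0.2039
round(sensitivity, 4)
```

Only 12 of the 5885 cv observations are misclassified, which is where the error of roughly 0.2% comes from.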