Author: Brad Cable
Illinois State University
For Practical Machine Learning - Johns Hopkins University (Coursera)
The data analyzed here was originally collected and studied by Wallace Ugulino, Eduardo Velloso, and Hugo Fuks; their paper and data are available here:
http://groupware.les.inf.puc-rio.br/har
The core of their research, and the purpose of machine learning in their context, is the following:
This human activity recognition research has traditionally focused on discriminating between different activities, i.e. to predict “which” activity was performed at a specific point in time (like with the Daily Living Activities dataset above). The approach we propose for the Weight Lifting Exercises dataset is to investigate “how (well)” an activity was performed by the wearer. The “how (well)” investigation has only received little attention so far, even though it potentially provides useful information for a large variety of applications, such as sports training.
This report attempts to build a classifier that performs a similar analysis: classifying by how well an individual performs an action rather than which action is being performed.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(rpart)
library(ggplot2)
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
set.seed(43121)
The following function calculates the proportion of missing values (either `NA` or an empty string) in a given column:
propMissing <- function(df, name){
  vect <- df[[name]]
  sum(is.na(vect) | vect == "") / length(vect)
}
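As a quick sanity check, the function counts both `NA` and empty strings as missing. A minimal sketch on a small made-up data frame (not the actual dataset):

```r
propMissing <- function(df, name){
  vect <- df[[name]]
  sum(is.na(vect) | vect == "") / length(vect)
}

# Hypothetical example: one numeric column with one NA,
# one character column with two empty strings
demo <- data.frame(a = c(1, NA, 3, 4),
                   b = c("x", "", "", "y"),
                   stringsAsFactors = FALSE)

propMissing(demo, "a")  # 0.25
propMissing(demo, "b")  # 0.5
```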
Calculating the proportion of missing values for each field:
fieldsMissing <- sapply(names(training), function(name){ propMissing(training, name)})
Plotting the proportion of missing values shows that every variable falls into one of only two levels of missingness.
plot(fieldsMissing, ylab="Proportion Missing")
Looking at the levels:
lvls <- levels(factor(fieldsMissing))
lvls
## [1] "0" "0.979308938946081"
Either a variable has all of its values present, or it has a missing-value proportion of 0.9793, which is quite high.
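To put that proportion in perspective: with 19622 rows in the training set (as shown in the `str()` output later in this report), each sparse column carries only about 406 actual values. A quick back-of-the-envelope check:

```r
n_rows <- 19622                    # rows in pml-training.csv
prop_missing <- 0.979308938946081  # proportion observed above

# roughly 406 populated rows per sparse column
round(n_rows * (1 - prop_missing))
```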
My hypothesis is that these sparse fields are all missing in exactly the same records, with the same values absent in each case. To test this:
First we need to subset the original training set and view only the fields that are theorized to be missing under the same circumstances:
whichFieldsMissing <- fieldsMissing
names(whichFieldsMissing) <- NULL
whichFieldsMissing <- whichFieldsMissing != 0
sketchyTrainingFields <- training[whichFieldsMissing]
Now we calculate, per column, which records are missing:
simpleSketchyTrainingFields <- sapply(names(sketchyTrainingFields), function(name){
  is.na(sketchyTrainingFields[[name]]) |
    sketchyTrainingFields[[name]] == ""
})
Then loop through the columns to see whether they are all equal:
allSame <- TRUE
for (i in 2:ncol(simpleSketchyTrainingFields)){
  if (any(simpleSketchyTrainingFields[, 1] != simpleSketchyTrainingFields[, i])){
    allSame <- FALSE
    break
  }
}
allSame
## [1] TRUE
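For reference, the same all-columns-identical check can be written without an explicit loop. A minimal sketch on a made-up logical matrix (not the actual sensor data):

```r
# Hypothetical missingness matrix: three columns sharing an identical pattern
m <- cbind(c(TRUE, FALSE, TRUE),
           c(TRUE, FALSE, TRUE),
           c(TRUE, FALSE, TRUE))

# TRUE when every column matches the first column exactly;
# m[, 1] recycles down each column of the comparison
allSameCompact <- all(m == m[, 1])
allSameCompact  # TRUE
```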
The result confirms that the missing values occur in exactly the same records across all of these fields.
With so little actual data in these fields, a model would likely end up training on the mere presence of these values rather than on the values themselves. It may therefore be useful to provide the presence as a feature while withholding the values, which could otherwise produce strange results.
To do this, we can take any one of the fields that follows this missingness pattern and convert it into a single boolean field indicating the presence of data. This lets the model learn from the presence of the measurements without skewing toward values that occur too infrequently to train on.
First we remove those fields from the training set.
cleanTraining <- training[!whichFieldsMissing]
Now we can add a new field that is the presence of these fields or not:
cleanTraining$presence <- !is.na(training$max_roll_belt)
Taking a look at the timestamps provided, there are some strange aspects to them. The fields in question are “raw_timestamp_part_1”, “raw_timestamp_part_2”, and “cvtd_timestamp”.
Looking at this variable, you can see that there is very little variability and that the rows can be grouped into a small number of timeframes.
lvls <- levels(training$cvtd_timestamp)
lvls
## [1] "02/12/2011 13:32" "02/12/2011 13:33" "02/12/2011 13:34"
## [4] "02/12/2011 13:35" "02/12/2011 14:56" "02/12/2011 14:57"
## [7] "02/12/2011 14:58" "02/12/2011 14:59" "05/12/2011 11:23"
## [10] "05/12/2011 11:24" "05/12/2011 11:25" "05/12/2011 14:22"
## [13] "05/12/2011 14:23" "05/12/2011 14:24" "28/11/2011 14:13"
## [16] "28/11/2011 14:14" "28/11/2011 14:15" "30/11/2011 17:10"
## [19] "30/11/2011 17:11" "30/11/2011 17:12"
With only 20 levels at minute resolution, far more distinct values would be expected here.
lvls <- length(levels(factor(training$raw_timestamp_part_1)))
lvls
## [1] 837
This field has a bit more variability, with 837 levels, but the format appears to have a strange property: the first half of the values appears to be related in some way to the cvtd_timestamp value, while the second half appears to be a count. It would be difficult to display and analyze all 837 values here, and reverse engineering this field is not worthwhile when there are even stranger aspects to the rest of these fields.
lvls <- length(levels(factor(training$raw_timestamp_part_2)))
lvls
## [1] 16783
With 16783 levels, the variability is too high to predict anything of significance. Moreover, if the value is in microseconds, it can do little to predict how well the individual performs the action, since the information would essentially be random noise.
Because the time information is essentially either too low or too high in precision, all time data will be stripped out of the model.
cleanTraining$cvtd_timestamp <- NULL
cleanTraining$raw_timestamp_part_1 <- NULL
cleanTraining$raw_timestamp_part_2 <- NULL
X appears to be a simple row counter, which does us no good since this isn't really time-series data: timestamps exist, but neither their order nor their values are particularly relevant. We'll strip this variable out as well.
head(cleanTraining$X, 100)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
cleanTraining$X <- NULL
By calculating the variable importance of a small subset of the data using a random forest, we can gather more information about what variables are useful and what variables are not.
To start, we subset the training data into 1% of the original for a quick analysis of the data.
quickPartition <- createDataPartition(cleanTraining$classe, p=0.01, list=FALSE)
Now we generate a random forest model on the smaller dataset:
subsetTest <- cleanTraining[quickPartition,]
subsetModelRF <- train(classe ~ ., data=subsetTest, method="rf")
Calculating variable importance, cleaning it up, and graphing the results:
importance_graph <- function(model){
  subsetImportance <- varImp(model)
  importanceGraphData <- subsetImportance$importance
  importanceGraphData$Name <- row.names(importanceGraphData)
  importanceGraphData <- importanceGraphData[order(-importanceGraphData[[1]]), ]
  importanceGraphData$Name <- factor(importanceGraphData$Name, levels=importanceGraphData$Name)
  importanceGraphData <- importanceGraphData[importanceGraphData$Overall > 0, ]
  g <- ggplot(data.frame(x=importanceGraphData$Overall, y=importanceGraphData$Name), aes(y=y, x=x))
  g + geom_point() + xlab("Overall Importance Factor") + ylab("Name")
}
importance_graph(subsetModelRF)
From this analysis we can see that the user (specifically, who is exercising) has very little predictive value. Note also that the new_window variable has very little predictive value, as does our calculated presence field for the missing data. This should have been expected given the proportion of missing values to actual information, so the hypothesis that presence could improve a prediction model is false.
These variables, whose importance factors drop off sharply and fall below 1, can also be stripped out of the training data.
cleanTraining$new_window <- NULL
cleanTraining$user_name <- NULL
cleanTraining$presence <- NULL
While a random forest worked, it would be interesting to see how a CART model performs on this data. Using the same subset, we can take a quick look before committing a large portion of the data to the final models.
Keep in mind that since we are re-using the subsetTest variable, this model still includes the user_name, new_window, and presence variables, but this should not affect the overall exploratory model in the long run.
subsetModelCART <- train(classe ~ ., data=subsetTest, method="rpart")
The CART model itself looks like the following, with noticeably fewer variables and not very many leaves in the model. For CART-based classification trees this is actually a fairly good sign, as too many variables could produce a large amount of overfitting:
subsetModelCART$finalModel
## n= 199
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 199 143 A (0.28 0.19 0.18 0.17 0.19)
## 2) roll_belt< 130.5 181 125 A (0.31 0.21 0.19 0.18 0.1)
## 4) magnet_arm_x< 71 73 31 A (0.58 0.14 0.18 0.068 0.041) *
## 5) magnet_arm_x>=71 108 80 B (0.13 0.26 0.2 0.26 0.15)
## 10) pitch_forearm< 24.9 74 51 B (0.15 0.31 0.3 0.11 0.14) *
## 11) pitch_forearm>=24.9 34 14 D (0.088 0.15 0 0.59 0.18) *
## 3) roll_belt>=130.5 18 0 E (0 0 0 0 1) *
An importance graph, as used above, of the same data:
importance_graph(subsetModelCART)
Taking a look at the final cleaned training data to verify its structure and to get a general sense of what we are using:
str(cleanTraining)
## 'data.frame': 19622 obs. of 54 variables:
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 0 -0.02 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
## $ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
## $ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
## $ magnet_dumbbell_x : int -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
## $ magnet_dumbbell_y : int 293 296 298 303 292 294 295 300 292 291 ...
## $ magnet_dumbbell_z : num -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
## $ roll_forearm : num 28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
## $ gyros_forearm_y : num 0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
## $ gyros_forearm_z : num -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
## $ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
## $ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
## $ magnet_forearm_x : int -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
## $ magnet_forearm_y : num 654 661 658 658 655 660 659 660 653 656 ...
## $ magnet_forearm_z : num 476 473 469 469 473 478 470 474 476 473 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
We will use K-folds (k = 6) to generate training and validation data from the original training dataset, producing accuracy estimates before touching the final testing data provided.
finalModelFolds <- createFolds(cleanTraining$classe, k=6, list=FALSE)
rfTrainData <- cleanTraining[finalModelFolds==1,]
rfPredictData <- cleanTraining[finalModelFolds==2,]
rfModel <- train(classe ~ ., data=rfTrainData, method="rf")
rfPredictions <- predict(rfModel, newdata=rfPredictData)
rfConfusion <- confusionMatrix(rfPredictions, reference=rfPredictData$classe)
rfConfusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 929 9 0 1 1
## B 0 609 17 0 4
## C 1 14 551 12 5
## D 0 0 2 522 6
## E 0 0 0 1 585
##
## Overall Statistics
##
## Accuracy : 0.9777
## 95% CI : (0.972, 0.9825)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9717
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9989 0.9636 0.9667 0.9739 0.9734
## Specificity 0.9953 0.9920 0.9881 0.9971 0.9996
## Pos Pred Value 0.9883 0.9667 0.9451 0.9849 0.9983
## Neg Pred Value 0.9996 0.9913 0.9929 0.9949 0.9940
## Prevalence 0.2845 0.1933 0.1744 0.1640 0.1838
## Detection Rate 0.2842 0.1863 0.1686 0.1597 0.1790
## Detection Prevalence 0.2875 0.1927 0.1783 0.1621 0.1793
## Balanced Accuracy 0.9971 0.9778 0.9774 0.9855 0.9865
cartTrainData <- cleanTraining[finalModelFolds==3,]
cartPredictData <- cleanTraining[finalModelFolds==4,]
cartModel <- train(classe ~ ., data=cartTrainData, method="rpart")
cartPredictions <- predict(cartModel, newdata=cartPredictData)
cartConfusion <- confusionMatrix(cartPredictions, reference=cartPredictData$classe)
cartConfusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 858 340 275 307 74
## B 0 129 10 10 9
## C 66 164 285 192 166
## D 0 0 0 0 0
## E 6 0 0 27 352
##
## Overall Statistics
##
## Accuracy : 0.4966
## 95% CI : (0.4794, 0.5139)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3387
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9226 0.20379 0.50000 0.0000 0.5857
## Specificity 0.5744 0.98900 0.78222 1.0000 0.9876
## Pos Pred Value 0.4628 0.81646 0.32646 NaN 0.9143
## Neg Pred Value 0.9492 0.83805 0.88110 0.8361 0.9137
## Prevalence 0.2844 0.19358 0.17431 0.1639 0.1838
## Detection Rate 0.2624 0.03945 0.08716 0.0000 0.1076
## Detection Prevalence 0.5670 0.04832 0.26697 0.0000 0.1177
## Balanced Accuracy 0.7485 0.59640 0.64111 0.5000 0.7867
The random forest is the clear choice: its accuracy of 0.9777 is far better than the CART model's 0.4966. Now it is time to predict on the testing data, which has not been touched yet.
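For reference, the accuracy figure reported in the confusion matrices above (caret's `$overall["Accuracy"]`) is simply the proportion of predictions matching the reference labels. A minimal sketch on made-up factor vectors, not the actual fold data:

```r
# Hypothetical predictions vs. reference labels: 3 of 4 agree
pred <- factor(c("A", "B", "A", "A"), levels = c("A", "B"))
ref  <- factor(c("A", "B", "B", "A"), levels = c("A", "B"))

# Proportion of matching labels, i.e. the overall accuracy
acc <- mean(pred == ref)
acc  # 0.75
```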
finalRFModel <- train(classe ~ ., data=cleanTraining, method="rf")
finalPredictions <- predict(finalRFModel, newdata=testing)
finalPredictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The predictions were submitted to Coursera, with 20/20 correct.
Memory: 4GB RAM
Processor: Intel® Core™ i5-3317U CPU @ 1.70GHz
Processor Core Count: 4 Logical Threads
OS: Linux 4.0.8-200.fc21.x86_64
R version: R version 3.2.2 (2015-08-14)
R Package Versions: caret 6.0.58, randomForest 4.6.12, rpart 4.1.10, ggplot2 1.0.1, knitr 1.11
Report was generated using R Markdown and knitr in 137.360304 minutes at Sun 22 Nov 2015 05:56:48 PM CST
The data used is available here: http://groupware.les.inf.puc-rio.br/har
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.