The purpose of this project is to demonstrate collecting, manipulating, and cleaning a data set. Using data collected from the accelerometers of the Samsung Galaxy S smartphone (available at http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones), I merged data sets, extracted specific subsets, performed general cleaning, ran some calculations, and exported the resulting extract for use in future analysis.
README.md
Documentation explaining the project and how to use the files contained in the repository.
CodeBook.md
Codebook describing the tidydataoutput.txt file layout.
run_analysis.R
R script to download, merge, extract, clean, and subset the datasets. See process section below for further details.
tidydataoutput.txt
Final data extraction for future analysis.
The Process
I created the script "run_analysis.R" which does the following:
It downloads and unzips the original data sets, loads the necessary libraries (library(dplyr) and library(data.table)), and reads the data into R. The test and training data sit in two distinct directories, each containing three text files: the experiment subject (read in as testsubject or trainsubject), the activity (read in as testy or trainy), and the features (read in as testx or trainx).
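A minimal sketch of the read step, assuming the files are plain whitespace-delimited text without a header row. The temporary files and their values below are made-up stand-ins for the real data, and read.table() is one reasonable way to load them; the script itself may read them differently.

```r
# Hypothetical stand-ins for a subject file and a features file
subjectfile  <- tempfile(fileext = ".txt")
featuresfile <- tempfile(fileext = ".txt")
writeLines(c("1", "1", "2"), subjectfile)
writeLines(c("0.1 0.2", "0.3 0.4", "0.5 0.6"), featuresfile)

# One row per observation; no header line in either file
testsubject <- read.table(subjectfile, header = FALSE, col.names = "Subject")
testx       <- read.table(featuresfile, header = FALSE)

dim(testsubject)
dim(testx)
```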
It joins like data sets (e.g., testx with trainx and testsubject with trainsubject) to create three objects with the test and train populations combined. colnames() is used to clean up the column names, %>% relocate(Subject) %>% is used to arrange all three objects, and the objects are then joined into a single data set via cbind() to create the totaldata object.
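The merge can be sketched on toy data as follows. The values and the f1/f2 column names are made up, row-wise stacking with rbind() is an assumption about how the test and train sets are combined, and the column order is set directly in cbind() rather than with dplyr's relocate() so that the example runs in base R.

```r
# Toy stand-ins for the six objects read from disk (values are made up)
testx        <- data.frame(f1 = c(0.1, 0.2), f2 = c(0.3, 0.4))
trainx       <- data.frame(f1 = c(0.5, 0.6, 0.7), f2 = c(0.8, 0.9, 1.0))
testy        <- data.frame(V1 = c(1, 2))
trainy       <- data.frame(V1 = c(3, 1, 2))
testsubject  <- data.frame(V1 = c(2, 2))
trainsubject <- data.frame(V1 = c(1, 1, 3))

# Stack test rows on top of train rows for each kind of data
features <- rbind(testx, trainx)
activity <- rbind(testy, trainy)
subject  <- rbind(testsubject, trainsubject)

# Give the single-column objects descriptive names
colnames(activity) <- "Activity"
colnames(subject)  <- "Subject"

# Bind the three objects column-wise, with Subject first
totaldata <- cbind(subject, activity, features)
names(totaldata)
```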
It creates a subset of the data containing only the mean and standard deviation measurements, stored in the object extractedtotaldata. I utilized grep("mean\\(\\)|std\\(\\)", names(totaldata), ignore.case = TRUE) to find the column names (features) that contain mean() or std(). The parentheses are escaped with double backslashes so that they are matched literally.
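A small sketch of the column selection on a toy totaldata. The measurement values are made up, and keeping Subject and Activity by position is an assumption about how the script retains the identifier columns. Because the pattern requires the literal "()" right after "mean" or "std", a name such as fBodyAcc-meanFreq()-X is not selected.

```r
# Toy data set with illustrative feature names; check.names = FALSE keeps
# the parentheses and dashes in the column names intact
totaldata <- data.frame(
  Subject  = c(1, 2),
  Activity = c("WALKING", "SITTING"),
  `tBodyAcc-mean()-X`     = c(0.1, 0.2),
  `tBodyAcc-std()-X`      = c(0.3, 0.4),
  `tBodyAcc-mad()-X`      = c(0.5, 0.6),
  `fBodyAcc-meanFreq()-X` = c(0.7, 0.8),
  check.names = FALSE
)

# Indices of columns whose names contain a literal "mean()" or "std()"
keep <- grep("mean\\(\\)|std\\(\\)", names(totaldata), ignore.case = TRUE)

# Keep the two identifier columns (assumed to be first) plus the matches
extractedtotaldata <- totaldata[, c(1, 2, keep)]
names(extractedtotaldata)
```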
It cleans up the variable names to make them more descriptive, utilizing the gsub() function.
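As an illustration of the renaming approach, the sketch below applies a chain of gsub() calls to two sample feature names. The specific substitutions (for example "Acc" to "Accelerometer") are hypothetical and chosen only to show the technique; the replacements actually used are the ones in run_analysis.R.

```r
# Two sample feature names in the style of the raw data
oldnames <- c("tBodyAcc-mean()-X", "fBodyGyro-std()-Z")
newnames <- oldnames

# Hypothetical substitutions, applied one after another
newnames <- gsub("^t", "Time", newnames)            # leading t = time domain
newnames <- gsub("^f", "Frequency", newnames)       # leading f = frequency domain
newnames <- gsub("Acc", "Accelerometer", newnames)
newnames <- gsub("Gyro", "Gyroscope", newnames)
newnames <- gsub("-mean\\(\\)", "Mean", newnames)   # escape the literal parentheses
newnames <- gsub("-std\\(\\)", "Std", newnames)

newnames
```

On a data frame, the same calls would be applied to names(extractedtotaldata) and the result assigned back.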
It creates a second, independent tidy data set with the average of each variable for each activity and each subject. I utilized the following code to compute the averages and then order the result: tidydata <- aggregate(. ~ Subject + Activity, extractedtotaldata, mean) and tidydata <- tidydata[order(tidydata$Subject, tidydata$Activity), ].
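The averaging step can be tried on a toy extractedtotaldata. The MeanX column and all values are made up; the aggregate() and order() calls follow the description above.

```r
# Toy input: repeated observations per Subject/Activity pair
extractedtotaldata <- data.frame(
  Subject  = c(2, 1, 1, 2, 1),
  Activity = c("WALKING", "SITTING", "WALKING", "WALKING", "WALKING"),
  MeanX    = c(1, 2, 3, 5, 7)
)

# Average every remaining column within each Subject/Activity group
tidydata <- aggregate(. ~ Subject + Activity, extractedtotaldata, mean)

# Sort by subject, then by activity
tidydata <- tidydata[order(tidydata$Subject, tidydata$Activity), ]
tidydata
```

Each row of the result holds one subject and activity pair, so the five input rows collapse to three.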