As I work more and more with R I find myself repeating the same tasks, and the most obvious place to start is with loading data into R and better understanding its structure. The following steps are pretty common for working with any data in R. (I authored this and my other R cheat sheets directly in R Markdown [1], which automatically displays the R code, runs the code, then includes the output in the HTML.)
Let’s load some sample data into a data frame to work with, using the R read.table() function [2].
require("data.table") # Optional here: read.table() ships with base R (utils) and returns a data.frame
## Loading required package: data.table
sampleDT <- read.table("people-data.csv", # The people-data.csv file from the current working directory
  header = TRUE, # Our CSV file has column header information in the first line
  sep = ",", # Our CSV file uses a comma to separate each column
  stringsAsFactors = FALSE, # FALSE so that text columns are loaded as character strings, not factors
  colClasses = c("Followers"="numeric", "Dog.person"="factor") # Override the type of these two columns
)
head(sampleDT)
## Person Province Dog.person Cat.person Followers Points
## 1 Alice ON Yes No 771 8,892
## 2 Bob ON Yes Yes 251 3,352
## 3 Chao QC No No 558 2,223
## 4 David ON Yes Yes 431 221
## 5 Eric ON No No 589 5549
## 6 Fran NB No Yes 121 0
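If you don't have a people-data.csv file handy, the same call can be tried against inline text. This is only a minimal sketch: the six rows are copied from the head() output above, demoCsv and demoDF are hypothetical names, and it assumes the original file wraps values such as 8,892 in double quotes (otherwise the comma separator would split them into two columns).

```r
# Inline stand-in for people-data.csv (first six rows only).
# Assumption: values containing a comma are double-quoted in the file.
demoCsv <- 'Person,Province,Dog.person,Cat.person,Followers,Points
Alice,ON,Yes,No,771,"8,892"
Bob,ON,Yes,Yes,251,"3,352"
Chao,QC,No,No,558,"2,223"
David,ON,Yes,Yes,431,221
Eric,ON,No,No,589,5549
Fran,NB,No,Yes,121,0
'

# Same arguments as the call above, but reading from text instead of a file
demoDF <- read.table(text = demoCsv,
  header = TRUE,
  sep = ",",
  stringsAsFactors = FALSE,
  colClasses = c("Followers" = "numeric", "Dog.person" = "factor")
)

head(demoDF)
```

Because one value in the Points column contains a comma, R cannot treat the column as numeric and loads it as character, which matches the str() output later in this cheat sheet.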
R provides a lot of ways to quickly get a sense of data - data types, dimensions, ranges of values, etc…
Because we told read.table() that the file has a header row, we can use the column names directly in these sorts of commands:
Operation | R command | Result |
---|---|---|
Display the dimensions (rows, columns) | dim( sampleDT ) | 11, 6 |
Or just the row count | nrow( sampleDT ) | 11 |
…and column count | ncol( sampleDT ) | 6 |
Find the column names | names( sampleDT ) | Person, Province, Dog.person, Cat.person, Followers, Points |
Levels of the Dog.person factor | levels( sampleDT$Dog.person ) | No, Yes |
Count the number of levels | length(levels( sampleDT$Dog.person )) | 2 |
The str() R function displays a lot of this information in one call. It is useful for eyeballing the data, whereas the functions above are the ones you would use programmatically:
str(sampleDT)
## 'data.frame': 11 obs. of 6 variables:
## $ Person : chr "Alice" "Bob" "Chao" "David" ...
## $ Province : chr "ON" "ON" "QC" "ON" ...
## $ Dog.person: Factor w/ 2 levels "No","Yes": 2 2 1 2 1 1 2 1 1 1 ...
## $ Cat.person: chr "No" "Yes" "No" "Yes" ...
## $ Followers : num 771 251 558 431 589 121 500 627 479 58 ...
## $ Points : chr "8,892" "3,352" "2,223" "221" ...
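Where str() is meant for reading, the individual functions return values that code can act on. A small sketch of that idea, using a hypothetical three-column stand-in for sampleDT (demoDF, columnClasses and numericColumns are names made up for this example):

```r
# Hypothetical stand-in for sampleDT, built inline so the example runs on its own
demoDF <- data.frame(
  Person = c("Alice", "Bob", "Chao"),
  Dog.person = factor(c("Yes", "Yes", "No")),
  Followers = c(771, 251, 558),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector
columnClasses <- sapply(demoDF, class)
print(columnClasses)

# Names of the numeric columns only
numericColumns <- names(demoDF)[sapply(demoDF, is.numeric)]
print(numericColumns)
```

The same pattern works with nrow(), ncol() and levels() whenever a script needs to branch on the shape or type of the data.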
Use the built-in R function summary() on a column to get more details. The function returns different information depending on the data type:
summary(sampleDT$Followers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58.0 341.0 500.0 461.5 608.0 771.0
summary(sampleDT$Dog.person)
## No Yes
## 6 5
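To see how the result changes with the data type, here is a short sketch that calls summary() on three standalone vectors. The values are the first six rows from the head() output; the vector names are made up for this example.

```r
followers <- c(771, 251, 558, 431, 589, 121)                    # numeric
dogPerson <- factor(c("Yes", "Yes", "No", "Yes", "No", "No"))   # factor
person    <- c("Alice", "Bob", "Chao", "David", "Eric", "Fran") # character

summary(followers)  # min, quartiles, median, mean and max
summary(dogPerson)  # a count per factor level
summary(person)     # little more than the length and type of the vector
```

This is one reason for loading Dog.person as a factor: the summary of a factor is a count per level, while a plain character column gets only a generic description.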
Or how about a quick graph of the Followers column to get a sense of the same data:
par(las=2) # Want the titles to be horizontal
barplot( sampleDT$Followers, horiz=TRUE, names.arg = sampleDT$Person)
I have a separate cheatsheet on creating a bar chart from MySQL data and another cheatsheet on graphing web usage statistics.
The Province column has a limited number of values. Here’s a sorted list of the unique values:
provinceUniqueSorted <- sort(unique(sampleDT$Province))
cat(provinceUniqueSorted)
## AB CA NB NS ON QC
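If you also want to know how often each value occurs, base R's table() function is one option. A minimal sketch on a standalone vector (the six Province values from the head() output; province and provinceCounts are names made up for this example):

```r
province <- c("ON", "ON", "QC", "ON", "ON", "NB")

# Sorted unique values, as in the command above
sort(unique(province))

# Number of rows per province
provinceCounts <- table(province)
print(provinceCounts)

# Most common province first
sort(provinceCounts, decreasing = TRUE)
```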
We loaded the Points column as a set of characters, with commas as thousands separators, and not all people have a value. We’ll change the column in place so that we can continue working with it as a number. (I realize we could/should have loaded this column as numeric to start with, like we did with the Followers column, where we can just run sum(sampleDT$Followers, na.rm = TRUE) = 5076.)
Before: sampleDT$Points is: 8,892, 3,352, 2,223, 221, 5549, 0, 3589, 5579, 8,359, 0, 1,000
We’ll run one command to update the column. Working from the inside out, we’ll:
use gsub() to remove the commas from each value
use as.numeric() to convert the cleaned-up strings to numbers
sampleDT[,6] <- as.numeric(gsub(",", "", sampleDT[,6])) # Column 6 is Points; as.numeric() has no ignore.na argument, so it is dropped
After: once updated, the column sampleDT$Points is: 8892, 3352, 2223, 221, 5549, 0, 3589, 5579, 8359, 0, 1000
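The inside-out steps can also be run one at a time, which makes each intermediate result easy to inspect. A minimal sketch on a standalone vector: the first six values come from the Points column above, and the trailing NA is added here only to show what happens when a person has no value. The variable names are made up for this example.

```r
pointsRaw <- c("8,892", "3,352", "2,223", "221", "5549", "0", NA)

# Step 1 (innermost): strip the thousands separators; NA stays NA
pointsNoCommas <- gsub(",", "", pointsRaw, fixed = TRUE)

# Step 2 (outermost): convert the cleaned strings to numbers
pointsNumeric <- as.numeric(pointsNoCommas)
print(pointsNumeric)

# na.rm = TRUE tells sum() to skip the missing value
pointsTotal <- sum(pointsNumeric, na.rm = TRUE)
print(pointsTotal)
```

Missing values are handled when summing (na.rm = TRUE), not when converting, which is why as.numeric() needs no extra argument.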
We can now find the sums with:
sum( sampleDT$Points ): 38764
sum( sampleDT$Followers ): 5076
[1] More information on the version of R Markdown I’m using is available on the RStudio website. They describe it as “an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document.” Here’s a useful R Markdown cheat sheet that I’ve used often.
[2] Documentation for read.table() at https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html