Introduction

As I work more and more with R, I find myself repeating the same tasks. The most obvious place to start is loading data into R and getting a better understanding of its structure. The following steps are common to working with almost any data in R. (I authored this and my other R cheat sheets directly in R Markdown [1], which automatically displays the R code, runs it, and includes the output in the HTML.)

Load sample data using the read.table() R function

Let's load some sample data into a data frame to work with, using the R read.table() function [2].

require("data.table")
## Loading required package: data.table
sampleDT <- read.table("people-data.csv", # The people-data.csv file from the current working directory
                       header = TRUE, # Our CSV file has column header information in the first line
                       sep = ",", # Our CSV file uses a comma to separate each column
                       stringsAsFactors = FALSE, # keep text columns as character strings instead of converting them to factors
                       colClasses = c("Followers" = "numeric", "Dog.person" = "factor") # override the type of these two columns
)
head(sampleDT)
##   Person Province Dog.person Cat.person Followers Points
## 1  Alice       ON        Yes         No       771  8,892
## 2    Bob       ON        Yes        Yes       251  3,352
## 3   Chao       QC         No         No       558  2,223
## 4  David       ON        Yes        Yes       431    221
## 5   Eric       ON         No         No       589   5549
## 6   Fran       NB         No        Yes       121      0
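
If you want to try the call without the CSV file on disk, read.table() also accepts the file contents through its text argument. This is a minimal sketch, assuming the file begins with the six rows shown above. The remaining five rows are not reproduced, and the quoting around the Points values that contain commas is my assumption about how the file stores them. The names csvText and firstSix are made up for the example.

```r
# First six rows only, as printed by head() above.
# Assumption: Points values containing a comma are quoted in the file.
csvText <- 'Person,Province,Dog.person,Cat.person,Followers,Points
Alice,ON,Yes,No,771,"8,892"
Bob,ON,Yes,Yes,251,"3,352"
Chao,QC,No,No,558,"2,223"
David,ON,Yes,Yes,431,221
Eric,ON,No,No,589,5549
Fran,NB,No,Yes,121,0'

firstSix <- read.table(text = csvText,            # read from a string instead of a file
                       header = TRUE,
                       sep = ",",
                       stringsAsFactors = FALSE,
                       colClasses = c("Followers" = "numeric", "Dog.person" = "factor"))
head(firstSix)
```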

Explore the data we loaded

R provides a lot of ways to quickly get a sense of data - data types, dimensions, ranges of values, etc…

Quick R commands to understand a data table

Because we passed the header = TRUE argument in the read.table() call, we can use the column names directly in these sorts of commands:

| Operation | R command | Result |
|---|---|---|
| Display the dimensions | dim( sampleDT ) | 11, 6 |
| Or just the row count | nrow( sampleDT ) | 11 |
| ...and the column count | ncol( sampleDT ) | 6 |
| Find the column names | names( sampleDT ) | Person, Province, Dog.person, Cat.person, Followers, Points |
| Levels of the Dog.person factor | levels( sampleDT$Dog.person ) | No, Yes |
| Count the number of levels | length(levels( sampleDT$Dog.person )) | 2 |
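
The same idea extends to column types. As a rough sketch, sapply() can apply class() or is.numeric() to every column, which is handy when a script needs to pick columns by type. The demo data frame below is a hypothetical three-row stand-in for sampleDT, built from values shown above.

```r
# Hypothetical three-row stand-in for sampleDT
demo <- data.frame(Person = c("Alice", "Bob", "Chao"),
                   Dog.person = factor(c("Yes", "Yes", "No")),
                   Followers = c(771, 251, 558),
                   stringsAsFactors = FALSE)

colTypes <- sapply(demo, class)                        # one class name per column
numericCols <- names(demo)[sapply(demo, is.numeric)]   # names of the numeric columns

print(colTypes)
print(numericCols)
```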

The str() R function displays a lot of this information in a single call. It is meant for reading at the console, whereas the functions above return values that you can use programmatically:

str(sampleDT)
## 'data.frame':    11 obs. of  6 variables:
##  $ Person    : chr  "Alice" "Bob" "Chao" "David" ...
##  $ Province  : chr  "ON" "ON" "QC" "ON" ...
##  $ Dog.person: Factor w/ 2 levels "No","Yes": 2 2 1 2 1 1 2 1 1 1 ...
##  $ Cat.person: chr  "No" "Yes" "No" "Yes" ...
##  $ Followers : num  771 251 558 431 589 121 500 627 479 58 ...
##  $ Points    : chr  "8,892" "3,352" "2,223" "221" ...

Use the built-in R function summary() on a column to get more details. The function returns different information depending on the data type:

summary(sampleDT$Followers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    58.0   341.0   500.0   461.5   608.0   771.0
summary(sampleDT$Dog.person)
##  No Yes 
##   6   5
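
To make the type dependence concrete, here is a small sketch that calls summary() on a numeric, a factor, and a character vector. The vectors hold only the first six values of each column as shown by head() above, so the numbers differ from the full-column output.

```r
followers <- c(771, 251, 558, 431, 589, 121)                    # numeric
dogPerson <- factor(c("Yes", "Yes", "No", "Yes", "No", "No"))   # factor
person    <- c("Alice", "Bob", "Chao", "David", "Eric", "Fran") # character

numSummary <- summary(followers)   # min, quartiles, median, mean, max
facSummary <- summary(dogPerson)   # count per level
chrSummary <- summary(person)      # length, class and mode only

print(numSummary)
print(facSummary)
print(chrSummary)
```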

Display a quick graph

Or how about a quick graph of the Followers column to get a sense of the same data:

    par(las=2) # Draw axis labels perpendicular to the axis, so the names read horizontally
    barplot( sampleDT$Followers, horiz=TRUE, names.arg = sampleDT$Person)

I have a separate cheatsheet on creating a bar chart from MySQL data and another cheatsheet on graphing web usage statistics.

Find unique values in a column

The Province column has a limited number of values - here’s a sorted list of the unique values:

provinceUniqueSorted <- sort(unique(sampleDT$Province))
cat(provinceUniqueSorted)
## AB CA NB NS ON QC
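
If you also want to know how often each value occurs, table() gives the counts in one call. A short sketch, using only the Province values of the first six rows shown earlier (so the counts are not those of the full file):

```r
# Province values of the first six rows only
provinces <- c("ON", "ON", "QC", "ON", "ON", "NB")

provinceCounts <- table(provinces)                 # count per unique value, sorted by name
mostCommon <- sort(provinceCounts, decreasing = TRUE)  # most frequent first

print(provinceCounts)
print(names(mostCommon)[1])
```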

Clean up the values in a column

We loaded the Points column as a set of character strings, some of which use commas as thousands separators. We’ll change the column in place so that we can continue working with it as a number. (I realize we could/should have loaded this column as numeric to start with, as we did with the Followers column, where sum(sampleDT$Followers, na.rm = TRUE) simply returns 5076.)

Before: sampleDT$Points is: 8,892, 3,352, 2,223, 221, 5549, 0, 3589, 5579, 8,359, 0, 1,000

We’ll run one command to update the column… working from the inside, out, we’ll:

  1. remove the commas from the string with gsub()
  2. convert the string to a number with as.numeric()

sampleDT$Points <- as.numeric(gsub(",", "", sampleDT$Points)) # strip the commas, then convert to numeric

After: once updated, the column sampleDT$Points is: 8892, 3352, 2223, 221, 5549, 0, 3589, 5579, 8359, 0, 1000
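
The two steps can be checked on their own, outside the data frame. This sketch uses the Before values listed above; the variable names are made up for the example. The last line shows what happens to a string that is not a number at all: as.numeric() returns NA (with a warning, suppressed here).

```r
before <- c("8,892", "3,352", "2,223", "221", "5549", "0",
            "3589", "5579", "8,359", "0", "1,000")

noCommas <- gsub(",", "", before, fixed = TRUE)  # step 1: drop the thousands separators
after <- as.numeric(noCommas)                    # step 2: character -> numeric

print(after)

# A value that still is not numeric after cleaning becomes NA
notANumber <- suppressWarnings(as.numeric("n/a"))
print(notANumber)
```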

Find the sum of the values in a column

We can now find the sum with:

  • sum( sampleDT$Points ): 38764
  • sum( sampleDT$Followers ): 5076
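
Two small details are worth a sketch here: format() can print a large total with thousands separators instead of scientific notation, and na.rm = TRUE matters as soon as a column contains a missing value. The points vector repeats the cleaned values from above; the extra NA entry is hypothetical and only there to show the difference.

```r
points <- c(8892, 3352, 2223, 221, 5549, 0, 3589, 5579, 8359, 0, 1000)

total <- sum(points)
pretty <- format(total, big.mark = ",")   # "38,764"

withGap <- c(points, NA)                  # hypothetical missing value appended
naTotal <- sum(withGap)                   # NA: one missing value poisons the sum
okTotal <- sum(withGap, na.rm = TRUE)     # 38764: missing values are skipped

print(pretty)
print(naTotal)
print(okTotal)
```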

References


  1. More information on the version of R Markdown I’m using is available on the RStudio website. They describe it as “an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document.” Here’s a useful R Markdown cheat sheet that I’ve used often.

  2. Documentation for read.table() at https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html