10.4 Subset the Data
Often you have more data than a particular analysis requires, so you do not need to work with the entire data frame.
- Instead of working with the entire data set, we can take just a few rows or a few columns from the data frame that we created in Section 10.1. Suppose we create a new object dat1 from our data frame dat with the following code.
dat1 <- dat[1:3, ]
dat1
By doing this, you create a new data frame object that is based on our original data.
What data does dat1 contain? Data frames are organized in rows and columns, as we saw above. The code above copies the first 3 rows of dat into the new data frame dat1. Notice that all of the variables are copied to dat1. Note also that the Global Environment window now shows the new data frame, with only 3 observations and all of the original variables.
- We could also take just a few of the columns using the following code. Note the comma immediately after the opening bracket: the row position is left empty, so all rows are kept.
dat2 <- dat1[ , 2:3]
dat2
For those familiar with matrices, this notation of [r,c] is in the format of rows and columns. Entries before the comma represent the rows of the data frame, while entries after the comma represent the columns.
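The [r,c] notation can be tried on a small stand-in data frame. The sketch below is only an illustration: the object demo, its id column, and all of its values are made up for this example and are not the dat object from Section 10.1.

```r
# Hypothetical stand-in data frame; names and values are assumptions.
demo <- data.frame(
  id    = 1:5,
  t     = 1:5,
  accum = c(0.5, 1.1, 1.8, 2.6, 3.5)
)

first_rows <- demo[1:3, ]    # rows 1 to 3, all columns
two_cols   <- demo[ , 2:3]   # all rows, columns 2 and 3
one_cell   <- demo[2, 3]     # row 2, column 3: a single value

dim(first_rows)    # number of rows, then number of columns
names(two_cols)    # names of the columns that were kept
one_cell
```

Leaving the position before or after the comma empty means "keep everything" in that dimension.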
For larger data frames, counting column positions becomes tedious and error-prone. Instead, you can select columns by name:
dat3 <- dat1[ , c("t","accum") ]
dat3
Here we put quotes around the variable names (remember that R is case sensitive and that spaces matter), and we use c(), which combines the names into a single vector.
Be especially careful with the quotation marks. If you copy code from the tutorial, you may end up with smart quotes, which appear curly. If copied code does not work, try retyping the quotes so that they are straight quotes and not curly quotes. Check out straight quotes vs curly quotes to see the different looks of quotation marks.
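As a quick check of selection by name, the following sketch uses a hypothetical data frame demo (its columns and values are assumptions made for illustration). It shows that selecting by name and by position can give the same result, and that a name with the wrong capitalization is not found.

```r
# Hypothetical stand-in data frame; names and values are assumptions.
demo <- data.frame(id = 1:5, t = 1:5, accum = c(0.5, 1.1, 1.8, 2.6, 3.5))

by_name     <- demo[ , c("t", "accum")]
by_position <- demo[ , 2:3]
same_result <- identical(by_name, by_position)
same_result

# Names are case sensitive: "Accum" is not a column of demo.
name_found <- c("t", "Accum") %in% names(demo)
name_found

# Asking for a column that does not exist stops with an error;
# tryCatch() captures it here so the script keeps running.
outcome <- tryCatch(
  demo[ , c("t", "Accum")],
  error = function(e) "error"
)
outcome
```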
- Another way of subsetting the data frame to keep only a specific group of variables is:
keep <- c("t","accum")
keep
dat4 <- dat[1:3, keep]
dat4
Here we create a new object keep that contains just the names of the two variables we want to retain. Then we use that keep object in the subsetting. When subsetting a large number of variables this approach is useful as it helps keep the coding clear and well-documented.
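One possible benefit of storing the names in an object is that they can be checked before they are used. The sketch below assumes a hypothetical data frame demo (made up for illustration) and uses setdiff() to list any requested names that are missing from it.

```r
# Hypothetical stand-in data frame; names and values are assumptions.
demo <- data.frame(id = 1:5, t = 1:5, accum = c(0.5, 1.1, 1.8, 2.6, 3.5))

keep <- c("t", "accum")

# Names in keep that are NOT columns of demo (empty if all are present)
missing_names <- setdiff(keep, names(demo))
missing_names

# The same keep vector can be reused for different row selections
first_three <- demo[1:3, keep]
last_two    <- demo[4:5, keep]
names(last_two)
```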
- Or we might want to keep the rows of the data frame that satisfy a particular criterion:
dat5 <- dat[which(dat$t > 2), ]
dat5
After using which, you will notice that the new data frame (dat5) is smaller and contains only the rows where \(t > 2\). You can build more complicated logical expressions with the AND (&) and OR (|) logical operators. An example is presented in Section 12.4.
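To see what which returns and how the & and | operators combine conditions, here is a minimal sketch on a hypothetical data frame demo. The columns and values are assumptions chosen for illustration, not the tutorial's data.

```r
# Hypothetical stand-in data frame; names and values are assumptions.
demo <- data.frame(id = 1:5, t = 1:5, accum = c(0.5, 1.1, 1.8, 2.6, 3.5))

# which() returns the positions of the rows where the condition is TRUE
hits <- which(demo$t > 2)
hits

late <- demo[hits, ]

# AND: both conditions must hold for a row to be kept
both <- demo[which(demo$t > 2 & demo$accum < 3), ]

# OR: a row is kept if at least one condition holds
either <- demo[which(demo$t < 2 | demo$accum > 3), ]

nrow(late)
both$id
either$id
```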