12.5 Convert Numerical Data to Categorical

Suppose that you want to use the Income variable as a categorical variable instead of a numerical variable. You need to define how to parse the data into buckets. The first decision is the number of buckets. The second decision is how to allocate the data into those buckets.

For the Education variable example in Section 12.1, we chose three buckets, but also suggested that more (or fewer) could be used.

For Income, one way would be to choose a number of buckets and make them equally sized. Another way is to examine the distribution and decide on reasonable split points (sometimes called cut points).
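As a rough sketch of the first approach, "equally sized" can mean either equal width (each bucket spans the same range of values) or equal count (each bucket holds about the same number of observations). The income vector below is hypothetical and only for illustration; it is not the data set used in this chapter.

```r
# Hypothetical income values, for illustration only (not the chapter's data)
income <- c(3100, 3600, 3900, 4200, 4500, 4700, 4900, 5300, 6300)

# Equal-width buckets: cut() splits the observed range into 3 equal intervals
equal_width <- cut(income, breaks = 3, labels = c("Low", "Middle", "High"))
table(equal_width)

# Equal-count buckets: use the tercile values as the cut points
terciles <- quantile(income, probs = c(0, 1/3, 2/3, 1))
equal_count <- cut(income, breaks = terciles, include.lowest = TRUE,
                   labels = c("Low", "Middle", "High"))
table(equal_count)
```

Equal-width buckets can end up with very different counts when the data are skewed, while equal-count buckets can place similar incomes in different categories. Which trade-off is acceptable depends on the analysis.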

To examine the distribution, we could plot a histogram of the data, summarize the data (as in Section 12.3), or, since the data set is small, sort the data from smallest to largest.

hist(dat$Income)
summary(dat$Income)
sort(dat$Income)

To define the new categorical variable we use the following code:

dat <- within(dat, {
  Income.cat <- NA  # initialize the variable before filling it in
  Income.cat[Income < 4000] <- "Low"
  Income.cat[Income >= 4000 & Income < 5000] <- "Middle"
  Income.cat[Income >= 5000] <- "High"
})

This code defines the new categorical income variable Income.cat and automatically adds it to the data frame (dat). The cut points are set so that the median falls in the middle of the Middle category. You could stop with this code and feel good.

However, note in the output of the code that follows that Income.cat is shown as chr, a character variable, and that the results of the summary() function are not meaningful.

str(dat)
summary(dat$Income.cat)

The code below defines Income.cat as a factor variable and sets the ordering of the buckets with the levels argument of factor(). The category listed first is called the reference category. This ordering is important for some analytic methods that you will complete.

After the conversion, the output of str() and summary() reflects Income.cat as a factor variable.

dat$Income.cat <- factor(dat$Income.cat, levels = c("High", "Middle", "Low"))
 
str(dat)
  
summary(dat$Income.cat)
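The bucketing and the conversion to a factor can also be done together with cut(), which returns a factor directly. The following is a minimal sketch under stated assumptions: dat_demo is a hypothetical stand-in for dat with made-up incomes, and the breaks mirror the cut points used above (below 4000, 4000 up to but not including 5000, and 5000 or more).

```r
# Hypothetical data frame standing in for dat (values are illustrative only)
dat_demo <- data.frame(Income = c(3100, 3999, 4000, 4500, 4999, 5000, 6300))

# right = FALSE closes each interval on the left, e.g. [4000, 5000)
dat_demo$Income.cat <- cut(dat_demo$Income,
                           breaks = c(-Inf, 4000, 5000, Inf),
                           labels = c("Low", "Middle", "High"),
                           right = FALSE)

# cut() orders levels from the lowest interval to the highest, so reorder
# them to make "High" the first (reference) level, as in the text
dat_demo$Income.cat <- factor(dat_demo$Income.cat,
                              levels = c("High", "Middle", "Low"))

str(dat_demo)
summary(dat_demo$Income.cat)
```

Note that right = FALSE matters here: with the default right = TRUE, an income of exactly 4000 would fall in the Low bucket rather than Middle.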