12.1 Types of Data

For any analytics study, you need data. You can think of data as:

  • Numerical (quantitative)
  • Categorical (qualitative)
  • Textual (quantitative & qualitative)

Numerical data can be integers (discrete) or real numbers (continuous). Examples include Age, Income, and Education (in years).

Categorical data sort observations into buckets. They are either nominal, with no natural order among the buckets (such as Employment Status, Marital Status, or Occupation), or ordinal, with a natural order (such as student course letter grades).

We could define Employment Status as:

\[ Employment \: Status = \left\{ \begin{array}{ll} 1 & \mbox{If Employed} \\ 2 & \mbox{If Job Hunting} \\ 3 & \mbox{If Not Looking For Work}\end{array} \right. \]

The codes for Employed (1), Job Hunting (2), and Not Looking For Work (3) are not meaningful in themselves; they are only numerical placeholders that identify the buckets when coding the data. We could just as easily have coded the variable Employment Status as:

\[ Employment \: Status = \left\{ \begin{array}{ll} 7 & \mbox{If Employed} \\ 18 & \mbox{If Job Hunting} \\ 33 & \mbox{If Not Looking For Work}\end{array} \right. \]
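A short Python sketch may make the point concrete. The survey responses and the helper name `bucket_counts` are invented for illustration and are not from the text; only the two codings come from the definitions above. Counting observations per bucket gives the same answer under either coding, whereas arithmetic on the codes (such as a mean) changes with the labels, which is why such arithmetic is not meaningful for nominal data.

```python
from collections import Counter

# Hypothetical survey responses, invented for illustration only.
responses = ["Employed", "Job Hunting", "Employed",
             "Not Looking For Work", "Employed"]

# The two codings of Employment Status defined in the text.
coding_a = {"Employed": 1, "Job Hunting": 2, "Not Looking For Work": 3}
coding_b = {"Employed": 7, "Job Hunting": 18, "Not Looking For Work": 33}


def bucket_counts(values, coding):
    """Code each response numerically, then count observations per bucket."""
    coded = [coding[v] for v in values]
    counts = Counter(coded)
    # Report counts by bucket label so the two codings can be compared.
    return {label: counts[code] for label, code in coding.items()}


counts_a = bucket_counts(responses, coding_a)
counts_b = bucket_counts(responses, coding_b)
print(counts_a == counts_b)  # True: the bucket counts do not depend on the codes

# The mean of the codes, by contrast, depends entirely on the arbitrary labels.
mean_a = sum(coding_a[v] for v in responses) / len(responses)
mean_b = sum(coding_b[v] for v in responses) / len(responses)
print(mean_a, mean_b)
```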

In certain settings it is beneficial to convert numerical data such as Age, Income or Education (in years) to a categorical variable. For Education, we could define the number of years in school as:

\[ Education = \left\{ \begin{array}{ll} 1 & \mbox{If Less Than High School} \\ 2 & \mbox{If High School Degree} \\ 3 & \mbox{If More Than High School Degree} \end{array} \right. \]

Of course, there is no limit on the number of categories. You could group the Education variable in other ways as you see fit.
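As a minimal sketch, the grouping of Education could be coded in Python as follows. The cut point of 12 years for a high school degree is an assumption made for this example (the text does not state one), and the function name `education_category` and the sample values are hypothetical.

```python
def education_category(years, hs_years=12):
    """Map years of schooling to the three Education categories.

    hs_years is an assumed cut point: the number of years taken to
    correspond to a high school degree.
    """
    if years < hs_years:
        return 1  # Less Than High School
    if years == hs_years:
        return 2  # High School Degree
    return 3      # More Than High School Degree


# Hypothetical years of schooling, invented for illustration only.
years_in_school = [8, 12, 16, 11, 14]
categories = [education_category(y) for y in years_in_school]
print(categories)  # [1, 2, 3, 1, 3]
```

Changing `hs_years`, or adding further branches, gives a different grouping of the same underlying numerical variable.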

A special case of a categorical variable is an indicator variable, sometimes referred to as a binary or dummy variable. Here we could define Employment Status as simply Currently Employed:

\[ Currently \: Employed = \left\{ \begin{array}{ll} 1 & \mbox{If Employed} \\ 0 & \mbox{Otherwise} \end{array} \right. \]

This approach collapses the two categories Job Hunting and Not Looking For Work into one. Collapse categories with care: the categories being merged should be homogeneous (similar to one another) so that your results remain valid.

Because it takes only the values 0 and 1, you could use Currently Employed as either a categorical or a numeric variable.
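The following Python sketch illustrates one reason an indicator can be treated as numeric: the mean of a 0/1 variable equals the proportion of observations coded 1. The responses are hypothetical and invented for this example.

```python
# Hypothetical survey responses, invented for illustration only.
responses = ["Employed", "Job Hunting", "Employed",
             "Not Looking For Work", "Employed"]

# Indicator variable: 1 if Employed, 0 otherwise. This collapses
# Job Hunting and Not Looking For Work into a single category.
currently_employed = [1 if r == "Employed" else 0 for r in responses]
print(currently_employed)  # [1, 0, 1, 0, 1]

# Treated as numeric, the mean of the indicator is the share employed.
share_employed = sum(currently_employed) / len(currently_employed)
print(share_employed)
```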

Textual data are data collected from writings or electronic databases. Methods for mining text data are beyond the scope of this book.