When you work with machine learning algorithms, you must be mindful of how each algorithm treats the data you feed into it. Often the raw values must first be transformed into a usable form. With categorical data, an ML algorithm given raw values, or arbitrary numeric codes standing in for them, can misinterpret them, for example by treating the codes as ordered quantities, which leads to improper results. A common solution for categorical data is known as One Hot Encoding.
This process creates an additional column for each distinct value in the categorical data set. For each row, exactly one of the new columns is set to true (1), while the rest are set to false (0). In our example, imagine these new columns as booleans named Is_apple, Is_banana, and Is_coconut. The value "banana" would then have Is_apple=0, Is_banana=1, and Is_coconut=0. Using the tools below, we will create a vector that can be used as input to ML algorithms.
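To make the idea concrete before bringing in Spark, here is a minimal plain-Scala sketch of the encoding described above. The `categories` list and the `oneHot` helper are hypothetical names introduced only for illustration; they are not part of any library.

```scala
// Categories from the running example, in a fixed order.
val categories = Vector("apple", "banana", "coconut")

// Hypothetical helper: one 0/1 entry per category,
// with a 1 only at the position of the matching category.
def oneHot(value: String): Vector[Int] =
  categories.map(c => if (c == value) 1 else 0)

val banana = oneHot("banana") // Vector(0, 1, 0)
```

Each position in the result plays the role of one of the Is_apple, Is_banana, and Is_coconut columns.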
Step 1 - Import Libraries and Create a DataFrame
After importing the proper libraries, this step creates a DataFrame for use in the example. This DataFrame will act as our raw categorical data. One column must contain the categorical value, which is a fruit name in this example, and another must contain an ID value that is paired with exactly one category. In our example, the value 'apple' is only associated with the ID 0.
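A minimal sketch of this step might look like the following. The column names `id` and `fruit`, the application name, and the sample rows are assumptions chosen to match the description above, not values taken from an original listing.

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

// Local session for experimentation; spark-shell already provides one named `spark`,
// in which case getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .appName("OneHotEncodingExample") // assumed name
  .master("local[*]")
  .getOrCreate()

// Illustrative rows: each id is paired with exactly one fruit name.
val df = spark.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, "coconut")
)).toDF("id", "fruit")
```

If your data uses different column names, adjust the input column of the indexer in the next step to match.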
Step 2 - StringIndexer
The StringIndexer maps a string column of labels to an ML column of label indices. In our example, it scans the DataFrame and assigns a matching index to each distinct value in the fruit name column. By default, indices are assigned by label frequency, so the most frequent label gets index 0.
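// Column names "fruit" and "fruitIndex" are assumed; match them to your DataFrame.
val indexer = new StringIndexer().setInputCol("fruit").setOutputCol("fruitIndex")
// StringIndexer is an estimator: fit it to learn the labels, then transform.
val indexed = indexer.fit(df).transform(df)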
Step 3 - OneHotEncoder
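The OneHotEncoder maps a column of category indices to a column of binary vectors. In our example, it converts each index into a binary vector in which at most one entry is set to 1 (true, or "hot"). Note that by default Spark's OneHotEncoder drops the last category (dropLast is true), so that category is represented by an all-zeros vector.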
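// Column names are assumed. On Spark 3.0+, OneHotEncoder is an estimator: use encoder.fit(indexed).transform(indexed).
val encoder = new OneHotEncoder().setInputCol("fruitIndex").setOutputCol("fruitVec")
val encoded = encoder.transform(indexed)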
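If you want every category, including the last one, to get its own slot in the vector, the encoder's `dropLast` parameter can be turned off. The following is a sketch under the assumption that the index column is named `fruitIndex` and the output column `fruitVec`; `fullEncoder` is a name chosen here for illustration.

```scala
import org.apache.spark.ml.feature.OneHotEncoder

// Assumed column names; adjust to match the output of your StringIndexer.
val fullEncoder = new OneHotEncoder()
  .setInputCol("fruitIndex")
  .setOutputCol("fruitVec")
  .setDropLast(false) // keep a slot for the last category instead of encoding it as all zeros
```

With three fruits, this setting yields vectors of length 3 rather than 2. The default of dropping one category avoids a redundant column, since its value is implied by the others.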
Step 4 - Display results
This step displays the initial id value from the first step alongside its associated output vector. This vector can be used to represent categorical data as input to ML algorithms.
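One way to display the result is sketched below. The helper `showEncoding` is hypothetical, and its default column names `id` and `fruitVec` are assumptions; pass the names your own DataFrame actually uses.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper; the default column names "id" and "fruitVec" are assumptions.
def showEncoding(encoded: DataFrame, idCol: String = "id", vecCol: String = "fruitVec"): Unit =
  encoded.select(idCol, vecCol).show(false) // false = do not truncate long cell values
```

Calling `showEncoding(encoded)` prints one row per id. Spark typically prints these vectors in sparse form, (size, [indices], [values]), so a vector of length 2 with a 1.0 at position 0 appears as (2,[0],[1.0]).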