When you work with machine learning algorithms, you must be mindful of how each algorithm treats the data you feed it. Often a transformation step is required to put the values into a usable form. With categorical data, many ML algorithms cannot consume raw string values, and if the categories are simply replaced by numbers the algorithm may treat those numbers as ordered quantities, which leads to improper results. A common solution for categorical data is known as One Hot Encoding.

This process creates an additional column for each distinct value in the categorical data set. For each row, exactly one of the new columns is set to true (1) and the rest are set to false (0). In our example below, imagine these new columns as booleans titled Is_apple, Is_banana, and Is_coconut. Continuing the example, the value "banana" would have Is_apple=0, Is_banana=1, and Is_coconut=0. Using the tools below, we will create a vector that can be used as input to ML algorithms.
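
To make the idea concrete before bringing in Spark, here is a minimal plain-Scala sketch of the encoding described above. The object name OneHotSketch and the encode helper are hypothetical names chosen for illustration; they are not part of Spark.

```scala
// Plain-Scala sketch of one-hot encoding; no Spark required.
// OneHotSketch and encode are illustrative names, not a library API.
object OneHotSketch {
  // Categories in the order of their indicator columns:
  // Is_apple, Is_banana, Is_coconut.
  val categories: Vector[String] = Vector("apple", "banana", "coconut")

  // Returns one 0/1 flag per category; exactly one flag is 1 for a known value.
  def encode(value: String): Vector[Int] = {
    require(categories.contains(value), s"unknown category: $value")
    categories.map(c => if (c == value) 1 else 0)
  }

  def main(args: Array[String]): Unit = {
    println(encode("banana")) // prints Vector(0, 1, 0)
  }
}
```

Each position in the returned vector plays the role of one of the Is_ columns, so "banana" lights up only the second position.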

Step 1 - Import Libraries and Create a DataFrame

After importing the proper libraries, the following code creates a dataframe to be used in this example. This dataframe acts as our raw categorical data. One column must contain the categorical values, which are fruit names here, and each distinct value is paired with its own ID. In our example, the value 'apple' is associated only with the ID 0.

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = spark.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, "coconut"),
  (1, "banana"),
  (2, "coconut")
)).toDF("id", "fruit")

Step 2 - StringIndexer

The StringIndexer maps a string column of labels to an ML column of label indices. In our example, it scans the dataframe and assigns an index to each distinct value in the fruit column. By default the indices are ordered by label frequency, so the most frequent label gets index 0.

val indexer = new StringIndexer()
  .setInputCol("fruit")
  .setOutputCol("fruitIndex")
  .fit(df)

val indexed = indexer.transform(df)
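
The indexing step can be sketched in plain Scala to show what kind of mapping to expect. This is only an approximation of the idea, not Spark's implementation: StringIndexSketch and fitLabels are hypothetical names, and breaking ties alphabetically is an assumption made here so the result is deterministic.

```scala
// Plain-Scala sketch of frequency-based label indexing; no Spark required.
// StringIndexSketch and fitLabels are illustrative names, not a library API.
object StringIndexSketch {
  // Assigns 0.0 to the most frequent label, 1.0 to the next, and so on.
  // Assumption: labels with equal counts are ordered alphabetically.
  def fitLabels(values: Seq[String]): Map[String, Double] = {
    val counts = values.groupBy(identity).map {
      case (label, occurrences) => (label, occurrences.size)
    }
    counts.toSeq
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, idx) => label -> idx.toDouble }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val fruits = Seq("apple", "banana", "coconut", "banana", "coconut")
    val index = fitLabels(fruits)
    // Print each input value next to the index it was assigned.
    fruits.foreach(f => println(s"$f -> ${index(f)}"))
  }
}
```

Under these assumptions "banana" and "coconut" (two occurrences each) receive the lowest indices and "apple" the highest, which is why the indices need not match the id column from Step 1.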

Step 3 - OneHotEncoder

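The OneHotEncoder maps a column of category indices to a column of binary vectors. In our example, it converts each index into a binary vector in which at most one element is set to 1 ("hot"). By default the encoder drops the last category (dropLast = true), so with three fruits each vector has two elements and the last category is represented by all zeros.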

val encoder = new OneHotEncoder()
  .setInputCol("fruitIndex")
  .setOutputCol("fruitVec")

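// Spark 2.x: OneHotEncoder is a Transformer, so transform can be called directly.
// On Spark 3.0 and later it is an Estimator: use encoder.fit(indexed).transform(indexed).
val encoded = encoder.transform(indexed)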
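
Spark's OneHotEncoder has a dropLast parameter that defaults to true, which means the last category index is encoded as an all-zero vector and the vector is one element shorter than the number of categories. The following plain-Scala sketch illustrates that behaviour; DropLastSketch and its encode helper are hypothetical names, and the code is an approximation of the idea rather than Spark's own implementation.

```scala
// Plain-Scala sketch of one-hot encoding with an optional dropped last category.
// DropLastSketch and encode are illustrative names, not a library API.
object DropLastSketch {
  // index: category index from the indexing step (0-based).
  // numCategories: number of distinct categories.
  // dropLast: when true, the last category has no slot and encodes as all zeros.
  def encode(index: Int, numCategories: Int, dropLast: Boolean): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index out of range: $index")
    val size = if (dropLast) numCategories - 1 else numCategories
    Vector.tabulate(size)(i => if (i == index) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    println(encode(0, 3, dropLast = true))  // prints Vector(1.0, 0.0)
    println(encode(2, 3, dropLast = true))  // prints Vector(0.0, 0.0)
    println(encode(2, 3, dropLast = false)) // prints Vector(0.0, 0.0, 1.0)
  }
}
```

If you want one element per category in the Spark output, calling .setDropLast(false) on the encoder is the corresponding setting.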

Step 4 - Display results

This code displays the id value from the first step next to its associated output vector. This vector can be used to represent categorical data as input to ML algorithms.

encoded.select("id", "fruitVec").show()
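
The fruitVec column is printed in Spark's sparse vector notation, for example (2,[0],[1.0]), which reads as (size, indices of non-zero elements, values at those indices). The sketch below expands that notation into a dense vector in plain Scala; SparseSketch and toDense are hypothetical helper names used only for illustration.

```scala
// Plain-Scala sketch that expands (size, indices, values) into a dense vector.
// SparseSketch and toDense are illustrative names, not a library API.
object SparseSketch {
  def toDense(size: Int, indices: Seq[Int], values: Seq[Double]): Vector[Double] = {
    require(indices.length == values.length, "indices and values must align")
    val byIndex = indices.zip(values).toMap
    // Every position not listed in indices is implicitly 0.0.
    Vector.tabulate(size)(i => byIndex.getOrElse(i, 0.0))
  }

  def main(args: Array[String]): Unit = {
    println(toDense(2, Seq(0), Seq(1.0)))           // prints Vector(1.0, 0.0)
    println(toDense(2, Seq.empty[Int], Seq.empty[Double])) // prints Vector(0.0, 0.0)
  }
}
```

An entry shown as (2,[],[]) is therefore the all-zero vector, which is how the dropped last category appears.
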
Comments
Contributor

Hello sir,

How can I handle timestamp fields using Spark?

Please, I need help!

Contributor
Something like this will do the trick, assuming the column name is timestamp_s. It creates a new dataframe in which the string column timestamp_s is replaced by a new timestamp column, timestampZ. The format of the incoming string is described by this pattern: "EEE MMM dd HH:mm:ss ZZZZZ yyyy"

import org.apache.spark.sql.functions.unix_timestamp
import org.apache.spark.sql.types.TimestampType

val df_second = df_First.withColumn("timestampZ", unix_timestamp($"timestamp_s", "EEE MMM dd HH:mm:ss ZZZZZ yyyy").cast(TimestampType)).drop($"timestamp_s")
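
To check a pattern like this before running it on a whole dataframe, you can try it on a single string with the JVM's own date parser. This sketch assumes the pattern letters follow java.text.SimpleDateFormat, as they did in Spark 2.x; newer Spark versions use their own pattern rules, so treat this as a rough check only. The object name TimestampPatternSketch and the sample input string are made up for illustration.

```scala
import java.text.SimpleDateFormat
import java.util.Locale

// Plain-Scala sketch: parse one sample string with the pattern from the reply above.
// TimestampPatternSketch and the sample value are illustrative, not from the original thread.
object TimestampPatternSketch {
  val pattern = "EEE MMM dd HH:mm:ss ZZZZZ yyyy"

  // Returns seconds since the Unix epoch for a string matching the pattern.
  def toEpochSeconds(text: String): Long = {
    // Locale.US so that English day and month abbreviations are recognised.
    val format = new SimpleDateFormat(pattern, Locale.US)
    format.parse(text).getTime / 1000L
  }

  def main(args: Array[String]): Unit = {
    println(toEpochSeconds("Mon Jan 05 10:15:30 +0000 2015")) // prints 1420452930
  }
}
```

If the sample string does not match the pattern, parse throws a ParseException, which is a quick way to spot a wrong pattern letter.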