
vnv · ‎05-17-2017

If you haven't already, see the first tutorial I made, which guides you through setting up RapidMiner to read from Hive.

We will pick up where it leaves off.

Add a "set role" operator next to your "retrieve from hive" operator, which is located in the "Radoop Nest".

This lets you choose which column the model should treat as its label, that is, the column it learns to predict.

In this case I set the name field to the category column in my dataset.
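As a rough illustration of what assigning a label role means, here is a minimal plain-Python sketch. It is not how RapidMiner implements the operator; the rows, the column names, and the `set_role` helper are all hypothetical and only stand in for a Hive table with a category column.

```python
# Hypothetical rows standing in for a Hive table; column names are illustrative only.
rows = [
    {"id": 1, "amount": 12.5, "category": "food"},
    {"id": 2, "amount": 80.0, "category": "travel"},
    {"id": 3, "amount": 7.2, "category": "food"},
]


def set_role(rows, label_column):
    """Separate each row into input attributes and the label to be predicted."""
    features = [
        {name: value for name, value in row.items() if name != label_column}
        for row in rows
    ]
    labels = [row[label_column] for row in rows]
    return features, labels


# Treat "category" as the label; every other column stays a regular attribute.
features, labels = set_role(rows, "category")
print(labels)
```

The learner added later only needs to know which column is the target, which is the information the role assignment carries.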

You can obtain this dataset here: data

This is just for illustrative purposes, so if you already have labeled data, feel free to use it instead.

Now add a "split validation" operator and connect the ports.

Then double click the validation operator.

Add a "decision tree" operator in the left (training) pane, add "apply model" and "performance" operators in the right (testing) pane, and connect them all.

For the performance operator, select accuracy or whichever measure you wish to check.
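To make the validation flow concrete, the sketch below mirrors the same steps in plain Python: split the data, train on one part, apply the model to the other, and measure accuracy. It is only a conceptual stand-in under stated assumptions. The toy data, the 0.7 split ratio, the seed, and the one-level tree (`train_stump`) are all hypothetical and are not what RapidMiner or Radoop runs internally.

```python
import random
from collections import Counter


def split_validation(rows, split_ratio=0.7, seed=42):
    """Shuffle the rows and split them into training and testing partitions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(train):
    """Fit a one-level decision tree: the threshold on x with the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in train}):
        left = [y for x, y in train if x <= threshold]
        right = [y for x, y in train if x > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def accuracy(model, test):
    """Fraction of test rows whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in test) / len(test)


# Hypothetical toy data: (numeric attribute, label) pairs.
data = [(v, "low" if v < 50 else "high") for v in range(0, 100, 5)]

train, test = split_validation(data)
model = train_stump(train)      # training side: learn the tree
score = accuracy(model, test)   # testing side: apply the model, then measure performance
print(f"accuracy: {score:.2f}")
```

The key point is that the model never sees the testing rows during training, so the accuracy reflects how well it generalizes.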

If you are using the sample data provided in this tutorial, you will see some errors.

Click the error icon on each operator, select "quick fix", and apply it.

Your panes should look like this:

If you have Spark on your cluster, you can modify this to run on Spark by using the "Spark Decision Tree" operator.

That's how you can set up and train a model in RapidMiner.

Here we used just a decision tree, but there are several other algorithms to choose from.

Data Science using Hive data in Rapidminer