This project is a Scala-based implementation of the data science exploration originally written in Python. In addition to training a model, it can batch-evaluate a file of records against the trained model.
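For orientation, the end-to-end training flow in Spark's Scala API looks roughly like the sketch below. This is a minimal sketch, not the repo's actual code: the input path, feature column names (amount, oldBalance, newBalance), label column (isFraud), model path, and hyperparameters are all placeholder assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.GBTClassifier

object TrainFraudModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FraudGBTTraining").getOrCreate()

    // Hypothetical input path and columns -- substitute the ones used in the repo.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/transactions.csv")

    // Assemble the numeric feature columns into the single "features" vector the model expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("amount", "oldBalance", "newBalance"))
      .setOutputCol("features")
    val prepared = assembler.transform(raw).withColumnRenamed("isFraud", "label")

    // Hold out a portion of the data for testing.
    val Array(train, test) = prepared.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Train a Gradient Boosted Tree classifier.
    val gbt = new GBTClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(20)
    val model = gbt.fit(train)

    // Score the held-out test data and persist the model for later batch evaluation.
    val predictions = model.transform(test)
    predictions.show(10, truncate = false)
    model.write.overwrite().save("models/fraud-gbt")

    spark.stop()
  }
}
```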
When you execute the model training, you'll see various lines of output as the data is cleaned and the model is built. To view this output, follow the link provided in the Spark job console output. It will look like the following:
The original data is split into training data and test data, and the output above shows the results of running the test data through the model.
The "label" column denotes which label (0=legitimate, 1=fraudulent) the row of data truly falls into
The "features" column is all the data that went into training the model, in vectorized format because that is the way the model understands the data
The "probabilities" column denotes how likely the model thinks each of the labels is (first number being 0, second number being 1), and the "prediction" column is what the model thinks the data falls into. You can add additional print statements and re-run the training to explore
When you execute the evaluation portion of this project (instructions are in the GitHub repo), you reload the model from disk and run test data from a file through it to see whether the model is predicting correctly. Note that evaluating with data taken from the training set (as I have here) is bad practice, but I've done it for simplicity. You can go to the Spark UI as above to view the output.
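A minimal sketch of what that evaluation step might look like follows; the model path, input file, feature columns, and evaluation metric are assumptions, and the repo's actual code may differ.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.GBTClassificationModel
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

object EvaluateFraudModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FraudGBTEvaluation").getOrCreate()

    // Reload the model saved during training (path is a placeholder).
    val model = GBTClassificationModel.load("models/fraud-gbt")

    // Read the evaluation file and apply the same feature preparation used at training time.
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/eval-transactions.csv")
    val assembler = new VectorAssembler()
      .setInputCols(Array("amount", "oldBalance", "newBalance"))
      .setOutputCol("features")
    val prepared = assembler.transform(raw).withColumnRenamed("isFraud", "label")

    // Score the records and report the area under the ROC curve.
    val scored = model.transform(prepared)
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .evaluate(scored)
    println(s"Area under ROC: $auc")

    spark.stop()
  }
}
```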
And there you have it: a straightforward approach to building a Gradient Boosted Decision Tree machine learning model based on financial data. The same approach applies not only to finance but to a wide variety of use cases in other industries.