This article is based on the following Kaggle kernel: https://www.kaggle.com/arjunjoshua/predicting-fraud-in-financial-payment-services

It is a Scala-based implementation of the data science exploration written there in Python. In addition to training a model, the project can also batch-evaluate a set of data stored in a file against the trained model.

Full configuration, build, and installation instructions can be found at the GitHub repo: https://github.com/anarasimham/anomaly-detection
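
The repo contains the actual implementation; purely as an orientation sketch (the file path, column names, and parameters below are illustrative assumptions, not the repo's real values), a Spark ML training job of this kind in Scala boils down to loading the transaction data, assembling the numeric columns into a feature vector, and defining a Gradient Boosted Trees classifier:

import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fraud-gbt-training").getOrCreate()

// Illustrative path and column names; the real ones are defined in the repo.
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/paysim/transactions.csv")
  .withColumnRenamed("isFraud", "label")

// Collect the numeric predictors into the single vector column the model trains on.
val assembler = new VectorAssembler()
  .setInputCols(Array("amount", "oldbalanceOrg", "newbalanceOrig",
                      "oldbalanceDest", "newbalanceDest"))
  .setOutputCol("features")

// Gradient Boosted Trees classifier: label 0 = legitimate, 1 = fraudulent.
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(20)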

When you execute the model training, you'll see various lines of output as the data is cleaned and the model is built. To view this output, follow the tracking URL printed in the Spark job's console output, which will look like the following:

18/05/10 14:33:58 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: <HOST_IP>
ApplicationMaster RPC port: 0
queue: default
start time: 1525962717635
final status: SUCCEEDED
tracking URL: http://<SPARK_SERVER_HOST_NAME>:8088/proxy/application_1525369563028_0053/
user: root

The last few lines, which show the trained model's predictions on the test data, look like this:

+-----+-------------------------------------------------------------------------------------------------------------+-----------------------------------------+----------+
|label|features                                                                                                     |probabilities                            |prediction|
+-----+-------------------------------------------------------------------------------------------------------------+-----------------------------------------+----------+
|0    |(10,[0,2,5,8,9],[1.0,1950.77,106511.31,-1950.77,104560.54])                                                  |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |(10,[0,2,5,8,9],[1.0,3942.44,25716.56,-3942.44,21774.120000000003])                                          |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |[1.0,0.0,7276.69,93.0,0.0,1463.0,0.0,0.0,-7183.69,-5813.69]                                                  |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |(10,[0,2,5,8,9],[1.0,13614.91,30195.0,-13614.91,16580.09])                                                   |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |[1.0,0.0,17488.56,14180.0,0.0,182385.22,199873.79,0.0,-3308.5600000000013,-34977.130000000005]               |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |[1.0,0.0,19772.53,0.0,0.0,44486.99,64259.52,0.0,-19772.53,-39545.06]                                         |[0.9777384996414185,0.02226151153445244] |0.0       |
|1    |(10,[0,2,3,7,9],[1.0,20128.0,20128.0,1.0,-20128.0])                                                          |[0.022419333457946777,0.9775806665420532]|1.0       |
|0    |[1.0,0.0,33782.98,0.0,0.0,39134.79,16896.7,0.0,-33782.98,-11544.890000000003]                                |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |[1.0,0.0,34115.82,32043.0,0.0,245.56,34361.39,0.0,-2072.8199999999997,-68231.65]                             |[0.9777384996414185,0.02226151153445244] |0.0       |

The original data is split into training data and test data, and the table above shows the results of running the test data through the model.
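
Continuing the illustrative sketch from above (the names are still assumptions, not the repo's actual code), that split and the test-set scoring that produce a table like the one shown might look as follows:

import org.apache.spark.ml.Pipeline

// Hold out 20% of the rows for testing and train on the remaining 80%.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

val model = new Pipeline().setStages(Array(assembler, gbt)).fit(train)

// Score the held-out rows and print the columns explained below.
// Spark ML's default column is named "probability"; the output above shows it as "probabilities".
model.transform(test)
  .select("label", "features", "probability", "prediction")
  .show(truncate = false)

// Persist the fitted pipeline so the evaluation job can reload it later.
model.write.overwrite().save("hdfs:///models/fraud-gbt")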

  • The "label" column denotes which label (0 = legitimate, 1 = fraudulent) the row of data truly falls into
  • The "features" column holds the row's input data in vectorized format, because that is the form the model understands
  • The "probabilities" column denotes how likely the model thinks each label is (the first number corresponds to label 0, the second to label 1), and the "prediction" column is the label the model assigns to the row. You can add additional print statements and re-run the training to explore further; one example is sketched below
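
For example, still using the assumed names from the sketches above, a BinaryClassificationEvaluator condenses the predictions into a single ROC-AUC figure instead of requiring you to eyeball individual rows (this assumes a Spark version in which GBTClassificationModel emits a rawPrediction column, i.e. 2.2 or later):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Area under the ROC curve over the scored test rows; closer to 1.0 is better.
val predictions = model.transform(test)

val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
  .evaluate(predictions)

println(s"Test AUC = $auc")
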
When you execute the evaluation portion of this project (instructions in the GitHub repo), you reload the model from disk and run test data from a file through it to see whether the model predicts correctly. Note that it is bad practice to evaluate on rows taken from the training set, as I have done here, but I have kept it that way for simplicity. You can go to the Spark UI as above to view the output.
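
A minimal sketch of that evaluation step, assuming the model was saved as a Spark ML pipeline and that the input file shares the training schema (the paths and column names are placeholders, not the repo's actual values):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("fraud-gbt-eval").getOrCreate()

// Reload the fitted pipeline that the training job saved to disk.
val model = PipelineModel.load("hdfs:///models/fraud-gbt")

// The batch of data to score, stored in a file with the same columns as the training set.
val batch = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/paysim/eval-batch.csv")
  .withColumnRenamed("isFraud", "label")

// Run every row through the model and compare the prediction against the known label.
model.transform(batch)
  .select("label", "probability", "prediction")
  .show(truncate = false)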

And there you have it: a straightforward approach to building a Gradient Boosted Decision Tree machine learning model on financial data. The same approach can be applied not only to finance but to a wide variety of use cases in other industries.
