Created on 05-10-2018 02:44 PM
This article is based on the following Kaggle competition: https://www.kaggle.com/arjunjoshua/predicting-fraud-in-financial-payment-services
It is a Scala-based implementation of a data science exploration originally written in Python. In addition to training a model, it can batch-evaluate a set of data stored in a file using the trained model.
Full configuration, build, and installation instructions can be found at the GitHub repo: https://github.com/anarasimham/anomaly-detection
When you run the model training, the job prints several lines of output as the data is cleaned and the model is built. To view this output, open the tracking URL printed in the Spark job's console output, which looks like the following:
18/05/10 14:33:58 INFO Client:
    client token: N/A
    diagnostics: N/A
    ApplicationMaster host: <HOST_IP>
    ApplicationMaster RPC port: 0
    queue: default
    start time: 1525962717635
    final status: SUCCEEDED
    tracking URL: http://<SPARK_SERVER_HOST_NAME>:8088/proxy/application_1525369563028_0053/
    user: root
The last few lines, which show the trained model's predictions, look like this:
+-----+----------------------------------------------------------------------------------------------+-----------------------------------------+----------+
|label|features                                                                                      |probabilities                            |prediction|
+-----+----------------------------------------------------------------------------------------------+-----------------------------------------+----------+
|0    |(10,[0,2,5,8,9],[1.0,1950.77,106511.31,-1950.77,104560.54])                                   |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |(10,[0,2,5,8,9],[1.0,3942.44,25716.56,-3942.44,21774.120000000003])                           |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |[1.0,0.0,7276.69,93.0,0.0,1463.0,0.0,0.0,-7183.69,-5813.69]                                   |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |(10,[0,2,5,8,9],[1.0,13614.91,30195.0,-13614.91,16580.09])                                    |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |[1.0,0.0,17488.56,14180.0,0.0,182385.22,199873.79,0.0,-3308.5600000000013,-34977.130000000005]|[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |[1.0,0.0,19772.53,0.0,0.0,44486.99,64259.52,0.0,-19772.53,-39545.06]                          |[0.9777384996414185,0.02226151153445244] |0.0       |
|1    |(10,[0,2,3,7,9],[1.0,20128.0,20128.0,1.0,-20128.0])                                           |[0.022419333457946777,0.9775806665420532]|1.0       |
|0    |[1.0,0.0,33782.98,0.0,0.0,39134.79,16896.7,0.0,-33782.98,-11544.890000000003]                 |[0.9777384996414185,0.02226151153445244] |0.0       |
|0    |[1.0,0.0,34115.82,32043.0,0.0,245.56,34361.39,0.0,-2072.8199999999997,-68231.65]              |[0.9777384996414185,0.02226151153445244] |0.0       |
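The features column in this output mixes two notations for the same kind of 10-element vector: a dense form that lists every value, and a sparse form written as (size,[indices],[values]) in which every unlisted position is zero. As a minimal sketch of how to read the sparse form, the plain-Scala helper below expands it into a dense array. The names VectorNotation and toDense are hypothetical, introduced only for this illustration; they are not taken from the project's code.

```scala
object VectorNotation {
  // Expand a sparse vector, given as (size, indices, values), into a dense array.
  // Positions that do not appear in `indices` are implicitly 0.0.
  // (Hypothetical helper for illustration; not part of the article's repository.)
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "each index needs exactly one value")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // First row of the table:
    // (10,[0,2,5,8,9],[1.0,1950.77,106511.31,-1950.77,104560.54])
    val dense = toDense(
      10,
      Array(0, 2, 5, 8, 9),
      Array(1.0, 1950.77, 106511.31, -1950.77, 104560.54)
    )
    // Print in the same bracketed style as the dense rows of the table.
    println(dense.mkString("[", ",", "]"))
  }
}
```

Read this way, the sparse and dense rows are directly comparable: both describe ten feature values, and the sparse rows simply omit the zeros.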
The original data is split into training data and test data; the table above shows the results of running the test data through the model.
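Because each row of the output carries both the true label and the model's prediction, one quick way to read it is to tally the four possible outcomes. The sketch below does this in plain Scala for the nine rows shown in the table. PredictionTally, Counts, tally, and excerpt are hypothetical names used only for illustration, and nine rows are far too few to say anything about the model's real performance on the full test set.

```scala
object PredictionTally {
  // Outcome counts for a binary classifier, where label 1 means fraud.
  // (Hypothetical types and names for illustration only.)
  final case class Counts(truePos: Int, trueNeg: Int, falsePos: Int, falseNeg: Int)

  // Each row is (label, prediction), matching the first and last table columns.
  def tally(rows: Seq[(Int, Double)]): Counts = Counts(
    truePos  = rows.count { case (label, pred) => label == 1 && pred == 1.0 },
    trueNeg  = rows.count { case (label, pred) => label == 0 && pred == 0.0 },
    falsePos = rows.count { case (label, pred) => label == 0 && pred == 1.0 },
    falseNeg = rows.count { case (label, pred) => label == 1 && pred == 0.0 }
  )

  // The (label, prediction) pairs copied from the nine rows shown in the table.
  val excerpt: Seq[(Int, Double)] = Seq(
    (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0),
    (0, 0.0), (1, 1.0), (0, 0.0), (0, 0.0)
  )

  def main(args: Array[String]): Unit = {
    println(tally(excerpt))
  }
}
```

Counting outcomes separately, rather than reporting a single accuracy figure, keeps the rare fraud rows visible instead of letting them be swamped by the many legitimate ones.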
And there you have it: a straightforward approach to building a gradient boosted decision tree machine learning model from financial data. The approach is not limited to finance; the same steps can be used to train models for a wide variety of use cases in other industries.