
Using MongoDB and Hortonworks to apply a machine learning algorithm to classify whether or not a system will fail

Explorer

I am a complete newbie to the Hadoop Ecosystem in general, so please be patient with me.

The goal of my project is to 'apply some sort of machine learning algorithm to identify why a system fails'. There are millions of these systems, each producing some 5,000 data points every second. A good way to visualize this is to imagine an Excel spreadsheet with 1 million rows and 5,000 columns, where the first cell of each row is the system name and the next 5,000 cells of the same row are the numbers generated by that system per second. Multiply that by a million systems.
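For illustration, one document in my collection looks roughly like this (the field names system_id, m0..m4999, and failed are simplified placeholders, just to make the layout concrete):

    # Hypothetical shape of one MongoDB document: one system's readings
    # for one second. All field names here are simplified for this post.
    reading = {
        "system_id": "system-000001",  # the first "cell" of the row
        "m0": 0.13,
        "m1": 4.27,
        # ... roughly 5,000 numeric fields per second ...
        "m4999": 7.91,
        "failed": False,  # a label like this is needed for supervised learning
    }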

Now I have that data in MongoDB. After some research, and from what I understood, I can import data from MongoDB into Hortonworks via a connector. I can then remove, clean, and filter some columns, and then apply machine learning algorithms to the result. The output is some sort of visualization of the accuracy of the algorithm. This will be an iterative process between the machine learning and visualization parts until the correct algorithm is found. I am also thinking about the possibility of building a neural network that can use this feedback to adjust its node weights to improve the accuracy of the algorithm.
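To make the pipeline concrete, here is a minimal sketch of what I have in mind in PySpark, assuming the MongoDB Spark connector is available and the simplified document layout from above (the URI, database, and field names are all placeholders):

    # Minimal sketch of the pipeline: read from MongoDB via the MongoDB
    # Spark connector, clean, train a classifier, and measure accuracy.
    # The URI, collection, and field names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = (SparkSession.builder
             .appName("failure-classification")
             .config("spark.mongodb.input.uri",
                     "mongodb://host:27017/mydb.readings")  # placeholder
             .getOrCreate())

    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df = df.dropna()  # the "remove, clean and filter" step, simplified

    # Assemble the ~5,000 per-second values into one feature vector.
    feature_cols = ["m{}".format(i) for i in range(5000)]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    data = (assembler.transform(df)
            .withColumn("label", col("failed").cast("double")))

    train, test = data.randomSplit([0.8, 0.2], seed=42)
    model = RandomForestClassifier(labelCol="label",
                                   featuresCol="features").fit(train)

    # The "visualization of accuracy" step, reduced to a single number here.
    accuracy = MulticlassClassificationEvaluator(
        labelCol="label",
        metricName="accuracy").evaluate(model.transform(test))
    print("test accuracy: {:.3f}".format(accuracy))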

My question is more about the thought process: am I thinking in the right direction? And is Hortonworks the best fit for this? I would like to hear the opinion of the Hortonworks community 🙂

1 ACCEPTED SOLUTION

Master Guru

Apache NiFi can ingest and clean up this MongoDB data and monitor for errors. Machine learning and deep learning flows can be triggered from Apache NiFi via Apache Livy for Apache Spark ML, and also for Apache MXNet and TensorFlow deep learning.
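For the Livy part, triggering a Spark job from NiFi comes down to a REST call against Livy's /batches endpoint. A minimal sketch of that call in Python (the Livy host and the application file are placeholders; inside NiFi an InvokeHTTP or ExecuteSparkInteractive processor can make the equivalent request):

    # Minimal sketch: submit a Spark batch job through Apache Livy's REST API.
    # The Livy URL and the application file path are placeholders.
    import json
    import requests

    livy_url = "http://livy-host:8998/batches"  # placeholder host
    payload = {
        "file": "hdfs:///jobs/failure_classification.py",  # placeholder path
        "name": "failure-classification",
    }
    resp = requests.post(livy_url,
                         data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    print(resp.json())  # returns the batch id and state, e.g. "starting"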

This can also be done via Kafka/S2S and other streaming mechanisms.

https://community.hortonworks.com/content/kbentry/53554/using-apache-nifi-100-with-mongodb.html

https://community.hortonworks.com/content/kbentry/146198/data-flow-enrichment-with-nifi-part-3-looku...

https://community.hortonworks.com/articles/148730/integrating-apache-spark-2x-jobs-with-apache-nifi....


2 REPLIES

Super Collaborator

Hi Victor,

You don't really expect an answer to 'Is Hortonworks the best for this?' in a Hortonworks forum, do you? Of course it is the best 🙂

But to the main topic: I am not sure I got the task or project right. Are you trying to predict that a system will fail in the near future, or are you looking at systems that have already failed and trying to identify/categorize what happened before the failure?

The first task is something you could implement with ML procedures/algorithms. For the latter I am not so sure; to me it sounds more like a data mining task.
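To make the difference concrete: the predictive framing means deriving the label from the future of each system's time series, something like the following (purely illustrative; the column names, toy data, and look-ahead horizon are all assumptions):

    # Illustrative only: label a reading 1 if the same system fails within
    # the next `horizon` rows (seconds), else 0 - the predictive framing.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("labeling-sketch").getOrCreate()

    # Toy stand-in for the real per-second readings.
    df = spark.createDataFrame(
        [("sys-1", 1, 0), ("sys-1", 2, 0), ("sys-1", 3, 1),
         ("sys-2", 1, 0), ("sys-2", 2, 0), ("sys-2", 3, 0)],
        ["system_id", "timestamp", "failed"])

    horizon = 2  # look-ahead window; an assumption
    w = (Window.partitionBy("system_id").orderBy("timestamp")
         .rowsBetween(1, horizon))

    labeled = df.withColumn(
        "label", F.coalesce(F.max("failed").over(w), F.lit(0)))
    labeled.show()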

Regards
Harald
