Created on 08-23-2018 02:08 PM - edited 09-16-2022 01:43 AM
Introduction
A few weeks ago, I published an article called Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin. Since then, I have worked diligently to improve on that first iteration, with the objective of mocking up what is at the heart of every company's data ambitions today: an end-to-end platform that uses machine learning not only to generate insights, but to keep improving itself and feeding consumer applications.
While doing this work was a way for me to get into the nitty-gritty of the latest Hortonworks tools, I decided to share it with the world as a series of tutorial articles, because I believe it is a great way to get familiar with the stack.
Architecture overview
Luckily, the Hortonworks platform has all the elements needed to create this end-to-end platform. The figure below gives an overview of the architecture used throughout this series of articles:
As you can see, the goal of this platform is to:
- Ingest data from news articles (directly from the NYT API at first, then from other RSS feeds)
- Using NiFi and SAM, read the metadata of the extracted articles, scrape their content, run personality recognition on their authors, then expose the results via Kafka for Druid consumption, while pushing directly to HBase/Phoenix for offline analytics and "micro" services for consumer applications (a minimal ingestion sketch follows this list)
- Generate real-time insights on this computed data via Druid and Superset
- Enable analytics & model training on the data stored in HBase using Zeppelin notebooks & Spark, the results of which then feed back into the personality recognition models (see the Spark sketch after this list)
- Enable custom applications to consume the data extracted and analyzed
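To make the ingestion step more concrete, here is a minimal Python sketch of what the NiFi flow does: pull article metadata from the NYT Article Search API and publish it to a Kafka topic. The topic name, broker address, and selected fields are illustrative assumptions, not the platform's actual configuration.

```python
# Sketch of steps 1-2: fetch NYT article metadata and publish to Kafka.
# In the actual platform, NiFi processors (e.g. InvokeHTTP, PublishKafka)
# perform this flow without custom code; this is for illustration only.
import json

import requests
from kafka import KafkaProducer  # pip install kafka-python

NYT_API_KEY = "YOUR_NYT_API_KEY"  # assumption: supply your own key
NYT_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumption: local Kafka broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

resp = requests.get(NYT_URL, params={"q": "technology", "api-key": NYT_API_KEY})
resp.raise_for_status()

# Each doc carries the metadata the pipeline needs downstream:
# the article URL (for scraping) and the byline (for author attribution).
for doc in resp.json()["response"]["docs"]:
    producer.send("articles-meta", {  # assumption: hypothetical topic name
        "url": doc.get("web_url"),
        "headline": doc.get("headline", {}).get("main"),
        "byline": doc.get("byline", {}).get("original"),
    })

producer.flush()
```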
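Similarly, the analytics and model-training step can be pictured as a Zeppelin note that loads the scored data from Phoenix into a Spark DataFrame. This is a sketch under assumptions: the table name, column names, and ZooKeeper URL are hypothetical, and the phoenix-spark connector must be on the Spark classpath.

```python
# Sketch of step 4: load scored articles from Phoenix into Spark for
# analytics and model retraining inside a Zeppelin note.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("personality-training").getOrCreate()

traits_df = (
    spark.read.format("org.apache.phoenix.spark")
    .option("table", "ARTICLE_TRAITS")  # assumption: hypothetical table name
    .option("zkUrl", "localhost:2181")  # assumption: local ZooKeeper quorum
    .load()
)

# The big-5 trait columns (hypothetical names) become the features used
# to retrain the personality recognition models.
traits_df.select("AUTHOR", "OPENNESS", "CONSCIENTIOUSNESS").show(5)
```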
Agenda
This series of articles will be composed of 4 parts: