Community Articles

Find and share helpful community-sourced technical articles.
Welcome to the upgraded Community! Read this blog to see What’s New!
Cloudera Employee


A few weeks ago, I published an article called Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin. Since then, I have worked diligently to improve on this first iteration, with the objective to mock up at the heart of every company today: create an end-to-end platform that uses machine learning to not only generate insights, but keeps improving and feeding consumer applications.

While doing this work was a way for me to get into the needy-greedy of the latest Hortonworks tools, I decided to share it with the world in the form of a series of tutorial articles, because I believe it is a great way to get familiar with the stack.

Architecture overview

Luckily, the Hortonworks platform has all the elements needed to create this end-to-end platform. The figure below gives an overview of this series of articles architecture:


As you can see, the goal of this platform is to:

  1. Ingest data from news articles (directly from the NYT API at first, then from other RSS feeds)
  2. Using Nifi and SAM, read the meta-data of the extracted articles, scrape their content, run personality recognition on their authors, then expose the result via Kafka for Druid consumption, directly pushing to HBase/Phoenix for offline analytics and "micro" services for consumer applications
  3. Generate real time insights on this computed data via Druid and Superset
  4. Enable Analytics & model training on the data stored in HBase using Zeppelin notebooks & Spark, that would then feed back the personality recognition modes
  5. Enable custom application to consumer the data extracted and analyzed


This series of article will be composed of 4 parts:

Version history
Last update:
‎09-16-2022 01:43 AM
Updated by: