Created on 08-23-2018 02:08 PM - edited 09-16-2022 01:43 AM
Introduction
A few weeks ago, I published an article called Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin. Since then, I have worked diligently to improve on that first iteration, with the objective of mocking up what is at the heart of every company's data ambitions today: an end-to-end platform that uses machine learning not only to generate insights, but to keep improving itself and feeding consumer applications.
While doing this work was a way for me to get into the nitty-gritty of the latest Hortonworks tools, I decided to share it with the world as a series of tutorial articles, because I believe it is a great way to get familiar with the stack.
Architecture overview
Luckily, the Hortonworks platform has all the elements needed to create this end-to-end platform. The figure below gives an overview of the architecture used throughout this series of articles:
As you can see, the goal of this platform is to:
- Ingest data from news articles (directly from the NYT API at first, then from other RSS feeds)
- Using NiFi and SAM, read the metadata of the extracted articles, scrape their content, run personality recognition on their authors, then expose the results via Kafka for Druid consumption, while pushing directly to HBase/Phoenix for offline analytics and "micro" services for consumer applications (a minimal ingestion sketch follows this list)
- Generate real-time insights on this computed data via Druid and Superset
- Enable analytics & model training on the data stored in HBase using Zeppelin notebooks & Spark, the results of which then feed back into the personality recognition models (see the Spark sketch after this list)
- Enable custom applications to consume the data extracted and analyzed
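To make the ingestion step more concrete, here is a minimal Python sketch of what the NiFi flow does: pull article metadata from the NYT Article Search API and publish it to a Kafka topic. The topic name, broker address, and selected fields are illustrative assumptions, not the platform's actual configuration.

```python
# Sketch of steps 1-2: fetch NYT article metadata and publish to Kafka.
# In the actual platform, NiFi processors (e.g. InvokeHTTP, PublishKafka)
# perform this flow without custom code; this is for illustration only.
import json

import requests
from kafka import KafkaProducer  # pip install kafka-python

NYT_API_KEY = "YOUR_NYT_API_KEY"  # assumption: supply your own key
NYT_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumption: local Kafka broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

resp = requests.get(NYT_URL, params={"q": "technology", "api-key": NYT_API_KEY})
resp.raise_for_status()

# Each doc carries the metadata the pipeline needs downstream:
# the article URL (for scraping) and the byline (for author attribution).
for doc in resp.json()["response"]["docs"]:
    producer.send("articles-meta", {  # assumption: hypothetical topic name
        "url": doc.get("web_url"),
        "headline": doc.get("headline", {}).get("main"),
        "byline": doc.get("byline", {}).get("original"),
    })

producer.flush()
```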
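Similarly, the analytics and model-training step can be pictured as a Zeppelin note that loads the scored data from Phoenix into a Spark DataFrame. This is a sketch under assumptions: the table name, column names, and ZooKeeper URL are hypothetical, and the phoenix-spark connector must be on the Spark classpath.

```python
# Sketch of step 4: load scored articles from Phoenix into Spark for
# analytics and model retraining inside a Zeppelin note.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("personality-training").getOrCreate()

traits_df = (
    spark.read.format("org.apache.phoenix.spark")
    .option("table", "ARTICLE_TRAITS")  # assumption: hypothetical table name
    .option("zkUrl", "localhost:2181")  # assumption: local ZooKeeper quorum
    .load()
)

# The big-5 trait columns (hypothetical names) become the features used
# to retrain the personality recognition models.
traits_df.select("AUTHOR", "OPENNESS", "CONSCIENTIOUSNESS").show(5)
```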
Agenda
This series of articles will be composed of 4 parts: