

Time to continue my series on news author personality detection! This tutorial looks at data science with Zeppelin notebooks, with the idea of predicting the popularity of an article based on the personality of its author. It assumes that you have completed part 1 and part 2 of the series. After completing this part, your architecture will look like this:


This tutorial is divided into four sections:

  • Section 1: Update Nifi to send article data to Phoenix
  • Section 2: Read Phoenix data with Spark in Zeppelin Note
  • Section 3: Train Linear Regression Model
  • Section 4: Make and store article popularity predictions

Note: This exercise is a very simplified way to apply machine learning. I spent very little time selecting the appropriate algorithm, accumulating enough data, or even challenging whether or not the personality of an author is a viable predictor of article popularity. That is a full-time job; it's called being a data scientist! But hopefully this will enable the real data scientists to please stand up and build useful stuff using the Hortonworks stack.

Section 1: Update Nifi to send article data to Phoenix

This section's goal is first to enhance our previous Nifi flow to upsert data into HBase using Phoenix, and then to add a new flow collecting data from the New York Times "most popular" API. You can find all these flows on my github here. I used Nifi Registry to push directly to github; if you want to learn how to do that, check out this article: How to configure git for Nifi Registry in HDF 3.2.

Step 1: Create the appropriate tables using Phoenix

Connect to your server and run phoenix-sqlline (with the appropriate user):

$ sudo su hbase
$ phoenix-sqlline

Create three tables:

  • One to store author personality evaluation per article
  • One to store article popularity
  • One to store popularity prediction based on author personality

Here is the SQL create code; each table is keyed on the article's web_url:

CREATE TABLE article_evaluation (
    web_url VARCHAR NOT NULL PRIMARY KEY,
    snippet VARCHAR,
    byline  VARCHAR,
    pub_date TIMESTAMP,
    headline VARCHAR,
    document_type  VARCHAR,
    news_desk  VARCHAR,
    filename  VARCHAR,
    last_updated VARCHAR,
    extraversion DOUBLE,
    emotional_stability DOUBLE,
    agreeableness DOUBLE,
    conscientiousness DOUBLE,
    openness_to_experience DOUBLE
);

CREATE TABLE article_stats (
    web_url VARCHAR NOT NULL PRIMARY KEY,
    views TINYINT,
    last_updated VARCHAR
);

CREATE TABLE article_popularity_prediction (
    web_url VARCHAR NOT NULL PRIMARY KEY,
    extraversion DOUBLE,
    emotional_stability DOUBLE,
    agreeableness DOUBLE,
    conscientiousness DOUBLE,
    openness_to_experience DOUBLE,
    prediction DOUBLE
);
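Optionally, before wiring up Nifi, you can confirm that the three tables exist from the same phoenix-sqlline session (this verification step is mine, not part of the original flow):

!tables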

Step 2: Update your existing flow to upsert to Phoenix

Our goal is to add the following processors to our previous flow (established in part 1 of this series):


UpdateAttribute: Escape characters

Create a processor and configure it to create/update 3 attributes:

  • filename: filename used to store the article text in an HDFS file (optional and not covered in this tutorial, but the field is needed for the upsert)
  • headline: article headline
  • snippet: article snippet


Here is the code for each attribute:
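The original screenshots of these expressions are not reproduced here. Their purpose is to escape or strip characters (single quotes in particular) that would otherwise break the SQL upsert built in the next processor. A sketch in Nifi Expression Language, assuming the headline and snippet attributes were already extracted from the article JSON in part 1, could look like this:

headline : ${headline:replace("'", "''")}
snippet  : ${snippet:replace("'", "''")}
filename : ${headline:replace("'", ""):replace(" ", "_")}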







ReplaceText: Create Upsert query

This one is straightforward, as depicted below:


Your replacement value should be:

upsert into article_evaluation values('${web_url}','${snippet}','${byline}','${pub_date}','${headline}','${document_type}','${news_desk}','${filename}','${now()}',${mairessepersonalityrecognition.extraversion},${mairessepersonalityrecognition.emotional_stability},${mairessepersonalityrecognition.agreeableness},${mairessepersonalityrecognition.conscientiousness},${mairessepersonalityrecognition.openness_to_experience})

PutSQL: Run Query

This processor is straightforward too, but requires you to first create a DBCPConnectionPool controller service:


Assuming the Nifi node has a Phoenix client installed, here are the parameters you will need to set up:

  • Database Connection URL
  • Database Driver Class Name
  • Database Driver Location(s)
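The screenshots with the exact values are not included here; for a default HDP install with ZooKeeper running locally, typical values would be along these lines (adjust host, port and znode to your cluster):

Database Connection URL     : jdbc:phoenix:localhost:2181:/hbase-unsecure
Database Driver Class Name  : org.apache.phoenix.jdbc.PhoenixDriver
Database Driver Location(s) : /usr/hdp/current/phoenix-client/phoenix-client.jar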


Finally, configure a PutSQL as depicted below. You will note that I turned off the support for fragmented transactions. This is not important here but will be in the next flow, where we split the output of the API.
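The PutSQL screenshot is likewise missing; the two properties that matter for this tutorial are roughly the following (the pool value is whatever you named the controller service above):

JDBC Connection Pool            : the DBCPConnectionPool service created above
Support Fragmented Transactions : false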


Step 3: Create a new flow for article popularity

The flow itself is pretty self-explanatory, as shown below:


InvokeHTTP: Call Most Popular NYT API


Remote URL should be the NYT "most popular" API endpoint, with [YOUR_API_KEY] replaced by your own key (to get an API key, see part 1).

SplitJson: Split Response into 10 flow files


JsonPath Expression should be:
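The screenshot with the expression is missing here; given that the NYT "most popular" response wraps its articles in a results array, an expression along these lines splits it into one flow file per article:

$.results[*]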


EvaluateJsonPath: Extract Relevant Attributes


Extract url from JSON:
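The screenshot is not reproduced; assuming each split flow file carries one raw NYT item, a path such as the following would do it:

$.url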


Extract views from JSON:
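And similarly for the view count:

$.views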


ReplaceText: Create Phoenix Query


Replacement Value:

upsert into article_stats values('${url}',${views},'${now()}')

PutSQL: Run Phoenix Upsert SQL


This is the exact same processor as in your other flow :)

Section 2: Read Phoenix data with Spark in Zeppelin Note

Step 1: Download the latest phoenix Spark 2 Jar

Go to the maven repository and get the latest jar for phoenix and spark2 (see repository here).

Once downloaded, add it to the Spark 2 libraries, under the following folder: /usr/hdp/current/spark2-client/jars.

In this directory, also copy the Phoenix client jar from /usr/hdp/current/phoenix-client/ if it is not present yet.

$ cd /usr/hdp/current/spark2-client/jars
$ wget
$ cp /usr/hdp/current/phoenix-client/phoenix- .
$ ls -ltra phoenix*
-rw-r--r--. 1 root root 222910655 Oct  2 17:35 phoenix-
-rwxr-xr-x. 1 root root     83908 Oct  2 17:35 phoenix-spark2-

Restart spark from Ambari.

Step 2: Create a new note to retrieve data from Phoenix

Open Zeppelin and create a new note (I named mine Linear Regression). Using pyspark, retrieve data from the article_evaluation and article_stats tables into dataframes, and join them on web_url:

from pyspark.sql.functions import lit

article_evaluation_table = spark.read.format("org.apache.phoenix.spark").option("table", "article_evaluation").option("zkUrl", "localhost:2181").load()
article_stats_table = spark.read.format("org.apache.phoenix.spark").option("table", "article_stats").option("zkUrl", "localhost:2181").load()
aet = article_evaluation_table.alias('aet')
ast = article_stats_table.alias('ast')
inner_join = aet.join(ast, aet.WEB_URL == ast.WEB_URL)
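Later paragraphs in the note reference an article_df (joined articles with their view counts) and a no_views_df (the evaluated articles without a VIEWS column) that are built from these dataframes; those cells are not reproduced above. A minimal sketch of how they could be derived, assuming the Phoenix column names created in section 1:

# Hypothetical reconstruction; the original note builds these frames in cells not shown here.
feature_cols = ['EXTRAVERSION', 'OPENNESS_TO_EXPERIENCE', 'CONSCIENTIOUSNESS', 'AGREEABLENESS', 'EMOTIONAL_STABILITY']

# Articles that already have stats: personality features plus the VIEWS label, used for training
article_df = inner_join.select([aet[c] for c in feature_cols] + [ast.VIEWS])

# All evaluated articles (no VIEWS column), which we will score in section 4
no_views_df = article_evaluation_table.select(['WEB_URL'] + feature_cols)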

Section 3: Train Linear Regression Model

In the same note, create a paragraph that creates a vectorized dataframe for your model to use:

from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(inputCols = ['EXTRAVERSION', 'OPENNESS_TO_EXPERIENCE', 'CONSCIENTIOUSNESS', 'AGREEABLENESS', 'EMOTIONAL_STABILITY'], outputCol = 'features')
varticle_df = vectorAssembler.transform(article_df)
varticle_df = varticle_df.select(['features', 'VIEWS'])
vno_views_df = vectorAssembler.transform(no_views_df)
train_df = varticle_df

Use the dataframe created to train a linear regression model:

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol = 'features', labelCol = 'VIEWS', maxIter = 10, regParam = 0.3, elasticNetParam = 0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Check the summary of your model:

trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

Section 4: Make and store article popularity predictions

Run your model on all articles in the article evaluation table:

predictions = lr_model.transform(vno_views_df)

Upload these results to the article_popularity_prediction table using Phoenix:

df_to_write.write.format("org.apache.phoenix.spark").mode("overwrite").option("table", "article_popularity_prediction").option("zkUrl", "localhost:2181").save()
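The cell that builds df_to_write from predictions is not shown in the note excerpt above; a plausible construction, assuming the dataframes sketched earlier and Phoenix's upper-case handling of unquoted column names, is:

# Hypothetical reconstruction; the original note builds df_to_write in a cell not shown here.
from pyspark.sql.functions import col

df_to_write = predictions.select(
    'WEB_URL',
    'EXTRAVERSION', 'EMOTIONAL_STABILITY', 'AGREEABLENESS', 'CONSCIENTIOUSNESS', 'OPENNESS_TO_EXPERIENCE',
    col('prediction').alias('PREDICTION'))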

Finally, if you want to check your predictions versus the actual views, create a jdbc(phoenix) paragraph, as follows:

select  article_popularity_prediction.web_url as URL,
        article_popularity_prediction.prediction as predicted_views,
        article_stats.views as actual_views
from    article_popularity_prediction, article_stats
where   article_popularity_prediction.web_url = article_stats.web_url

As you can see, the model is far from perfect. I would also argue that exporting the model as PMML and calling it from Nifi as data flows in would have been a better fit for our architecture, instead of loading results offline as we do here. Again, this is a full-time job 🙂 See you in the next article, where we will create a dashboard to leverage the data science and data engineering we just put in place!


Note: for reference, I included my Zeppelin note here: linear-regression-zeppelin-note.json