Created on 08-01-2018 07:20 PM - edited 09-16-2022 01:43 AM
There is a reason why I spent my life studying and working in computer science: understanding a computer's psychology is usually fairly straightforward. Indeed, when presented with a specific input, computer programs tend to respond in a very predictable way, unlike our fellow human beings. Of course, this observation goes out of the window as our algorithms become increasingly complex and capable of learning.
Regardless, as much as I love computer science, I have always had a keen interest in the human sciences. Personality psychology is a fascinating subject that has seen its ups and downs, like any scientific field. At the center of personality psychology reside the big five personality traits: extraversion, emotional stability, agreeableness, conscientiousness, and openness to experience.
This taxonomy was determined by applying statistical models to personality surveys, essentially clustering the results of surveys in which people describe fellow human beings. As such, these traits are meant to categorize common aspects of personality across human beings without moral connotation. The validity of the model and its ability to predict real-life outcomes are of course controversial, and I wouldn't do the debate justice here (I have most likely already irritated any personality psychologist who read these first few lines).
Recently, multiple machine learning algorithms designed to determine these five personality traits from text have surfaced, including IBM Watson Personality Insights. For this article I chose to use the personality recognizer written by Francois Mairesse, and to automate personality detection of New York Times articles using HDF 3.1 and HDP 3.0.
The solution put in place uses 3 main elements:
The figure below gives an overview of the solution flow:
More precisely, the solution can be broken down into five main steps, which I describe in detail below:
This step is very straightforward. Go to https://developer.nytimes.com/signup and sign up for a key:
Note: The New York Times API is for non-commercial use only. I could of course have used any news API, but I'm not creative.
The InvokeHTTP processor is used here with all default parameters, except for the URL. Here are some key configuration items and a screenshot of the processor configuration:
The API call parameter page=0 returns results 0-9; for this exercise, I'm only interested in the latest article, so I set up an EvaluateJsonPath processor to take care of that, as you can see below:
A few important points here:
This step is definitely not optimized. The point here is to escape the special characters to avoid errors when inserting into Hive. The only thing I am doing here is removing the ' character from the snippet, as you can see, but it deserves a second pass, I think:
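For that second pass, one option is to escape the problematic characters instead of dropping them. The sketch below is a hedged illustration in shell, not what the flow currently does: the function name `escape_hive_string` is made up, and it assumes the value ends up inside a single-quoted HiveQL literal, where Hive accepts C-style backslash escapes.

```shell
#!/bin/sh
# Hypothetical helper: make a value safe to embed in a single-quoted
# HiveQL string literal (assumes Hive's C-style backslash escaping).
escape_hive_string() {
  # Escape backslashes first, then single quotes, so the backslashes
  # added for the quotes are not escaped a second time.
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/\\\\'/g"
}

escape_hive_string "The paper's editor"   # prints: The paper\'s editor
```

In NiFi itself the same idea would have to be expressed in the processor doing the replacement; the shell version only shows the order of the two substitutions.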
Once we have retrieved the metadata of the article, we must obtain the actual text of the article. For this, I'm using boilerpipe, an open source library for boilerplate removal and full-text extraction from HTML pages (see reference for details).
After downloading the boilerpipe jars (using http://www.java2s.com/Code/Jar/b/Downloadboilerpipe120jar.htm), use your favorite Java IDE and create this simple class:
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

import java.net.MalformedURLException;
import java.net.URL;

public class extractArticle {

    public static void main(String[] args) throws MalformedURLException, BoilerpipeProcessingException {
        if (args.length == 1) {
            // Fetch the page and print only the main article text
            URL url = new URL(args[0]);
            String text = ArticleExtractor.INSTANCE.getText(url);
            System.out.println(text);
        } else {
            System.out.println("Please Specify URL");
        }
    }
}
Once tested, create an executable jar (in my case extractArticle.jar).
Connect to your NiFi server as your nifi user and create the following directory structure:
$ cd /home/nifi
$ mkdir extractArticle
$ cd extractArticle
$ mkdir lib
Transfer the following libraries to ~/extractArticle/lib/ :
Under ~/extractArticle/ create the script extract_article.sh as follows:
#!/bin/bash
JDK_PATH=/usr
LIB1=./lib/xerces-2.9.1.jar
LIB2=./lib/nekohtml-1.9.13.jar
LIB3=./lib/boilerpipe-sources-1.2.0.jar
LIB4=./lib/boilerpipe-javadoc-1.2.0.jar
LIB5=./lib/boilerpipe-demo-1.2.0.jar
LIB6=./lib/boilerpipe-1.2.0.jar
LIB7=./lib/extractArticle.jar
LIBS=$LIB1:$LIB2:$LIB3:$LIB4:$LIB5:$LIB6:$LIB7
# "$@" passes the URL through as a single, unmodified argument
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS extractArticle "$@"
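The script above lists each jar by hand. As an alternative, the classpath can be assembled from whatever jars sit in the lib directory. This is only a sketch: the helper name `build_classpath` is hypothetical. Also note that `-sources` and `-javadoc` jars normally contain no compiled classes, so they are probably not needed at run time; treat that as an assumption to check against your own download.

```shell
#!/bin/sh
# Hypothetical helper: join every .jar in a directory into a
# colon-separated classpath string.
build_classpath() {
  dir=$1
  classpath=""
  for jar in "$dir"/*.jar; do
    # Skip the unexpanded pattern when the directory holds no jars.
    [ -e "$jar" ] || continue
    classpath="${classpath:+$classpath:}$jar"
  done
  printf '%s\n' "$classpath"
}

# Print the classpath for the ./lib directory used by extract_article.sh
build_classpath ./lib
```

With this helper in the script, the last line could become `$JDK_PATH/bin/java -Xmx512m -classpath "$(build_classpath ./lib)" extractArticle "$@"`, so adding or upgrading a jar no longer requires editing the script.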
Configure the processor to pass the URL as an argument and send the output stream to the next processor, as follows:
Just as for boilerpipe, we're going to use an ExecuteStreamCommand processor. To prepare the files, run the following commands:
$ cd /home/nifi
$ wget http://farm2.user.srcf.net/research/personality/recognizer-1.0.3.tar.gz
$ tar -xvf recognizer-1.0.3.tar.gz
$ cd PersonalityRecognizer
$ mkdir texts
Modify the file PersonalityRecognizer.properties as follows:
##################################################
# Configuration File of the Personality Recognizer
##################################################

# All variables should be modified according to your
# directory structure
# Warning: for Windows paths, backslashes need to be
# doubled, e.g. c:\Program Files\Recognizer

# Root directory of the application
appDir = /home/nifi/PersonalityRecognizer

# Path to the LIWC dictionary file (LIWC.CAT)
liwcCatFile = ./lib/LIWC.CAT

# Path to the MRC Psycholinguistic Database file (mrc2.dct)
mrcPath = ./ext/mrc2.dct
Modify the script PersonalityRecognizer as follows:
#! /bin/bash -

# ENVIRONMENT VARIABLES
JDK_PATH=/usr
WEKA=./ext/weka-3-4/weka.jar

# ----------------------------------

COMMONS_CLI=./lib/commons-cli-1.0.jar
MRC=./lib/jmrc.jar
LIBS=.:$WEKA:$COMMONS_CLI:$MRC:bin/

$JDK_PATH/bin/java -Xmx512m -classpath $LIBS recognizer.PersonalityRecognizer $*
Finally, create a wrapper script that runs PersonalityRecognizer on the latest file in the texts folder and outputs only the results, in JSON format:
#!/bin/bash
# Pick the most recently written article text
text=$(ls -t texts/ | head -1)
./PersonalityRecognizer -i "./texts/$text" > tmp.txt

# Extract each trait score from the recognizer output
extraversion=$(grep extraversion tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
emotional_stability=$(grep emotional tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
agreeableness=$(grep agreeableness tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
conscientiousness=$(grep conscientiousness tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
openness_to_experience=$(grep openness tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')

# Emit the scores as a single JSON object; $1 is the article URL
json_output="{\"web_url\" : \"$1\", \"extraversion\" : \"$extraversion\",\"emotional_stability\" : \"$emotional_stability\",\"agreeableness\" : \"$agreeableness\",\"conscientiousness\" : \"$conscientiousness\",\"openness_to_experience\" : \"$openness_to_experience\"}"
echo $json_output

# Clean up so the next run only sees its own article
rm tmp.txt texts/*
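The wrapper above repeats the same grep pipeline once per trait. A hedged way to factor that out is a small function that takes a keyword and the recognizer output and returns the first number on the first matching line. The function name `extract_score` is hypothetical, and the sample text is made up: the real recognizer output may be laid out differently.

```shell
#!/bin/sh
# Hypothetical helper: return the first number found on the first line
# of the given text that contains the keyword.
extract_score() {
  keyword=$1
  output=$2
  printf '%s\n' "$output" \
    | grep -- "$keyword" \
    | head -n 1 \
    | grep -Eo '[+-]?[0-9]+([.][0-9]+)?' \
    | head -n 1
}

# Made-up sample; not the recognizer's actual output format
sample='extraversion: 4.25
emotional stability: 3.10'

extract_score extraversion "$sample"   # prints: 4.25
extract_score emotional "$sample"      # prints: 3.10
```

Taking only the first match guards against a line that happens to contain more than one number, which would otherwise put several values into a single JSON field.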
This processor takes the output stream of the HTML scraping to create a file, under the appropriate folder, as shown below:
Just as for HTML scraping, configure the processor to pass the URL as an argument and send the output stream to the next processor, as follows:
Using EvaluateJsonPath, extract the results of the PersonalityRecognizer into attributes:
Because we don't control whether we receive the same article twice from the New York Times API, we need to make sure that we don't insert the same data twice into Hive (i.e. upsert data into Hive). An upsert can be implemented with two tables and the MERGE command.
Therefore, connect to your Hive server and create one database and two tables as follows:
CREATE DATABASE personality_detection;

use personality_detection;

CREATE TABLE text_evaluation (
  web_url String,
  snippet String,
  byline String,
  pub_date date,
  headline String,
  document_type String,
  news_desk String,
  last_updated String,
  extraversion decimal(10,4),
  emotional_stability decimal(10,4),
  agreeableness decimal(10,4),
  conscientiousness decimal(10,4),
  openness_to_experience decimal(10,4)
)
clustered by (web_url) into 2 buckets
stored as orc
tblproperties("transactional"="true");

CREATE TABLE all_updates (
  web_url String,
  snippet String,
  byline String,
  pub_date date,
  headline String,
  document_type String,
  news_desk String,
  last_updated String,
  extraversion decimal(10,4),
  emotional_stability decimal(10,4),
  agreeableness decimal(10,4),
  conscientiousness decimal(10,4),
  openness_to_experience decimal(10,4)
)
STORED AS ORC
tblproperties ("orc.compress" = "SNAPPY");
Using a ReplaceText processor, create the appropriate HiveQL command to be executed to upsert data into your tables from the data collected in the flow.
Code for Replacement Value (note that I remove the timestamp from the pub_date here, because I'm storing it as a date):
use personality_detection;

insert into all_updates values('${web_url}','${snippet}','${byline}','${pub_date:substring(0,10)}','${headline}','${document_type}','${news_desk}','${now()}','${extraversion}','${emotional_stability}','${agreeableness}','${conscientiousness}','${openness_to_experience}');

merge into text_evaluation
using (
  select distinct web_url, snippet, byline, pub_date, headline, document_type, news_desk, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience
  from all_updates
) all_updates
on text_evaluation.web_url = all_updates.web_url
when matched then update set
  snippet=all_updates.snippet,
  byline=all_updates.byline,
  pub_date=all_updates.pub_date,
  headline=all_updates.headline,
  document_type=all_updates.document_type,
  news_desk=all_updates.news_desk,
  last_updated=from_unixtime(unix_timestamp()),
  extraversion=all_updates.extraversion,
  emotional_stability=all_updates.emotional_stability,
  agreeableness=all_updates.agreeableness,
  conscientiousness=all_updates.conscientiousness,
  openness_to_experience=all_updates.openness_to_experience
when not matched then insert values(all_updates.web_url,all_updates.snippet, all_updates.byline, all_updates.pub_date, all_updates.headline, all_updates.document_type, all_updates.news_desk, from_unixtime(unix_timestamp()), all_updates.extraversion, all_updates.emotional_stability, all_updates.agreeableness, all_updates.conscientiousness, all_updates.openness_to_experience);

truncate table all_updates;
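The `${pub_date:substring(0,10)}` expression in the replacement value keeps the first ten characters of the publication timestamp, which is the date part whenever the value starts with `YYYY-MM-DD`. The same trimming can be sketched in shell to check it outside NiFi; the function name `to_date` and the sample timestamp are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep the first 10 characters of a timestamp,
# mirroring substring(0,10) in the NiFi expression.
to_date() {
  printf '%s\n' "$1" | cut -c1-10
}

# Made-up ISO 8601 style timestamp
to_date "2018-08-01T19:20:00+0000"   # prints: 2018-08-01
```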
Processor Overview:
Finally, configure a simple PutHiveQL processor as follows (make sure you have configured your HiveConnectionPool beforehand):
Lastly, after running the NiFi flow for a while, create a simple Zeppelin notebook to show your results. This notebook uses the jdbc interpreter for Hive and runs the following query:
%jdbc(hive)
select byline, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience
from personality_detection.text_evaluation
limit 10
Then, you can play with Zeppelin visualizations to display the average of the big 5 by byline:
While very simple, this exercise is a good starting point for on-the-wire personality recognition. More importantly, in an age of information overload or even misinformation, the ability to classify the psychology of a text on the fly can be extremely useful. I do plan on tinkering with this project, improving performance, optimizing models and ingesting more data, so stay tuned!