Cloudera Employee


Overview

This article walks through building a real-time data visualization that analyzes Twitter feeds using Cloudera Data Platform (CDP).

Design

 

 

Explanation:

  • The NiFi flow calls the Twitter API v2 every 15 seconds and stages all tweets in an AWS S3 bucket.
  • Hive external tables point to the staging location (i.e., the AWS S3 bucket).
  • Data Visualization uses the Hive external tables as its data source to create visuals. All visuals are refreshed every 20 seconds.
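
To make the first step concrete, here is a minimal Python sketch of the kind of request a single poll might issue. It is not taken from the NiFi flow definition: it assumes the flow queries the Twitter API v2 recent search endpoint, and the function name `build_search_request` and the `max_results` default are illustrative choices. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the flow uses the v2 "recent search" endpoint.
RECENT_SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"
# Polling interval stated in the design (not used by this sketch).
POLL_INTERVAL_SECONDS = 15


def build_search_request(search_term: str, bearer_token: str,
                         max_results: int = 10) -> Request:
    """Build (but do not send) one recent-search request for a search term."""
    query = urlencode({"query": search_term, "max_results": max_results})
    return Request(
        f"{RECENT_SEARCH_URL}?{query}",
        # Twitter API v2 app-only authentication uses a bearer token header.
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


# Placeholder token; a real one comes from Twitter's Developer Portal.
request = build_search_request("COVID19", "YOUR_BEARER_TOKEN")
print(request.full_url)
```

In the actual pipeline, NiFi processors take care of scheduling the call and writing the response to S3; the sketch only shows which inputs (search term and bearer token) a poll depends on.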

Implementation

Prerequisites

Step #1 - Cloudera DataFlow (CDF)

  • Go to the CDF user interface and ensure that the CDF service is enabled in your CDP environment.

  • Import the following flow definition - nifi-twitter-flow.json

  • Select the imported flow, click Deploy, select the Target Environment, and begin the deployment process.

  • During deployment, you will be asked for the following parameters, which this NiFi flow requires to function:

    • AWS - Access Key ID - see Understanding and getting your AWS credentials if you're not sure how to obtain it. Ensure that the AWS IAM user you're using has the "AmazonS3FullAccess" policy attached.
    • AWS - Secret Access Key - same instructions as for AWS - Access Key ID.
    • AWS S3 Bucket - provide the AWS S3 bucket name. Ensure that the IAM user has access to this S3 bucket.
    • AWS S3 Bucket Subdirectory - provide the subdirectory in the AWS S3 bucket where you want to stage your tweets.

      It's usually best to delete any historical data from this subdirectory first, so that only the latest tweets are staged.

    • Twitter API v2 Bearer Token - provide your app's bearer token from Twitter's Developer Portal.
    • Twitter Search Term - provide the search term you want to analyze, for example COVID19, DellTechWorld, or IntelON. Only one search term is allowed at the moment.
  • The Extra Small NiFi node size is sufficient for this data ingestion.

  • After the deployment is done, you will be able to see the flow in the Dashboard.

  • Open the NiFi flow to understand how it works.

 

  • Notes are included in the NiFi flow to help you understand the purpose of each processor.

 

  • All NiFi Flow parameters can be updated while the flow is running, from Deployment Manager. As soon as you Apply Changes, running processors that are affected by the Parameter changes will automatically be restarted.
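
As a rough illustration of how the bucket, subdirectory, and search term parameters fit together, the Python sketch below derives the staging prefix and an object key for one batch of tweets. The `FlowParameters` class and the timestamp-based file naming are hypothetical; the real flow may name staged objects differently. Credentials are deliberately left out of the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FlowParameters:
    """Hypothetical container for values entered during deployment."""
    s3_bucket: str
    s3_subdirectory: str
    search_term: str

    def staging_prefix(self) -> str:
        # Strip stray slashes so the prefix has the form "subdir/",
        # or is empty when tweets are staged at the bucket root.
        cleaned = self.s3_subdirectory.strip("/")
        return f"{cleaned}/" if cleaned else ""

    def object_key(self, fetched_at: datetime) -> str:
        # Hypothetical naming scheme: one object per poll, named by UTC time.
        stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        return f"{self.staging_prefix()}{self.search_term}-{stamp}.json"


params = FlowParameters("my-bucket", "/twitter/staging/", "COVID19")
key = params.object_key(datetime(2022, 5, 1, 12, 0, 15, tzinfo=timezone.utc))
print(key)
```

Whatever naming the flow actually uses, the prefix is the part that matters later: it is the location the Hive external tables must point to.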

Step #2 - Cloudera Data Warehouse (CDW)

  • Go to the CDW user interface. Ensure that the CDW service is activated in your CDP environment, and that a Database Catalog and a Virtual Warehouse compute cluster are available for use.

  • In the Hue editor, manually load the ISO Language Codes into a table. The default settings in the importer wizard work fine. If you're not sure how to upload data in Hue, see Hue Importer -- Select a file, choose a dialect, create a table.

  • In the Hue editor, execute twitter-queries.sql. This creates the tables and views required to support the visuals in the Twitter Dashboard. Before running it, change the AWS S3 location to the location where you staged the tweet data.

  • After the queries execute successfully, you can validate the tables and views using the queries below.

    SELECT * FROM twtr.iso_language_codes a;
    SELECT * FROM twtr.tweets b;
    SELECT * FROM twtr.twtr_view c;
    SELECT * FROM twtr.tweets_by_minute d;
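
The definition of twtr.tweets_by_minute lives in twitter-queries.sql and is not reproduced here. Assuming, from its name, that it counts tweets per minute, the following Python sketch shows the same idea outside Hive: truncate each tweet's `created_at` timestamp to the minute and count. The function name and sample values are illustrative only.

```python
from collections import Counter
from datetime import datetime

# Twitter API v2 returns created_at as an ISO 8601 UTC string,
# e.g. "2022-05-01T12:00:15.000Z".
CREATED_AT_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def tweets_per_minute(created_at_values):
    """Count tweets per minute by truncating timestamps to the minute."""
    counts = Counter()
    for value in created_at_values:
        parsed = datetime.strptime(value, CREATED_AT_FORMAT)
        counts[parsed.replace(second=0, microsecond=0)] += 1
    # Return the buckets in chronological order.
    return dict(sorted(counts.items()))


sample = [
    "2022-05-01T12:00:15.000Z",
    "2022-05-01T12:00:45.000Z",
    "2022-05-01T12:01:05.000Z",
]
per_minute = tweets_per_minute(sample)
for minute, count in per_minute.items():
    print(minute.isoformat(), count)
```

A time series like this is what a per-minute line or bar visual would plot, with the minute on the x-axis and the tweet count on the y-axis.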

Step #3 - Data Visualization

  • Go to the CDW user interface, select Data Visualization, and add a new Data Visualization instance.

  • In the Data Visualization user interface, create a new connection. You must be logged in as an admin to create a new connection.

     

 

  • Now that you have a connection to the Hive Virtual Warehouse, create the two datasets required to support the visuals.

  • Create the first dataset:

    • Dataset Title - Twitter View
    • Dataset Source - From Table
    • Select Database - twtr
    • Select Table - twtr_view
     

 

  • Create the second dataset:

    • Dataset Title - Tweets By Minute
    • Dataset Source - From Table
    • Select Database - twtr
    • Select Table - tweets_by_minute
     

 

  • Next, import the visual artifacts. Take a quick look at Importing a dashboard if you're doing this for the first time.

     

 

 

  • Once you get the following screen, click ACCEPT AND IMPORT.

     

 

  • The Twitter Dashboard should now be imported. To view it, go to VISUALS in the top menu and select Twitter Dashboard.

  • Congratulations on creating your real-time Twitter Dashboard using Cloudera Data Platform! To learn more about its implementation, please register here to watch the recording.

     

     
