Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.
Super Guru

Often lines of business, individual users or shared teams will use online Google Sheets to share spreadsheet and tabular data amongst teams or without outside vendors. It's quick and easy to add sheets and store your data in Google Drive as spreadsheets.

Often you will want to consolidate, federate, analyze, enrich and use this data for reporting and dashboards throughout your organization.

An easy way to do that is to read in the data using Google's Sheet API. This is a standard SSL HTTP REST API that returns clean JSON data.

8464-googlesheets1.png

I created a simple Google Sheet to test ingesting a Google Sheet with HDF. You will need to enable Google Sheets API in the Google APIs Console. You must be logged into Google and have a Google Account (use the one where you created your Google Spreadsheets).

8494-sheetsapi.png

Google Documentation

Google provides a few Quick starts that you can use to ingest this data: https://developers.google.com/sheets/quickstart/js

or https://developers.google.com/sheets/quickstart/python. I chose to ingest this data the easiest way with a simple REST call from NIFI.

Testing Your Queries in Google's API Explorer

To test your queries and get your exact URL, go to Google's API Explorer: https://developers.google.com/apis-explorer/#p/sheets/v4/

GET https://sheets.googleapis.com/v4/spreadsheets/1sbMyDocID?includeGridData=true&key=MYKEYISFROMGOOGLE

Where 1sb… is the document id that comes from the name you see in your google sheet page like so: https://docs.google.com/spreadsheets/d/1UMyDocumentId/edit#g.

Calling the API From HDF 2.0

The one thing you will need is to setup a StandardSSLContextService to read in HTTPS data.

8465-sheets2.png

You will need to grab the truststore file cacerts for the JRE that NiFi is using to run.

8469-sheetsssl.png

8468-sheetscert.png

By default the Truststore Password is changeit. You really should change it.

Once you have an SSL configuration setup, then you can do a GetHTTP. You add in the Sheets GoogleAPI URL that includes the Sheet ID. I also set the User Agent, Accept Content-type and Follow Redirects = True.

8467-sheets4get.png

Now that we have SSL enabled, we can make our call to Google. The flow below is pretty simple.

8484-sheetsnifi.png

Now that I have ingested the Google Sheet, I can store it as JSON in my data lake. You could process this in HDF many ways including taking out fields, enriching with other data sources, converting to AVRO or ORC, storing in a HIVE table, Phoenix or HBase.

8481-sheetshdfs.png

You have now ingested Google Sheet data. Determining what you want to do to it and parsing out the JSON is a fun exercise. You can use an EvaluateJsonPath processor in Apache NiFi to pull out fields you want. Inside that processor you add a field and then a value like so $.entities.media[0].media_url that runs JsonPath

HDF 2.0 Diagram Overview

8493-sheetsnifi2.png

Reference:


sheets4ssl.png
2,268 Views
Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
2 of 2
Last update:
‎08-17-2019 09:01 AM
Updated by:
 
Contributors
Top Kudoed Authors