Created on 10-12-201603:33 PM - edited 08-17-201909:01 AM
Often lines of business, individual users or shared teams will use online Google Sheets to share spreadsheet and tabular data amongst teams or without outside vendors. It's quick and easy to add sheets and store your data in Google Drive as spreadsheets.
Often you will want to consolidate, federate, analyze, enrich and use this data for reporting and dashboards throughout your organization.
An easy way to do that is to read in the data using Google's Sheet API. This is a standard SSL HTTP REST API that returns clean JSON data.
I created a simple Google Sheet to test ingesting a Google Sheet with HDF. You will need to enable Google Sheets API in the Google APIs Console. You must be logged into Google and have a Google Account (use the one where you created your Google Spreadsheets).
The one thing you will need is to setup a StandardSSLContextService to read in HTTPS data.
You will need to grab the truststore file cacerts for the JRE that NiFi is using to run.
By default the Truststore Password is changeit. You really should change it.
Once you have an SSL configuration setup, then you can do a GetHTTP. You add in the Sheets GoogleAPI URL that includes the Sheet ID. I also set the User Agent, Accept Content-type and Follow Redirects = True.
Now that we have SSL enabled, we can make our call to Google. The flow below is pretty simple.
Now that I have ingested the Google Sheet, I can store it as JSON in my data lake. You could process this in HDF many ways including taking out fields, enriching with other data sources, converting to AVRO or ORC, storing in a HIVE table, Phoenix or HBase.
You have now ingested Google Sheet data. Determining what you want to do to it and parsing out the JSON is a fun exercise. You can use an EvaluateJsonPath processor in Apache NiFi to pull out fields you want. Inside that processor you add a field and then a value like so $.entities.media.media_url that runs JsonPath