
How to use the Google Analytics API to import data into a data warehouse built on Hadoop

Expert Contributor

How do I use the Google Analytics API to import data into the Hadoop file system? Is there a tool that can interact directly with the API and import the data into the warehouse, or do I need to write programs for it? Could someone please point me to an example that shows importing data from Google Analytics into HDFS, more precisely into Hive?

1 ACCEPTED SOLUTION


Sounds like https://community.hortonworks.com/questions/33961/how-to-import-data-return-by-google-analytic-s-api... was a repost of this earlier question. I provided my (more generic) answer over there, but maybe someone has a more specific response tied directly to Google Analytics and Hadoop. Good luck!


2 REPLIES

New Contributor

We built a data warehouse around much the same idea you are describing in your question.

Integrating Salesforce and Google Analytics into a data warehouse: @infocaptor http://www.infocaptor.com

The benefit is that you can also correlate it with your financial data. When you design against the GA API, you first need to load the initial historical data for a certain date range. This has its own complications: you might run into segmentation issues, loss of data, and so on, and you need to handle pagination.
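
To make the initial historical load concrete, here is a minimal sketch of a paginated pull against the (older, v3) Google Analytics Core Reporting API using google-api-python-client. The view ID, metric/dimension choices, service-account key file, and output path are all placeholders, not anything from the original post.

# Paginated historical pull from the GA Core Reporting API (v3).
# View ID, metrics, dimensions and paths are illustrative only.
import csv
from googleapiclient.discovery import build
from google.oauth2 import service_account

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
creds = service_account.Credentials.from_service_account_file(
    'ga-service-account.json', scopes=SCOPES)   # hypothetical key file
analytics = build('analytics', 'v3', credentials=creds)

def extract_range(view_id, start_date, end_date, out_path):
    """Pull one date range, following start-index pagination (10k rows per page)."""
    start_index = 1
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['date', 'pagePath', 'sessions', 'pageviews'])
        while True:
            resp = analytics.data().ga().get(
                ids='ga:' + view_id,
                start_date=start_date,
                end_date=end_date,
                metrics='ga:sessions,ga:pageviews',
                dimensions='ga:date,ga:pagePath',
                start_index=start_index,
                max_results=10000).execute()
            for row in resp.get('rows', []):     # dimensions first, then metrics
                writer.writerow(row)
            if start_index + 10000 > resp.get('totalResults', 0):
                break                            # last page written
            start_index += 10000

# initial historical load for one year:
# extract_range('12345678', '2015-01-01', '2015-12-31', '/tmp/ga_history.csv')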

Once the initial data load is complete, you run it in incremental mode, where you bring in only new data. This data gets appended to the same data warehouse tables and does not create duplicates for overlapping dates. At a minimum, you need to design some kind of background daemon that runs every day or at some other frequency.
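
The incremental step can reuse the same extract logic, assuming an extract_range() helper like the one sketched above and some record of the last date already loaded (in practice that record would come from the job tables described next):

# Incremental pull: only fetch days newer than what is already loaded.
from datetime import date, timedelta

def incremental_pull(view_id, last_loaded):
    start = last_loaded + timedelta(days=1)
    end = date.today() - timedelta(days=1)   # stop at yesterday; today's GA data is incomplete
    if start > end:
        return None                          # nothing new yet
    out_path = '/tmp/ga_%s_%s.csv' % (start.isoformat(), end.isoformat())
    extract_range(view_id, start.isoformat(), end.isoformat(), out_path)
    return out_path

# e.g. scheduled from cron once a day:
# 0 3 * * * /usr/bin/python /opt/etl/ga_incremental.py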

You will need job tables to monitor the success and failure of the extracts so that the process can resume from where an error occurred.
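
As an illustration, a small job-tracking sketch (SQLite is used here only for brevity; in practice the job table could live in MySQL, Postgres, or Hive). Table and column names are made up:

# Record each extract window so a rerun can resume from failed ranges.
import sqlite3

conn = sqlite3.connect('ga_etl_jobs.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS extract_jobs (
        start_date TEXT,
        end_date   TEXT,
        status     TEXT,                               -- 'SUCCESS' or 'FAILED'
        run_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (start_date, end_date)
    )""")

def record(start_date, end_date, status):
    conn.execute(
        "INSERT OR REPLACE INTO extract_jobs (start_date, end_date, status) "
        "VALUES (?, ?, ?)", (start_date, end_date, status))
    conn.commit()

def failed_ranges():
    """Date ranges to retry on the next run."""
    return conn.execute(
        "SELECT start_date, end_date FROM extract_jobs "
        "WHERE status = 'FAILED'").fetchall()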

Some of the other considerations:

1. What happens if you run the extract for the same date range?

2. What happens if a job fails for certain dates?

For both of these, it is important to set primary keys on your DW target tables (see the sketch below).
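
One caveat on the primary-key point: Hive itself does not enforce primary keys, so a common way to keep re-runs of the same date range idempotent is to partition the target table by date and overwrite the affected partition on reload. The database, table, and column names below are illustrative only:

# Idempotent reload of one day's data via partition overwrite in Hive.
import subprocess

DDL = """
CREATE TABLE IF NOT EXISTS analytics.ga_pageviews (
    page_path  STRING,
    sessions   BIGINT,
    pageviews  BIGINT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
"""

def reload_partition(day, staged_hdfs_file):
    """Replace the partition for `day`; the staged file is assumed to hold
    only that day's rows, without the date column (dt is the partition key)."""
    subprocess.check_call(['hive', '-e', DDL])
    subprocess.check_call(['hive', '-e',
        "LOAD DATA INPATH '%s' OVERWRITE INTO TABLE analytics.ga_pageviews "
        "PARTITION (dt='%s')" % (staged_hdfs_file, day)])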

The extracted data is stored as CSV files, and these can easily be pushed into the Hadoop file system.
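
For that last step, a minimal sketch using the standard hdfs dfs CLI from Python; the staged file can then be loaded into Hive, for example with the partition-overwrite sketch above. The staging path is a placeholder:

# Push an extracted CSV into an HDFS staging directory.
import subprocess

def push_to_hdfs(local_csv, hdfs_dir='/staging/ga'):
    subprocess.check_call(['hdfs', 'dfs', '-mkdir', '-p', hdfs_dir])
    subprocess.check_call(['hdfs', 'dfs', '-put', '-f', local_csv, hdfs_dir])
    return hdfs_dir + '/' + local_csv.rsplit('/', 1)[-1]

# staged = push_to_hdfs('/tmp/ga_2015-06-01_2015-06-01.csv')
# reload_partition('2015-06-01', staged)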