How to import data returned by the Google Analytics API into Hive

Expert Contributor

I am planning to write a Python program that can import data into HDFS.

1. Are there any examples?

2. How do I import the JSON response from the API into Hive?

3. Why do I have to manually create external tables in Hive, and how do I decide on and create a schema for the whole set of data that the API returns?

1 ACCEPTED SOLUTION


My thoughts on #1 & #2: Some googling shows there are folks out there with somewhat direct ways for you to open an HDFS file and write to it as you are getting the data from the external system (google api in this case). That said, I'd consider applying the KISS principle and have your python program write the results into a file so that when you are done (and you are sure you are done -- i.e. this helps prevent a half-baked file in HDFS) simply use the hadoop fs -put command to drop the complete file exactly where you want it in HDFS.

As for #3: You have to create the table (external or not -- even "managed" tables can reside outside of /apps/hive/warehouse) because Hive is an ecosystem tool layered on top of base HDFS, and the CREATE TABLE DDL statement stores the metadata about the logical table you want mapped onto your data. The good news is that you can create that table before or after you load the data. Additionally, if you are going to keep adding net-new data to the table, you don't have to create it again.
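
Assuming the records are stored one JSON document per line (as in the sketch above) and the Hive JSON SerDe shipped in hive-hcatalog-core is available, the DDL could look like the example below. The table name, column names/types, and LOCATION are illustrative -- match them to the fields your API actually returns.

CREATE EXTERNAL TABLE ga_report (
  `date` STRING,
  sessions INT,
  pageviews INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/ga/reports/';

Depending on your distribution you may need to ADD JAR the hive-hcatalog-core jar before that SerDe resolves; another option is to land the raw JSON in a single STRING column and pull fields out with get_json_object().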

Good luck!
