
Load multiple JSON files into a Hive table

Solved

I'm using NiFi to stream data from Twitter, and NiFi stores each tweet in its own JSON file with a random name. This is a problem when I create a table in Hive and try to load the files into it. Is there a Hive command or a script that can load multiple files with random names into one Hive table?

1 ACCEPTED SOLUTION


Re: Load multiple JSON files into a Hive table

@nejm hadj

First I’ll answer your question and then I’ll make my recommendation.

Answer:

The names of the files do not matter. When setting up a Hive external table, just point its location at the folder that will contain all the files, regardless of their names.

Details on setting up an external table:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/moving_data_from...

Details on reading/parsing JSON files into Hive:

http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queri...
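Putting those two links together, a minimal sketch of an external table over a folder of raw tweet JSON files might look like the following. The path, column names, and SerDe choice here are illustrative assumptions, not taken from the thread; `org.apache.hive.hcatalog.data.JsonSerDe` is one common option and requires the hive-hcatalog-core jar on Hive's classpath:

```sql
-- Illustrative HDFS landing folder where NiFi writes the tweet files;
-- the individual file names inside it do not matter.
CREATE EXTERNAL TABLE tweets_raw (
  id         BIGINT,
  created_at STRING,
  text       STRING,
  `user`     STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/nifi/tweets';

-- Every JSON file in the folder is now queryable as rows:
SELECT `user`.screen_name, text FROM tweets_raw LIMIT 10;
```

Note that `user` must be backtick-quoted because it is a reserved word in recent Hive versions.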

(alternatively, you can convert JSON to CSV within NiFi. To do so, follow the NiFi portion of this example https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.h...)

Recommendation:

HDFS prefers a small number of large files over many small files. The main reason is that for every file stored in HDFS, metadata is kept in the NameNode's memory. If you put each Twitter message in a separate file, you will quickly fill up the NameNode's memory and overload the server.

I suggest you aggregate multiple messages into one file before writing to HDFS. This can be done with the MergeContent processor in NiFi. Take a look at the screenshots below showing how it would be set up. Also, take a look at the NiFi Twitter_Dashboard.xml example template (https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/Twitter_Dashboard.x...). You can import it into your NiFi by clicking on Templates (third icon from the right), which launches the 'NiFi Flow templates' popup, and selecting the file.

3098-screen-shot-2016-03-30-at-104307-am.png

3099-screen-shot-2016-03-30-at-104224-am.png
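As a rough guide to what screenshots like these typically configure, here is a sketch of a MergeContent setup. The property names come from the stock MergeContent processor; the values are illustrative assumptions you would tune to your tweet volume:

```
MergeContent processor (illustrative settings):
  Merge Strategy            : Bin-Packing Algorithm
  Merge Format              : Binary Concatenation
  Minimum Number of Entries : 1000      (tweets per output file)
  Maximum Number of Entries : 10000
  Max Bin Age               : 5 min     (flush partially filled bins)
  Delimiter Strategy        : Text
  Demarcator                : \n        (one JSON object per line)
```

The newline demarcator keeps the merged file as one JSON object per line, which is the layout Hive's JSON SerDes expect.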


2 REPLIES

Re: Load multiple JSON files into a Hive table

Hi @nejm hadj, does NiFi write the files to one directory? You should be able to define an EXTERNAL table with that directory as its location; then all the files in the directory will be part of the one table.

For example:

CREATE EXTERNAL TABLE tweets (col1 string, col2 string)
LOCATION '/path/to/dir';

You'll also need a JSON SerDe to parse the data into your table columns, but the LOCATION clause is all it takes to pick up every file in the directory.
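For example, the table above could be extended with a SerDe clause like this. The hcatalog SerDe shown is one option (it needs the hive-hcatalog-core jar available to Hive), and `col1`/`col2` remain placeholder columns:

```sql
CREATE EXTERNAL TABLE tweets (col1 STRING, col2 STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/path/to/dir';

-- New files dropped into /path/to/dir show up in the table
-- automatically; no LOAD DATA statement is needed:
SELECT COUNT(*) FROM tweets;
```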

