
From CSV to Hive via NiFi

Contributor

I want to use NiFi to ingest a CSV file (GetFile) into Hive (PutHiveQL).

Since PutHiveQL needs a SQL statement, how can I generate it?

A possible flow would be GetFile -> InferAvroSchema -> ConvertCSVToAvro -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutHiveQL.

This looks complex and resource-consuming. Any suggestions?
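For context, PutHiveQL executes whatever HiveQL statement arrives in the FlowFile content, so the chain above ultimately has to produce statements like the one below. ConvertJSONToSQL emits a parameterized statement; the table and columns here are hypothetical, just to illustrate the shape:

```sql
-- Hypothetical output of ConvertJSONToSQL for one CSV row;
-- the actual values travel alongside as sql.args.N.* FlowFile attributes
INSERT INTO users (id, name, signup_date) VALUES (?, ?, ?)
```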

1 ACCEPTED SOLUTION

Master Guru
11 REPLIES

Super Guru

@Joe Harvy

NiFi is best used for ingesting live streaming data, with thousands of records per second. For your use case, why not simply import the file into a staging area in Hadoop, create a temp table, and then do an INSERT ... SELECT using Hive? While inserting, simply change the format to ORC.
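That staging-table approach can be sketched in HiveQL; the table layout and HDFS path below are assumptions, not from the thread:

```sql
-- Stage the raw CSV as an external text table over the landing directory
CREATE EXTERNAL TABLE staging_users (id INT, name STRING, signup_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/users';

-- Target table stored in ORC format
CREATE TABLE users_orc (id INT, name STRING, signup_date STRING)
STORED AS ORC;

-- The format conversion to ORC happens during the insert
INSERT INTO TABLE users_orc SELECT * FROM staging_users;
```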

Contributor

Hi @mqureshi

I don't agree with that statement. If that were the case, why are there processors like GetFile, GetHDFS, QueryDatabaseTable, etc.?

Super Guru

@Joe Harvy There are many use cases where you will need to get a file, change its format, extract/drop records, filter JSON, and so on. Your use case does not seem to be one of them. But fair enough, if you don't agree and would still like to go down the path you have chosen, I am sure somebody will give you a better answer that can validate your approach.


Do you need to convert to JSON? From Avro you can use the Kite DataSet processor and store in Hive as Parquet:

https://community.hortonworks.com/articles/70257/hdfnifi-to-convert-row-formatted-text-files-to-col....

Contributor

@Binu Mathew

Thanks for your reply. No, I don't need the data in JSON; I just need to ingest it directly into Hive. How do you ingest data into Hive in your suggestion? PutHiveQL expects a SQL statement.


Configure the 'StoreInKiteDataset' processor's 'Target dataset URI' property to point at your Hive table; your Avro-formatted data will then be converted to a Parquet-formatted Hive table. For example, I'm writing to a Hive table named weblogs_parquet: dataset:hive://ip-172-31-2-102.us-west-2.compute.internal:9083/weblogs_parquet

Learn more about Kite Datasets at http://kitesdk.org/docs/current/

Master Guru

Contributor

Hi @Matt Burgess

Thanks for your detailed answer. Your first suggestion looks interesting. I'll give it a try.

I still have a question on ConvertJSONToSQL, if you can help:

https://community.hortonworks.com/questions/80362/jdbc-connection-pool-for-convertjsontosql.html

Master Guru

I added an answer to that question, but it is likely unsatisfying, as it is an open issue. The Hive driver used in the Hive processors is based on Apache Hive 1.2.x, which does not support a handful of JDBC API methods used by those processors.