
NiFi: batch insertion of data into Hive (requesting suggestions)

New Contributor

I'm using CDF and NiFi on CDP (on-prem) to orchestrate a daily copy of data from a relational database into Hive 3.  I have logic working that creates the tables via an Avro-to-ORC conversion and then uses the resulting hive.ddl attribute to create each table (a rough sketch of the generated DDL is included after the questions below).  I have two primary issues:

 

1) There are no unique columns or primary keys in the source system.  I have a NiFi processor that runs "delete from database.tablename" against Hive, which clears out the data but leaves the table structure in place.  On large tables this can take some time.  I have to do this because PutHive3Streaming cannot recognize duplicates and would otherwise keep appending, inflating the table with duplicate records.  Are there other options that avoid deleting every row but still let me insert the data?

 

2) From a performance standpoint, PutHive3Streaming works, but it's quite slow.  I've compared it to loading with Sqoop, and Sqoop is substantially faster.  I would still like to use NiFi because it's a better fit from an orchestration and monitoring standpoint.  Are there other processors better suited to bulk insertion?  The incoming flowfiles each contain around 50,000 records (roughly 15 MB, I believe).  From what I've read, the Hive Streaming API seems more suited to Kafka or other messaging systems.  I've also seen an example of running Sqoop from NiFi, but that comes with some credential/access challenges, so I would prefer a NiFi-native solution.

 

I have 80+ tables, some with millions of records.  Does anyone have suggestions on alternative methods or best practices for doing this work with NiFi?  Thanks in advance.
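
For context, the kind of DDL the Avro-to-ORC step leaves in the hive.ddl attribute, together with the per-run cleanup from issue 1, looks roughly like the following.  The database, table, and column names are made-up placeholders, and the exact generated DDL (managed vs. external, table properties) will depend on how the flow and the cluster are configured:

-- table created once from the generated hive.ddl attribute (placeholder schema)
CREATE TABLE IF NOT EXISTS mydb.my_table (
  id INT,
  name STRING,
  updated_at TIMESTAMP
)
STORED AS ORC;

-- per-run cleanup before re-inserting everything (slow on large tables)
DELETE FROM mydb.my_table;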

1 ACCEPTED SOLUTION

New Contributor

I believe I found a solution.  I ended up writing the raw ORC files to HDFS (via PutHDFS) and then loading them into Hive internal tables (via PutHive3QL).  The command to load data into a Hive table from an existing file is:

 

LOAD DATA INPATH 'hdfs:///data/orc_file_name' OVERWRITE INTO TABLE hivedatabasename.tablename
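
If the file path and table name need to vary per flowfile, the statement can be built upstream of the Hive processor (for example with a ReplaceText processor) using NiFi Expression Language; the attribute names below are only illustrative, not something set automatically by the flow:

-- statement template filled in from flowfile attributes (example attribute names)
LOAD DATA INPATH '${absolute.hdfs.path}/${filename}' OVERWRITE INTO TABLE ${target.database}.${target.table}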
