
NiFi: batch insertion of data into Hive (requesting suggestions)

New Contributor

I'm using CDF and NiFi on CDP (on-prem) to orchestrate a daily copy of data from a relational database into Hive 3.  I have logic working that creates the tables via an Avro-to-ORC conversion and then uses the resulting hive.ddl attribute to create each table (a rough sketch of the generated DDL is included after the questions below).  I have two primary issues:

 

1) There are no unique columns or primary keys in the source system.  I have a NiFi processor that runs "delete from database.tablename" against Hive, which clears out the data but leaves the table structure in place.  On large tables this can take some time.  I have to do this because PutHive3Streaming cannot recognize duplicates and would otherwise keep appending, inflating the table with duplicate records.  Are there other options that avoid deleting every row but still let me insert the data?

 

2) From a performance standpoint, PutHive3Streaming works, but it's quite slow.  I've compared it to loading with Sqoop, and Sqoop is substantially faster.  I would still like to use NiFi because it's a better fit from an orchestration and monitoring standpoint.  Are there other processors better suited to bulk insertion?  The incoming flowfiles each contain around 50,000 records (roughly 15 MB, I believe).  From what I've read, the Hive Streaming API seems more suited to Kafka or other messaging systems.  I've also seen an example of running Sqoop from NiFi, but that comes with some credential/access challenges, so I would prefer a NiFi-native solution.

 

I have 80+ tables, some with millions of records.  Does anyone have suggestions on alternative methods or best practices for doing this work with NiFi?  Thanks in advance.
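
For context, the kind of DDL the Avro-to-ORC step leaves in the hive.ddl attribute, together with the per-run cleanup from issue 1, looks roughly like the following.  The database, table, and column names are made-up placeholders, and the exact generated DDL (managed vs. external, table properties) will depend on how the flow and the cluster are configured:

-- table created once from the generated hive.ddl attribute (placeholder schema)
CREATE TABLE IF NOT EXISTS mydb.my_table (
  id INT,
  name STRING,
  updated_at TIMESTAMP
)
STORED AS ORC;

-- per-run cleanup before re-inserting everything (slow on large tables)
DELETE FROM mydb.my_table;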

1 ACCEPTED SOLUTION

New Contributor

I believe I found a solution.  I ended up writing the raw ORC files to HDFS (via PutHDFS) and then loading them into Hive internal tables (via PutHive3QL).  The command to load data into a Hive table from an existing file is:

 

LOAD DATA INPATH 'hdfs:///data/orc_file_name' OVERWRITE INTO TABLE hivedatabasename.tablename
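
If the file path and table name need to vary per flowfile, the statement can be built upstream of the Hive processor (for example with a ReplaceText processor) using NiFi Expression Language; the attribute names below are only illustrative, not something set automatically by the flow:

-- statement template filled in from flowfile attributes (example attribute names)
LOAD DATA INPATH '${absolute.hdfs.path}/${filename}' OVERWRITE INTO TABLE ${target.database}.${target.table}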
