
Is there a better alternative to the PutHiveQL processor? It is very slow

Hello there,

I need help with a NiFi flow that gathers all the columns of my Hive tables and inserts them into a metadata table in Hive. Attached below is a screen capture of the flow (flow-with-issue.png). At the last step, PutHiveQL, a lot of FlowFiles are queuing up. I increased the number of concurrent tasks to 5, but the process is still very slow. Here are the steps:

  1. SelectHiveQL - runs SHOW COLUMNS IN <table>
  2. ConvertAvroToJSON - converts the Avro result set to JSON
  3. SplitJson - splits the JSON into one FlowFile per column
  4. EvaluateJsonPath - extracts the column names
  5. ReplaceText - builds the INSERT statements (and the CREATE TABLE for the metadata table); the configuration is attached (replacetext-configuration.png)
  6. MergeContent - combines as many INSERTs as possible into a single FlowFile (a sketch of the merged statement follows this list)
  7. PutHiveQL - executes the statements
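
For reference, here is roughly the kind of statement the flow ends up handing to PutHiveQL. This is only an illustrative sketch; the metadata table name and columns are placeholders, not my actual schema:

    -- hypothetical metadata table created by the ReplaceText step
    CREATE TABLE IF NOT EXISTS table_metadata (
      source_table   STRING,
      column_name    STRING,
      is_primary_key BOOLEAN
    );

    -- merged multi-row insert, one row per column found by SHOW COLUMNS
    INSERT INTO TABLE table_metadata VALUES
      ('dbo.customers', 'customer_id', true),
      ('dbo.customers', 'first_name',  false),
      ('dbo.customers', 'last_name',   false);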

I have over 200 tables from SQL Server streaming into Hadoop, some with as many as 50-60 columns.

I did some research on Hortonworks and elsewhere and found posts saying that PutHiveQL is not recommended for large data inserts.

I am taking this approach because Hive has no equivalent of a dba_columns view, and I need to know the column names and primary key of each table I have just ingested. So I am saving that as metadata that I query from SQL Server.
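
For context, the closest thing Hive gives me out of the box is shown below; neither statement exposes primary-key information (the table name is just an example):

    -- Hive has no catalog view like SQL Server's sys.columns or Oracle's dba_tab_columns
    SHOW COLUMNS IN dbo_customers;   -- returns column names only
    DESCRIBE dbo_customers;          -- returns names and types, still no primary key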

Is there a better way to solve this? Maybe persist the data in memory for the NiFi flow to access?

Thank you.



Re: Is there a better alternative to the PutHiveQL processor? It is very slow

I resolved this by using the PutHDFS processor to write the data to HDFS, creating an external table over that directory, and then running a single INSERT OVERWRITE with a SELECT DISTINCT to load the de-duplicated data, roughly as sketched below.
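
In HiveQL terms the pattern looks roughly like this; the location, table, and column names below are placeholders for illustration, not the exact ones from my flow:

    -- external table over the directory PutHDFS writes into
    CREATE EXTERNAL TABLE IF NOT EXISTS table_metadata_staging (
      source_table   STRING,
      column_name    STRING,
      is_primary_key BOOLEAN
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/staging/table_metadata';

    -- one bulk statement replaces thousands of per-row inserts,
    -- and SELECT DISTINCT de-duplicates data from repeated flow runs
    INSERT OVERWRITE TABLE table_metadata
    SELECT DISTINCT source_table, column_name, is_primary_key
    FROM table_metadata_staging;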