Hello there,
I need help with a NiFi flow that gathers all the columns of my Hive tables and then inserts them into a metadata table in Hive. Attached is a screen capture of the flow: (flow-with-issue.png). At the last step, PutHiveQL, a lot of flowfiles are queuing up. I changed the number of concurrent tasks to 5, but it is still a very slow process. Here are the steps:
- SelectHiveQL - runs SHOW COLUMNS IN <table>
- ConvertAvroToJSON - converts the Avro result to JSON
- SplitJson - splits the JSON so there is one flowfile per column
- EvaluateJsonPath - extracts the column names
- ReplaceText - builds the INSERT statements (and the CREATE TABLE for the metadata table); the configuration is attached (ReplaceText_configuration.png), and a sketch of the generated statements is shown after this list
- MergeContent - combines the inserts as much as possible
- PutHiveQL - executes the statements
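To give a sense of what ends up queued at PutHiveQL, the merged flowfile content looks roughly like this (the table name column_metadata and the values are only illustrative; the real names come from the upstream attributes and my ReplaceText configuration):

INSERT INTO TABLE column_metadata VALUES ('customer', 'customer_id');
INSERT INTO TABLE column_metadata VALUES ('customer', 'first_name');
INSERT INTO TABLE column_metadata VALUES ('customer', 'last_name');

Even after merging, each statement is still an individual single-row insert in Hive, which I suspect is why it is so slow.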
I have over 200 tables streaming from SQL Server into Hadoop, and some of them have as many as 50-60 columns, so a single pass through this flow produces several thousand per-column flowfiles and insert statements.
I did some research on Hortonworks and elsewhere and found posts advising against using PutHiveQL for large data inserts.
I am taking this approach because I don't have an equivalent of dba_columns in Hive, and I need to know the column names and primary key of each table I just ingested. So I am saving that as metadata that I can query, the way I would query it from SQL Server.
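For context, the metadata table the CREATE TABLE statement builds looks roughly like this (the name, columns, and types are illustrative, not the real DDL from my flow, and I have left out how I record the primary key):

-- Sketch only; the real names and types come from my ReplaceText template
CREATE TABLE IF NOT EXISTS column_metadata (
  table_name  STRING,  -- table that was ingested
  column_name STRING   -- one row per column, taken from SHOW COLUMNS
);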
Is there a better way to solve this? Maybe persist the data somewhere the NiFi flow can access it in memory?
Thank you.
