Created 09-07-2018 01:35 PM
Hi All,
I have a 10GB file every minute coming to a location (/dir), and there is an external table for that location.
The file is as below
karlon,n_d_1,26,6234,2019-09-08,1536278400 d'lov,research,20,1001,2019-09-08,1536278400 kris'a,b_x_3,20,4532,2019-09-08,1536278400
external table name: ex_t
name | department | age | id | date | time |
karlon | n_d_1 | 26 | 6234 | 2019-09-08 | 1536278400 |
d'lov | research | 20 | 1001 | 2018-09-08 | 1536278400 |
I have puthiveql processor in my flow which gets data from external table and inserts in to multiple ORC table.
ORC : table_1, table_2, table_3,table_4,table_5, table_6
Every table(orc table) has same columns.
name(string),department (string),age (int),id (int),date (string),partition_value (int)
The puthiveql processor has multiple insert queries in it.
INSERT INTO table_1 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400'; INSERT INTO table_2 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'n_d_1' AND time='1536278400'; INSERT INTO table_3 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'b_x_3' AND time='1536278400'; INSERT INTO table_4 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400'; INSERT INTO table_5 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'x_in_1' AND time='1536278400'; INSERT INTO table_6 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'z_e_3' AND time='1536278400';
The above is sent as a flowfile to puthiveql, which is scheduled every minute, as the file arrives every minute.
Puthiveql is very slow process the above and the inserts are not happening frequently.
Can you please suggest how to improve the performance of the puthiveql, I have increased the concurrent processor but it did not help, some times the flowfiles(which have insert statements) get queued and never execute.
Suggestions are highly appreciated.
Created 09-07-2018 07:52 PM
The problem here could be that the external table isn't structured to make the filter/split of this file optimal, for example " WHERE department = 'xxx' AND time='yyyy';" executed against a non-partitioned external table causes a complete file scan of the 10gb for each statement (so you're reading the 10GB entirely every time)
You may want to read the file into nifi flow file with a configured buffer as actual data in stead of taking the external table approach, alternatively, you can use an intermediate orc table that inserts the entire external file in some sort of sorted manner, before splitting it into multiple tables based on some filter (which you would optimize for in your intermediate table structure), I'd personally recommend the first approach though
Created 09-07-2018 07:52 PM
The problem here could be that the external table isn't structured to make the filter/split of this file optimal, for example " WHERE department = 'xxx' AND time='yyyy';" executed against a non-partitioned external table causes a complete file scan of the 10gb for each statement (so you're reading the 10GB entirely every time)
You may want to read the file into nifi flow file with a configured buffer as actual data in stead of taking the external table approach, alternatively, you can use an intermediate orc table that inserts the entire external file in some sort of sorted manner, before splitting it into multiple tables based on some filter (which you would optimize for in your intermediate table structure), I'd personally recommend the first approach though
Created 09-12-2018 01:57 PM
for some reason I cannot follow the first approach.
I tried creating an intermediate orc with partitions and loaded the data in to it from external table.
now when I load in to the destination from the intermediate table, puthiveql is taking a lot of time.
any suggestions are appreciated.