Support Questions

Find answers, ask questions, and share your expertise

puthiveQL is very slow

avatar
Expert Contributor

Hi All,

I have a 10GB file every minute coming to a location (/dir), and there is an external table for that location.

The file is as below

karlon,n_d_1,26,6234,2019-09-08,1536278400
d'lov,research,20,1001,2019-09-08,1536278400
kris'a,b_x_3,20,4532,2019-09-08,1536278400

external table name: ex_t

namedepartmentageiddatetime
karlonn_d_12662342019-09-081536278400
d'lovresearch2010012018-09-08

1536278400

I have puthiveql processor in my flow which gets data from external table and inserts in to multiple ORC table.

ORC : table_1, table_2, table_3,table_4,table_5, table_6

Every table(orc table) has same columns.

name(string),department (string),age (int),id (int),date (string),partition_value (int)

The puthiveql processor has multiple insert queries in it.

INSERT INTO table_1 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400';

INSERT INTO table_2 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'n_d_1' AND time='1536278400';

INSERT INTO table_3 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'b_x_3' AND time='1536278400';

INSERT INTO table_4 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400';

INSERT INTO table_5 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'x_in_1' AND time='1536278400';

INSERT INTO table_6 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'z_e_3' AND time='1536278400';

The above is sent as a flowfile to puthiveql, which is scheduled every minute, as the file arrives every minute.

Puthiveql is very slow process the above and the inserts are not happening frequently.

Can you please suggest how to improve the performance of the puthiveql, I have increased the concurrent processor but it did not help, some times the flowfiles(which have insert statements) get queued and never execute.

Suggestions are highly appreciated.

1 ACCEPTED SOLUTION

avatar
Rising Star

The problem here could be that the external table isn't structured to make the filter/split of this file optimal, for example " WHERE department = 'xxx' AND time='yyyy';" executed against a non-partitioned external table causes a complete file scan of the 10gb for each statement (so you're reading the 10GB entirely every time)

You may want to read the file into nifi flow file with a configured buffer as actual data in stead of taking the external table approach, alternatively, you can use an intermediate orc table that inserts the entire external file in some sort of sorted manner, before splitting it into multiple tables based on some filter (which you would optimize for in your intermediate table structure), I'd personally recommend the first approach though

View solution in original post

2 REPLIES 2

avatar
Rising Star

The problem here could be that the external table isn't structured to make the filter/split of this file optimal, for example " WHERE department = 'xxx' AND time='yyyy';" executed against a non-partitioned external table causes a complete file scan of the 10gb for each statement (so you're reading the 10GB entirely every time)

You may want to read the file into nifi flow file with a configured buffer as actual data in stead of taking the external table approach, alternatively, you can use an intermediate orc table that inserts the entire external file in some sort of sorted manner, before splitting it into multiple tables based on some filter (which you would optimize for in your intermediate table structure), I'd personally recommend the first approach though

avatar
Expert Contributor

@rtheron

for some reason I cannot follow the first approach.

I tried creating an intermediate orc with partitions and loaded the data in to it from external table.

now when I load in to the destination from the intermediate table, puthiveql is taking a lot of time.

any suggestions are appreciated.