puthiveQL is very slow

mark_hadoop — Fri, 07 Sep 2018 20:35:10 GMT

Hi All,

I have a 10GB file every minute coming to a location (/dir), and there is an external table for that location.

The file is as below

karlon,n_d_1,26,6234,2019-09-08,1536278400
d'lov,research,20,1001,2019-09-08,1536278400
kris'a,b_x_3,20,4532,2019-09-08,1536278400

external table name: ex_t

name	department	age	id	date	time
karlon	n_d_1	26	6234	2019-09-08	1536278400
d'lov	research	20	1001	2018-09-08	1536278400

I have puthiveql processor in my flow which gets data from external table and inserts in to multiple ORC table.

ORC : table_1, table_2, table_3,table_4,table_5, table_6

Every table(orc table) has same columns.

name(string),department (string),age (int),id (int),date (string),partition_value (int)

The puthiveql processor has multiple insert queries in it.

INSERT INTO table_1 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400';

INSERT INTO table_2 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'n_d_1' AND time='1536278400';

INSERT INTO table_3 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'b_x_3' AND time='1536278400';

INSERT INTO table_4 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'research' AND time='1536278400';

INSERT INTO table_5 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'x_in_1' AND time='1536278400';

INSERT INTO table_6 PARTITION(partition_value) SELECT name, department, age, id, date, cast(regexp_replace(date,'-','') as int) AS partition_value FROM ex_t WHERE department = 'z_e_3' AND time='1536278400';

The above is sent as a flowfile to puthiveql, which is scheduled every minute, as the file arrives every minute.

Puthiveql is very slow process the above and the inserts are not happening frequently.

Can you please suggest how to improve the performance of the puthiveql, I have increased the concurrent processor but it did not help, some times the flowfiles(which have insert statements) get queued and never execute.

Suggestions are highly appreciated.

Re: puthiveQL is very slow

rtheron — Sat, 08 Sep 2018 02:52:55 GMT

The problem here could be that the external table isn't structured to make the filter/split of this file optimal, for example " WHERE department = 'xxx' AND time='yyyy';" executed against a non-partitioned external table causes a complete file scan of the 10gb for each statement (so you're reading the 10GB entirely every time)

You may want to read the file into nifi flow file with a configured buffer as actual data in stead of taking the external table approach, alternatively, you can use an intermediate orc table that inserts the entire external file in some sort of sorted manner, before splitting it into multiple tables based on some filter (which you would optimize for in your intermediate table structure), I'd personally recommend the first approach though

Re: puthiveQL is very slow

mark_hadoop — Wed, 12 Sep 2018 20:57:32 GMT

@rtheron

for some reason I cannot follow the first approach.

I tried creating an intermediate orc with partitions and loaded the data in to it from external table.

now when I load in to the destination from the intermediate table, puthiveql is taking a lot of time.

any suggestions are appreciated.

question Re: puthiveQL is very slow in Archives of Support Questions (Read Only)

puthiveQL is very slow

Re: puthiveQL is very slow

Re: puthiveQL is very slow