Support Questions

JananiViswa1 · ‎07-14-2016

Hello Everyone!

On Hive 1.1.0 the following query executes for one hour & the table has nearly 2 billion records.

select

C1,

C2,

C3,

C4,

from <tablename>

where NOT ( instr(FileName,'<sometext>')> 0

or instr(FileName,'<sometext>')> 0

or instr(FileName,'<sometext>')>0

or FileName='' )

and FILENAME not like '%<sometext>%'

..........;

Kindly suggest some better ways to optimise the above query.

Thanks In Advance,

Janani

yshi · ‎07-21-2016

@JananiViswa1

From the first galance of this problem, I can see that you have a lot of "like" operators in your query. An like operator incurs regular expression matching, which is very costive, and may cause slowness to the query.

Have you noticed where the slowness happens? Is it within Hive itself, or is it just the MR job runs for a long time? If it is the MR job that slows everything down, please consider reducing the split size of the job and thus using more mappers to process the input data. To do this, please run below commands before the query:

set mapred.min.split.size=63000000;
set mapred.max.split.size=64000000;

If my assumption is wrong, or you still have problem after applying above change, please give me more info so that I can investigate further:

1) Hive log

If you use Hive CLI, please give us the command output

If you use HS2, please give us the HS2 log file or relevant information in it

2) mapreduce job configuration and log file

3) The definition of your source table (output of "show create table <tbl_name>")

View solution in original post

yshi · ‎07-21-2016