New Contributor
Posts: 1
Registered: ‎07-10-2014

Impala Performance benchmark

Hello All,

I have a set of files in an S3 location.

I need to compare the contents of two files (tab-separated values).

Each file will have at least 2 million records.

If I execute the query Select **** FROM (SELECT * FROM A A1 WHERE A1.ds in ('2014-06-11', '2014-06-12') ) A1 LEFT OUTER JOIN (SELECT * FROM B B1 WHERE B1.ds in ('2014-06-11', '2014-06-12') ) B1 Where customUDF(A1.data, B1.data) = true, how much time will Impala take?

 

Are there any limitations in Impala with respect to performance?

 

Thanks and Regards,

SankarS

 

 

Cloudera Employee
Posts: 40
Registered: ‎08-15-2013

Re: Impala Performance benchmark

Hi - 

 

It's nearly impossible to say how long Impala will take. We don't know, for instance:

 

* How selective your WHERE clauses are (i.e. how many records each part of the plan will have to process)

* How many machines you're going to run on

* How wide your table is

* What the datatypes involved are

 

And even if we did have those, estimating the runtime would still be guesswork. The very, very best benchmark you can do is to spin up an Impala cluster (Impala is open-source and free!) and run your query on your own data. 
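If it helps, here is a rough sketch of how such a benchmark could be timed from a client machine. It is only an illustration: `run_query` is a hypothetical callable that you would write yourself to submit the query to your cluster and fetch all of the results (for example via impala-shell or a client library), and `time_query` and `summarize` are names made up for this sketch. Running the query several times and looking at the first run separately from the median helps distinguish cold-cache from warm-cache behaviour.

```python
import statistics
import time


def time_query(run_query, repeats=3):
    """Return the wall-clock seconds taken by each of `repeats` calls to run_query()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: submits the SQL and fetches every result row
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report the first run (possibly cold caches) separately from the median."""
    return {"first": timings[0], "median": statistics.median(timings)}
```

You would call it as `summarize(time_query(my_run_query, repeats=5))`, where `my_run_query` is your own function. Make sure that function actually consumes the whole result set, otherwise the timing may only cover query submission.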

 

I'll also add that Impala is not designed or tested to run on S3, and you may run into problems doing so.

 

Best,

Henry
