Impala Performance benchmark

Hello All,

I have a set of files in an S3 location.

I need to compare the contents of two files (tab-separated values).

Each file will have at least 2 million records.

If I execute SELECT **** FROM (SELECT * FROM A A1 WHERE A1.ds IN ('2014-06-11', '2014-06-12')) A1 LEFT OUTER JOIN (SELECT * FROM B B1 WHERE B1.ds IN ('2014-06-11', '2014-06-12')) B1 WHERE customUDF(A1.data, B1.data) = true,

how long will Impala take?

Are there any performance limitations with Impala?

 

Thanks and Regards,

SankarS

 

 

Re: Impala Performance benchmark

Hi - 

 

It's nearly impossible to say how long Impala will take. We don't know, for instance:

 

* How selective your WHERE clauses are (i.e. how many records each part of the plan will have to process)

* How many machines you're going to run on

* How wide your table is

* What the datatypes involved are
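
To make the first point concrete, here is a rough back-of-the-envelope sketch in Python. It is not a runtime prediction. It assumes 2 million rows per table (the figure from the question), that the UDF predicate is the only condition relating the two sides so every surviving pair may have to be checked, and some purely hypothetical selectivity values. The function names are made up for illustration.

```python
def filtered_rows(total_rows: int, selectivity: float) -> int:
    """Rows left after a filter that keeps a `selectivity` fraction (0..1) of the input."""
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return round(total_rows * selectivity)


def worst_case_pairs(rows_a: int, rows_b: int) -> int:
    """Upper bound on row pairs when no equality condition narrows the join."""
    return rows_a * rows_b


if __name__ == "__main__":
    TOTAL_ROWS = 2_000_000  # from the question; the real row count may differ
    for selectivity in (1.0, 0.1, 0.01):  # hypothetical values, not measured
        a = filtered_rows(TOTAL_ROWS, selectivity)
        b = filtered_rows(TOTAL_ROWS, selectivity)
        print(f"selectivity={selectivity}: {a} rows per side, "
              f"up to {worst_case_pairs(a, b)} UDF evaluations")
```

The spread between those cases is several orders of magnitude, which is why the selectivity of the filters matters so much here.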

 

And even if we did have those, estimating the runtime would still be guesswork. The very, very best benchmark you can do is to spin up an Impala cluster (Impala is open-source and free!) and run your query on your own data. 
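
If it helps, a minimal timing harness for that kind of test might look like the sketch below. It only uses the Python standard library. `run_query` is a hypothetical placeholder for whatever function submits your query to the cluster and fetches all results (for example through a client library of your choice); nothing here is Impala-specific.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_query: Callable[[], object], runs: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Time `run_query` over several runs and summarize the wall-clock results."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up runs are executed but not timed, so one-off startup costs
    # and cold caches do not skew the measurements.
    for _ in range(warmup):
        run_query()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)

    return {
        "runs": runs,
        "min_s": min(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a function that runs your real query.
    def fake_query() -> None:
        time.sleep(0.01)

    print(benchmark(fake_query, runs=3))
```

Reporting the median alongside the minimum and maximum gives a sense of how much the runtime varies from run to run, which a single measurement would hide.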

 

I'll also add that Impala is not designed or tested to run on S3, and you may run into problems doing so.

 

Best,

Henry