07-10-2014 10:43 AM
I have a set of files in an S3 location.
I need to compare the contents of two files (tab-separated values).
Each file will have at least 2 million records.
If I execute the following query, how long will Impala take?
SELECT **** FROM
(SELECT * FROM A A1 WHERE A1.ds IN ('2014-06-11', '2014-06-12')) A1
LEFT OUTER JOIN
(SELECT * FROM B B1 WHERE B1.ds IN ('2014-06-11', '2014-06-12')) B1
WHERE customUDF(A1.data, B1.data) = true
Are there any performance limitations with Impala?
Thanks and Regards,
07-10-2014 10:49 AM
It's nearly impossible to say how long Impala will take. We don't know, for instance:
* How selective your WHERE clauses are (i.e. how many records each part of the plan will have to process)
* How many machines you're going to run on
* How wide your table is
* What the datatypes involved are
And even if we did have those, estimating the runtime would still be guesswork. The very, very best benchmark you can do is to spin up an Impala cluster (Impala is open-source and free!) and run your query on your own data.
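As a rough illustration of why the number of records each part of the plan processes matters so much, here is a back-of-envelope sketch in Python. It assumes each side of the join contributes about 2 million rows after the ds filter, and that, because the join as written has no ON condition, customUDF ends up being evaluated on every pair of rows. The throughput figure is purely hypothetical and is not a measured Impala number.

```python
# Back-of-envelope count of UDF evaluations for a join with no ON condition.
# Assumption: about 2 million rows survive the ds filter on each side, and
# customUDF is evaluated on every (A1, B1) pair.
rows_a = 2_000_000
rows_b = 2_000_000

pairs = rows_a * rows_b  # every row of A paired with every row of B

# Hypothetical throughput, only to put the number in perspective.
udf_calls_per_second = 1_000_000

seconds = pairs / udf_calls_per_second
days = seconds / 86_400  # seconds per day

print(f"{pairs:.1e} pair comparisons")
print(f"about {days:.0f} days at {udf_calls_per_second:,} UDF calls per second")
```

A join condition that limits which rows get paired would cut this number sharply, but whether one applies depends on your data and on what customUDF actually compares.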
I'll also add that Impala is not designed or tested to run on S3, and you may run into problems doing so.