About matt_andruff

matt_andruff · ‎09-25-2016

How are you launching the Spark job? FYI: the warnings are just letting you know that you haven't set up the environment variables. You can fix that by checking your environment settings and correcting them. I'm pretty sure something like the line below would remove the warnings: export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH
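A slightly more defensive variant of that export, as a minimal sketch: if LD_LIBRARY_PATH starts out unset, the plain form leaves a trailing colon, which the dynamic linker reads as "also search the current directory". The variable name HADOOP_NATIVE_DIR is made up for illustration, and the directory is simply the one quoted above, so adjust it to wherever your native libraries actually live.

```shell
# Hypothetical variable name; the path is the one from the post and may
# differ on your cluster.
HADOOP_NATIVE_DIR="/usr/local/hadoop/lib/native"

# Prepend the native library directory. The ${VAR:+...} expansion adds the
# ":" separator only when LD_LIBRARY_PATH already has a value, so an unset
# variable does not end up with a trailing colon.
export LD_LIBRARY_PATH="${HADOOP_NATIVE_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

echo "$LD_LIBRARY_PATH"
```

Putting this in the profile of the user that launches the job (or in whatever environment file your launcher reads) makes it stick between sessions.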

matt_andruff · ‎09-13-2016

From what you have told me, it sounds like you should use a facet pivot. Specifically, check out "Combining Facet Queries And Facet Ranges With Pivot Facets"; it has an example that fits right into what you want to do.

matt_andruff · ‎09-12-2016

What do you want to do with the results? Are you going to summarize them? If so, you may want to look at pivoting a facet. If you just want to group them, don't forget to set a limit so a reasonable result is returned. Let me know if this is what you were looking for.

matt_andruff · ‎09-09-2016

I'm totally in agreement with @Constantin Stanca that in SQL 'IN' is for static lists and 'EXISTS' is for dynamic data sets. His point that they are different use cases is solid. EXISTS will not short-circuit in Hive with the default MapReduce engine, because MapReduce jobs don't have a short-circuit mechanism. (That said, if you run Hive on a different engine such as Spark or Tez, many of those engines do use optimizations, partitioning, and other strategies to pull back only the required data.) I tested a small query on a small data set, and 'IN' and 'EXISTS' ran in exactly the same time on MapReduce. When I looked at the execution plans for both queries (using a subquery for both EXISTS and IN), they were functionally equivalent. So from a speed perspective it doesn't matter which one you use on MapReduce, as long as you are using a subquery. To me this lends more strength to @Constantin Stanca's statement: follow the SQL convention.

matt_andruff · ‎09-08-2016

@Constantin Stanca Can you explain in a little more detail? 'EXISTS' does have to do two table scans and join the results. (And yes, it can short-circuit the boolean logic.) 'IN' only does one table scan and no join. Is it because the work is better distributed via reducers with 'EXISTS'?

matt_andruff · ‎09-07-2016

You know, I don't know. The fastest way would be to try both and see which is faster. In general you could use EXPLAIN to compare the two, but you do need to understand how to read its output; it's not that intuitive. (I would skip trying to look at the syntax tree and focus on the dependencies.) If you are concerned about optimizing your query, read over this article, which highlights some quick wins for speeding up Hive.

matt_andruff · ‎09-02-2016

That looks better than what I proposed. Thumbs up!

matt_andruff · ‎09-02-2016

Hey Mohan, it might be cleaner to just write a UDF. I certainly feel like Pig has purposely left this functionality out to encourage the use of datetime over the use of epochs. Here's one way to achieve what you want to do:

a = load 'test.data';
b = foreach a GENERATE *, CONCAT('epochtime_', (chararray)MilliSecondsBetween(CurrentTime(), ToDate(0)));
dump b;

Hope this helps.

matt_andruff · ‎09-01-2016

The answer is that it all depends on how YARN is set up for queues. All the tools (Sqoop, Pig, Hive) have a way of specifying the queue on the command line (example). If you are using Hue, it can even be set up to impersonate your user. So you really do need to understand how YARN is set up for queuing. You don't need to configure the queue if YARN isn't configured for queues. If it is, you have to read the configuration to know what will happen.
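For reference, here is a rough sketch of what passing the queue on the command line tends to look like. It assumes the MapReduce engine, where the property is mapreduce.job.queuename (on Tez the equivalent is tez.queue.name). The function name, queue name, and script names are all made up for illustration, and the commands are only printed, not run, so check the option placement against your own versions before relying on it.

```shell
# Dry run: prints example commands instead of executing them, so it works
# on a machine without the Hadoop clients installed.
# print_queue_commands, "analytics", myscript.pig and myquery.hql are
# hypothetical names used only for this illustration.
print_queue_commands() {
  queue="$1"
  # Pig and Sqoop accept Hadoop properties as -D options.
  printf '%s\n' "pig -Dmapreduce.job.queuename=${queue} myscript.pig"
  printf '%s\n' "sqoop import -Dmapreduce.job.queuename=${queue} <your import options>"
  # Hive takes properties through --hiveconf.
  printf '%s\n' "hive --hiveconf mapreduce.job.queuename=${queue} -f myquery.hql"
}

print_queue_commands "analytics"
```

If the queue name you pass does not exist in the scheduler configuration, what happens next depends on how YARN is configured, which is why reading that configuration comes first.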

matt_andruff · ‎09-01-2016

I like my answer, but you should also check out https://community.hortonworks.com/questions/394/what-are-best-practices-for-setting-up-backup-and.html

Member Since	‎08-10-2016 12:09 PM
Last Visited	‎08-12-2019 05:02 PM
Posts	170
Kudos received	14

Cloudera Community

Re: Kerberos: Failure to initialize security conte...

Re: Can I add a subcolumn to a hive struct column ...

Re: Cloudbreak verbose logging - RuntimeException:...

Re: Solr query question

Re: How to get EPOCH time in PIG?

Re: Spark using all resources on Cluster

Re: Solr query question

Re: Solr query question

Re: Exists or IN which performs better

Re: Exists or IN which performs better

Re: Exists or IN which performs better

Re: How to get EPOCH time in PIG?

Re: How to get EPOCH time in PIG?

Re: Is configuring queue optional for Oozie action...

Re: DR/replication strategy using distcp + Oozie