Member since 08-10-2016
170 Posts
14 Kudos Received
6 Solutions
My Accepted Solutions
Views | Posted |
---|---|
19959 | 01-31-2018 04:55 PM |
4281 | 11-29-2017 03:28 PM |
1898 | 09-27-2017 02:43 PM |
2052 | 09-12-2016 06:36 PM |
1981 | 09-02-2016 01:58 PM |
09-25-2016
06:31 PM
How are you launching the Spark job? FYI: the warnings are just letting you know that you haven't set up the environment variables. You can fix that by checking your environment settings and correcting them. I'm pretty sure something like the following would remove the warnings: export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH
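A slightly more defensive version of that export, as a sketch. It assumes Hadoop lives under /usr/local/hadoop (the `HADOOP_HOME` default here is my assumption, so adjust it to your install) and it avoids leaving a stray colon when `LD_LIBRARY_PATH` was empty to begin with. If the `hadoop` CLI is available, `hadoop checknative -a` reports which native libraries were actually found.

```shell
# Assumption: Hadoop is installed under /usr/local/hadoop unless HADOOP_HOME says otherwise.
HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
NATIVE_DIR="$HADOOP_HOME/lib/native"

# Prepend the native dir; add the ":" separator only if the variable already had a value.
export LD_LIBRARY_PATH="${NATIVE_DIR}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# Optional check, skipped when the hadoop CLI is not on the PATH.
if command -v hadoop >/dev/null 2>&1; then
  hadoop checknative -a || true
fi
```

Run this in the same shell that launches the job, or put the export in your profile so it is picked up every time.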
09-13-2016
06:22 PM
2 Kudos
From what you have told me, it sounds like you should use a pivot facet. Specifically, check out "Combining Facet Queries And Facet Ranges With Pivot Facets"; it has an example that fits right into what you want to do.
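Roughly, a request along those lines looks like the sketch below. The Solr URL, the collection name, and the `price`, `category`, and `brand` fields are placeholders I made up; the `{!tag=...}` and `{!range=...}` local params are what tie the range facet to the pivot.

```shell
# Placeholders: point SOLR_URL at your own collection and swap in your own fields.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/mycollection}"
RANGE_FACET='facet.range={!tag=r1}price'             # tag the range facet as "r1"
PIVOT_FACET='facet.pivot={!range=r1}category,brand'  # attach range r1 to each pivot bucket

solr_pivot_query() {
  # -G sends the url-encoded parameters as a GET query string.
  curl -sS -G "$SOLR_URL/select" \
    --data-urlencode 'q=*:*' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode "$RANGE_FACET" \
    --data-urlencode 'f.price.facet.range.start=0' \
    --data-urlencode 'f.price.facet.range.end=1000' \
    --data-urlencode 'f.price.facet.range.gap=100' \
    --data-urlencode "$PIVOT_FACET" \
    --data-urlencode 'facet.limit=10'
}
```

Call `solr_pivot_query` once Solr is reachable; each `category,brand` pivot bucket in the response should then carry its own `price` range counts.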
09-12-2016
06:36 PM
1 Kudo
What do you want to do with the results? Are you going to summarize them? If so, you may want to look at pivot facets for summarizing the result. If you just want to group them, don't forget to set a limit so a reasonably sized result is returned. Let me know if this is what you were looking for.
09-09-2016
12:58 PM
I'm totally in agreement with @Constantin Stanca that in SQL 'IN' is for static lists and 'EXISTS' is for dynamic data sets. His point that they are different use cases is solid. EXISTS will not short-circuit in Hive with the default MapReduce engine, because MapReduce jobs have no short-circuit mechanism. (That said, if you run Hive on a different engine such as Spark or Tez, many of those engines do use optimizations, partitioning, and other strategies to pull back only the required data.) I tested a small query on a small data set, and 'IN' and 'EXISTS' ran in exactly the same time on MapReduce. When I looked at the execution plans for both queries (using a subquery for both EXISTS and IN), they were functionally equivalent. So if you are using a subquery on MapReduce, it doesn't matter from a speed perspective which one you use. To me this lends more strength to @Constantin Stanca's statement: follow the SQL convention.
09-08-2016
04:04 PM
@Constantin Stanca Can you explain in a little more detail? 'EXISTS' does have to do two table scans and join the results (and yes, it can short-circuit the boolean logic). 'IN' only does one table scan and no join. Is it because the work is better distributed via reducers with 'EXISTS'?
09-07-2016
07:16 PM
You know, I don't know. The fastest way to find out would be to try both and see which is faster. In general you could use EXPLAIN to compare the two, but you do need to understand how to read its output; it's not that intuitive. (I would skip the syntax tree and focus on the stage dependencies.) If you are concerned about optimizing your query, read over this article, which highlights some quick wins for speeding up Hive.
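If you want to do the comparison yourself, something like this sketch would dump both plans so you can diff them. The `orders` and `customers` tables and their columns are hypothetical; substitute your own queries.

```shell
# Hypothetical tables: orders(customer_id, ...) and customers(id, ...).
IN_QUERY="SELECT o.* FROM orders o WHERE o.customer_id IN (SELECT c.id FROM customers c)"
EXISTS_QUERY="SELECT o.* FROM orders o WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)"

# Only runs when the hive CLI is on the PATH.
if command -v hive >/dev/null 2>&1; then
  hive -e "EXPLAIN ${IN_QUERY}" > in_plan.txt
  hive -e "EXPLAIN ${EXISTS_QUERY}" > exists_plan.txt
  # diff exits non-zero when the plans differ, which is not an error here.
  diff in_plan.txt exists_plan.txt || true
fi
```

An empty or near-empty diff suggests the two forms compile to the same stages on your engine; timing both on real data is still the final word.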
09-02-2016
01:58 PM
Hey Mohan,
It might be cleaner to just write a UDF. I certainly feel like Pig has purposely left this functionality out to encourage the use of datetime over the use of epochs.
Here's one way to achieve what you want to do:
a = load 'test.data';
-- milliseconds between now and the epoch (ToDate(0)), cast to chararray and prefixed
b = foreach a GENERATE *, CONCAT('epochtime_', (chararray)MilliSecondsBetween(CurrentTime(), ToDate(0)));
dump b;
Hope this helps.
09-01-2016
02:37 AM
1 Kudo
The answer is that it all depends on how YARN is set up for queues. All the tools (Sqoop, Pig, Hive) have a way of specifying the queue via the command line (example). If you are using Hue, it can even be set up to impersonate your user. So you really do need to understand how YARN is set up for queuing. You don't need to configure the queue if YARN isn't configured for queues. If it is, then you have to read the configuration to know what will happen.
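As a rough sketch of what specifying the queue on the command line can look like for MapReduce-backed jobs: the wrapper function names below are made up for illustration, and `default` is only the queue a stock Capacity Scheduler setup ships with, so yours may differ. Hive on Tez reads `tez.queue.name` instead of the MapReduce property.

```shell
# Assumption: "default" is a valid queue on your cluster; override with QUEUE=yourqueue.
QUEUE="${QUEUE:-default}"

# Each wrapper passes the queue as a Hadoop property and forwards the remaining arguments.
pig_in_queue() {
  pig -Dmapreduce.job.queuename="$QUEUE" "$@"
}

sqoop_import_in_queue() {
  # Generic -D options have to come right after the tool name.
  sqoop import -Dmapreduce.job.queuename="$QUEUE" "$@"
}

hive_in_queue() {
  hive --hiveconf mapreduce.job.queuename="$QUEUE" "$@"
}
```

For example, `QUEUE=etl` followed by `pig_in_queue myscript.pig` would submit the script to a queue named `etl`, if one exists. `mapred queue -list` shows which queues the cluster actually defines.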
09-01-2016
02:14 AM
I like my answer, but you should also check out https://community.hortonworks.com/questions/394/what-are-best-practices-for-setting-up-backup-and.html