Member since: 09-29-2015
Posts: 3
Kudos Received: 0
Solutions: 0
06-10-2016 10:56 AM
Hi Ankit, I'll try to shed some light on the mystical art of using HoS (Hive on Spark). First of all, no, you don't need to install HS2 (HiveServer2) on all nodes in the cluster. Having the Spark Gateway role on all nodes is a good solution; the docs just want to make sure you have one on the same node where HS2 is running. As for the other question, that's also a negative. HS2 uses a server-client architecture, so you can run the client (beeline) on any node in the cluster and connect to HS2 to submit the query for execution. The job (query) will then get executed on whichever nodes the cluster schedules it on.
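To illustrate the client-server point: below is a minimal sketch of connecting to HS2 over JDBC (the same protocol beeline speaks) from any node that can reach the HS2 host. The hostname "hs2-host.example.com", port 10000 (the common HS2 default) and the user are placeholders for your cluster, and the sketch assumes the hive-jdbc driver jar is on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class Hs2ClientSketch {
        public static void main(String[] args) throws Exception {
            // HS2 endpoint -- placeholder host/port, adjust to your cluster.
            String url = "jdbc:hive2://hs2-host.example.com:10000/default";

            // Load the Hive JDBC driver (not strictly needed on JDBC 4+, kept for clarity).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }

The beeline equivalent is simply "!connect jdbc:hive2://hs2-host.example.com:10000/default" from whichever node you happen to be on; HS2 then plans the query and the work runs wherever the cluster schedules it.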
05-05-2016 07:46 PM
Does the code work when submitted through spark-submit? Most probably you have some dependency pulling in an older version of the Hadoop/YARN libraries. Look for hadoop or yarn jar files in your package. Also, the "yarn-version-info.properties" file should contain the version information. CDH 5.5 is based on Hadoop/YARN 2.6.0, and ideally you'll be using the Cloudera-provided dependency packages; the version in that case should be "2.6.0-cdh5.5". More information on the dependency jars that Cloudera provides can be found here: http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_vd_cdh5_maven_repo.html#concept_xxt_m11_d5_unique_2
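A quick way to confirm which Hadoop version (and which jar) your packaged application actually picks up is a small sketch like the one below. VersionInfo is the standard Hadoop utility class; seeing a plain upstream "2.6.0" rather than a "2.6.0-cdh5.5" string would point at a stray dependency. Treat this as a diagnostic sketch, not part of any official tooling.

    import org.apache.hadoop.util.VersionInfo;

    public class HadoopVersionCheck {
        public static void main(String[] args) {
            // Version string baked into the hadoop-common jar on the classpath,
            // e.g. a "2.6.0-cdh5.5" variant when the Cloudera-provided CDH 5.5 artifacts are used.
            System.out.println("Hadoop version: " + VersionInfo.getVersion());

            // Which jar VersionInfo was loaded from -- handy for spotting an older
            // upstream hadoop/yarn jar pulled in by a transitive dependency.
            System.out.println("Loaded from: "
                    + VersionInfo.class.getProtectionDomain().getCodeSource().getLocation());
        }
    }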
11-04-2015 02:36 PM
Hello Venu,

The spill messages and the log snippet indicate that Hive's MapReduce task is using disk to sort data because the buffer allocated for sorting is full. There are a couple of things you can tune, sketched below:

1. Increase the container memory allocated to map tasks (remember to increase the heap size of the map task too!)
2. Increase the sort buffer size (mapreduce.task.io.sort.mb)

Hope this helps.
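As a rough sketch of what that tuning can look like from a Hive session (over JDBC, the same way beeline talks to HS2): the property names are the standard MapReduce ones, but the endpoint and the numbers below are placeholders rather than recommendations, so size them to your own containers and workload.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveSortTuningSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder HS2 endpoint -- adjust to your cluster.
            String url = "jdbc:hive2://hs2-host.example.com:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // 1. Bigger map container, with the map JVM heap raised to match
                //    (heap is commonly kept at roughly 75-80% of the container size).
                stmt.execute("SET mapreduce.map.memory.mb=2048");
                stmt.execute("SET mapreduce.map.java.opts=-Xmx1638m");

                // 2. Larger in-memory sort buffer so fewer spills go to disk.
                stmt.execute("SET mapreduce.task.io.sort.mb=512");

                // Queries run in this session now pick up the settings, e.g.:
                // stmt.execute("<your original query>");
            }
        }
    }

The same SET statements can be issued directly in a beeline session, or made permanent through the cluster configuration if the workload consistently needs the larger buffers.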