
Hive on Spark Queries are not working

Contributor

I have installed Spark and configured Hive to use it as the execution engine.

 

Select * from <table_name> works fine.

 

But select count(*) from <table_name> fails with the following error:

 

  • Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
 
At times I also got an error stating "Failed to create spark client".
 
I have also tried modifying the memory parameters, but to no avail. Can you please tell me what the ideal memory settings should be?
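For context, the kind of parameters I have been changing are the Hive on Spark executor and driver memory settings; the values below are only placeholders for illustration, not what I actually have set:

-- placeholder values only; these need to fit the YARN container limits
set spark.executor.memory=2g;
set spark.yarn.executor.memoryOverhead=512;
set spark.driver.memory=1g;
set spark.executor.cores=2;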
 
Below is the directory structure from HDFS:
 

drwxr-xr-x   - admin    admin               0 2017-07-28 16:36 /user/admin

drwx------   - ec2-user supergroup          0 2017-07-28 17:50 /user/ec2-user

drwxr-xr-x   - hdfs     hdfs                0 2017-07-28 11:37 /user/hdfs

drwxrwxrwx   - mapred   hadoop              0 2017-07-16 06:03 /user/history

drwxrwxr-t   - hive     hive                0 2017-07-16 06:04 /user/hive

drwxrwxr-x   - hue      hue                 0 2017-07-28 10:16 /user/hue

drwxrwxr-x   - impala   impala              0 2017-07-16 07:13 /user/impala

drwxrwxr-x   - oozie    oozie               0 2017-07-16 06:05 /user/oozie

drwxr-x--x   - spark    spark               0 2017-07-28 17:17 /user/spark

drwxrwxr-x   - sqoop2   sqoop               0 2017-07-16 06:37 /user/sqoop2

 

The /user directory is owned by ec2-user with group supergroup.

 

I tried running the query from the Hive CLI:

 

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.

hive> select count(*) from kaggle.test_house;

Query ID = ec2-user_20170728174949_aa9d7be9-038c-44a0-a42b-1b210a37f4ec

Total jobs = 1

Launching Job 1 out of 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark client.)'

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

4 REPLIES

Champion
The reason the first query works is that it does not need any MR or Spark jobs to run; HS2 or the Hive client just reads the data directly. The second query requires an MR or Spark job to be run. This is key to remember when testing or troubleshooting the cluster.
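For example, with the table from your post:

-- answered by a simple fetch; no MR/Spark job is launched
select * from kaggle.test_house limit 10;
-- requires an aggregation, so Hive must launch an MR/Spark job
select count(*) from kaggle.test_house;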

Are you able to run Spark jobs outside of Hive?

Try the command below, but swap in the path and version of your jar.

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 /opt/cloudera/parcels/SPARK/lib/spark/examples/jars/spark-examples_*.jar

Also, check the Spark History Server for the driver and executor logs to get more details on the failure.
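If the application made it to YARN, you can also pull the aggregated logs from the command line once it finishes (substitute the real application ID from the ResourceManager UI):

yarn logs -applicationId <application_id>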

Contributor

Thank you for the reply. 

 

I did not have the SPARK parcel at that location; I had SPARK2. After running the command, I get the error below.

 

[ec2-user@ip-172-31-37-124 jars]$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.2.0.cloudera1.jar 

WARNING: User-defined SPARK_HOME (/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/spark) overrides detected (/usr/lib/spark).

WARNING: Running spark-class from user-defined location.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$

at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28)

at org.apache.spark.examples.SparkPi.main(SparkPi.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)

at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$

at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

at java.lang.ClassLoader.loadClass(ClassLoader.java:425)

at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

... 11 more
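It looks like the Spark 2 examples jar is being launched through the Spark 1.x spark-submit (note the SPARK_HOME warning above pointing at the CDH 5.12 Spark parcel). Assuming the SPARK2 parcel's gateway is installed, the matching launcher would be spark2-submit, along these lines:

spark2-submit --class org.apache.spark.examples.SparkPi --master yarn --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_2.11-2.2.0.cloudera1.jar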

Contributor

The Spark job is now getting submitted, but I am getting the following error:

 

 

hive> select count(*) from kaggle.test_house;

Query ID = ec2-user_20170729070303_887365d6-ce92-4ec3-bc8a-2adf3cfec117

Total jobs = 1

Launching Job 1 out of 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Spark Job = 614015ef-31f9-4e14-9b71-c161f64916db

Job hasn't been submitted after 61s. Aborting it.

Possible reasons include network issues, errors in remote driver or the cluster has no available resources, etc.

Please check YARN or Spark driver's logs for further information.

Status: SENT

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
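The error text points at either YARN having no free resources for the application or the Hive-side submission timeout being hit. For reference, the relevant Hive properties appear to be the following (values are placeholders only, and the YARN ResourceManager UI is the first place to confirm whether containers are actually being allocated):

-- placeholder values only; property names assume a stock Hive on Spark setup
set hive.spark.job.monitor.timeout=120s;
set hive.spark.client.server.connect.timeout=300000;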

New Contributor

Hi,

 

I am also getting the same error below when running Hive on Spark using IBM DataStage.

 

main_program: Fatal Error: The connector received an error from the driver. The reported error is: [SQLSTATE HY000] java.sql.SQLException: [IBM][Hive JDBC Driver][Hive]Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask.

 

Were you able to resolve the issue?

 

Thanks,

Jalaj