Contributor
Posts: 49
Registered: ‎01-05-2016

Oozie Spark Action - Can't read from Hive table created as External Table pointing to Hbase table

Hello, I am on CDH 5.7

 

I have an HBase table. To be able to access it from Impala/Hive, I created a Hive external table pointing to it via the HBaseStorageHandler, using the following DDL:

 

CREATE EXTERNAL TABLE hbase_utenti(key string, val1 string, ... , valX string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:a,cf1:b, ... ,cfX:X")
TBLPROPERTIES("hbase.table.name" = "utenti");

This WORKS smoothly; I can query the table from both Impala and Hive.

 

Now I have the following PySpark script (just an excerpt) that I want to run via an Oozie Spark Action:

 

## IMPORT FUNCTIONS
import sys
import getopt

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.types import *

## CREATE SQLCONTEXT
sconf = SparkConf().setAppName("ChurnVoip").setMaster("yarn-cluster").set("spark.driver.memory", "3g").set("spark.executor.memory", "1g").set("spark.executor.cores", 2)
sc = SparkContext(conf=sconf)
sqlContext = HiveContext(sc)

## CREATE MAIN DATAFRAME
hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")

## SAVE HBASE TABLE AS "REAL" HIVE TABLE
hbase_utenti_filtered_DF = hbase_utenti_DF \
.select(hbase_utenti_DF["USERID"].alias("HHUTE_USERID"), \
        hbase_utenti_DF["RIVENDITORE"].alias("HHUTE_RIVENDITORE"), \
        hbase_utenti_DF["DATA_COLL"].alias("HHUTE_DATA_COLL"), \
        hbase_utenti_DF["CREDITO"].alias("HHUTE_CREDITO"), \
        hbase_utenti_DF["VIRTUAL"].alias("HHUTE_VIRTUAL"), \
        hbase_utenti_DF["POST"].alias("HHUTE_POST")) \
.filter("((HHUTE_RIVENDITORE = 1) and \
        (HHUTE_DATA_COLL='2016-12-30T00:00:00Z'))") \
.saveAsTable('msgnet.analisi_voipout_utenti_credito_2016_rif')

Please note that if I run the above Python script via "spark-submit", or if I open the PySpark shell and paste the code into it, IT WORKS smoothly.

 

 

The problem arises when I try to schedule an Oozie Spark Action to do the job. In that case, I get the following exception:

 

...
2017-02-05 10:47:52,106 ERROR [Thread-8] hive.log (MetaStoreUtils.java:getDeserializer(387)) - error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
...
hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")
  File "/hdp01/yarn/nm/usercache/msgnet/appcache/application_1486053545979_0053/container_1486053545979_0053_02_000001/pyspark.zip/pyspark/sql/context.py", line 593, in table
---
py4j.protocol.Py4JJavaError: An error occurred while calling o56.table.
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler

Now, I'm pretty sure the whole problem boils down to jars not being included on the classpath, but I've tried again and again to add paths and variables to my Oozie Spark Action, with no success. What I've tried includes the following:

 

- Adding the following "Hadoop Properties" to my Workflow Settings (gear icon on top right of the Workflow in Hue):

 

spark.executor.extraClassPath    /opt/cloudera/parcels/CDH/lib/hive/lib/*
spark.executor.extraClassPath    /opt/cloudera/parcels/CDH/lib/hbase/lib/*

- Adding the following to my Spark Action's options list:

 

--files

hdfs:///user/<XXX>/hive-site.xml,
hdfs:///user/<XXX>/hive-hbase-handler-1.1.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/hbase-client-1.2.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/hbase-server-1.2.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/hbase-hadoop-compat-1.2.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/hbase-hadoop2-compat-1.2.0-cdh5.7.0.jar,
hdfs:///user/<XXX>/metrics-core-2.2.0.jar

--driver-class-path /etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/*
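
For reference, the Spark action that Hue generates in workflow.xml ends up looking roughly like this (just a sketch from memory, trimmed and with values anonymized, not the exact XML; the options above all go into <spark-opts>):

<action name="spark-churnvoip">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <name>ChurnVoip</name>
        <jar>hdfs:///user/<XXX>/AnalisiVoipOut1.py</jar>
        <spark-opts>--files hdfs:///user/<XXX>/hive-site.xml,... --driver-class-path /etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/*</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="kill"/>
</action>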

Can anybody please shed some light on what I'm missing here? I'm really out of ideas at this point.

 

Thanks for any insight!

 

 

 
Contributor
Posts: 49
Registered: ‎01-05-2016

Re: Oozie Spark Action - Can't read from Hive table created as External Table pointing to Hbase table


Update: by adding the following comma-separated list of jars to the Spark Action's "Options list", I've been able to get past the Java exceptions:

 

--jars hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hive-hbase-handler-1.1.0-cdh5.7.0.jar,hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hbase-server-1.2.0-cdh5.7.0.jar,hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hbase-client-1.2.0-cdh5.7.0.jar

Now I'm facing another problem. The workflow still fails, and in the Spark Action's log I see the following error:

 

Traceback (most recent call last):
  File "AnalisiVoipOut1.py", line 42, in <module>
...
pyspark.sql.utils.AnalysisException: u'path hdfs://myhostname.mydomain.it:8020/user/hive/warehouse/xxx.db/analisi_voipout_utenti_credito_2016_rif already exists.;'
2017-02-05 22:46:51,405 ERROR [Driver] yarn.ApplicationMaster (Logging.scala:logError(74)) - User application exited with status 1

Now, I can confirm that even when I change the table's name every time (so THAT PATH DOESN'T EXIST), I still get the error every time I run the workflow.

 

I also thought that, since I was using 4 executors, maybe for some strange reason they were conflicting with each other while creating the output directory on HDFS. So I also made several attempts with just one executor, but with no success.
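
One variant I still have to try (just a sketch, assuming the standard Spark 1.6 DataFrameWriter API, with the same names as in my script above) is to go through df.write with an explicit save mode instead of the deprecated DataFrame.saveAsTable, so that a leftover path from a previous failed run gets overwritten instead of triggering the "already exists" check:

## same select/filter as in the script above, kept as a DataFrame this time
hbase_utenti_filtered_DF = hbase_utenti_DF.select(
        hbase_utenti_DF["USERID"].alias("HHUTE_USERID"),
        hbase_utenti_DF["RIVENDITORE"].alias("HHUTE_RIVENDITORE"),
        hbase_utenti_DF["DATA_COLL"].alias("HHUTE_DATA_COLL"),
        hbase_utenti_DF["CREDITO"].alias("HHUTE_CREDITO"),
        hbase_utenti_DF["VIRTUAL"].alias("HHUTE_VIRTUAL"),
        hbase_utenti_DF["POST"].alias("HHUTE_POST")
    ).filter("HHUTE_RIVENDITORE = 1 and HHUTE_DATA_COLL = '2016-12-30T00:00:00Z'")

## "overwrite" should replace any leftover data/path instead of failing on it
hbase_utenti_filtered_DF.write.mode("overwrite") \
    .saveAsTable("msgnet.analisi_voipout_utenti_credito_2016_rif")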

 

This "HBaseStorageHandler" thing is really puzzling me, can anybody help me in getting past this last error?

 

Thanks

Posts: 642
Topics: 3
Kudos: 121
Solutions: 67
Registered: ‎08-16-2016

Re: Oozie Spark Action - Can't read from Hive table created as External Table pointing to Hbase table

I have not seen this error, but I have seen other issues and strange behavior when using saveAsTable with the HiveContext. I always create the table schema ahead of time in Hive and then just use the write method.

That is not a great answer but it is my experience. It doesn't explain why running with spark-submit and pyspark shell didn't have the issue.

You may have more going on in your Spark code, but what is shown here can be done with a one-line SQL command, so this seems like forcing it into Spark (especially when you can set Hive's execution engine to Spark).
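
In your case that would look roughly like this (just a sketch, assuming the target table has already been created in Hive with a matching schema; "filtered_df" stands for the DataFrame produced by your select/filter chain, and table/column names are taken from your posts):

## insert into the pre-created Hive table instead of letting saveAsTable create it;
## filtered_df is the result of your select(...).filter(...) chain, before any save call
filtered_df.write.insertInto("msgnet.analisi_voipout_utenti_credito_2016_rif")

## or stay in SQL entirely, through the same HiveContext
sqlContext.sql("""
    INSERT OVERWRITE TABLE msgnet.analisi_voipout_utenti_credito_2016_rif
    SELECT USERID, RIVENDITORE, DATA_COLL, CREDITO, VIRTUAL, POST
    FROM msgnet.hbase_utenti
    WHERE RIVENDITORE = 1 AND DATA_COLL = '2016-12-30T00:00:00Z'
""")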
Contributor
Posts: 49
Registered: ‎01-05-2016

Re: Oozie Spark Action - Can't read from Hive table created as External Table pointing to Hbase table

Hi, thanks for your reply. I have tried creating the table first, and indeed the overall behaviour changed. Now I get a different exception, which I'm pasting just below.

 

The strange thing is that the class referenced in the exception, "ClientBackoffPolicyFactory", IS present: it lives in the hbase-client jar, which is in "--jars" as detailed in my earlier posts above.
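
(To double-check, listing the contents of a local copy of that jar with something like the following should show the class, including the NoBackoffPolicy inner class the stack trace complains about:)

# look for the class inside the hbase-client jar shipped via --jars
jar tf hbase-client-1.2.0-cdh5.7.0.jar | grep ClientBackoffPolicyFactory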

 

Here is the main excerpt from the error stack:

 

 

2017-02-06 19:31:51,426 WARN  [Thread-8] ipc.RpcControllerFactory (RpcControllerFactory.java:instantiate(78)) - Cannot load configured "hbase.rpc.controllerfactory.class" (org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory) from hbase-site.xml, falling back to use default RpcControllerFactory
2017-02-06 19:31:51,431 ERROR [Thread-8] datasources.InsertIntoHadoopFsRelation (Logging.scala:logError(95)) - Aborting job.
java.io.IOException: java.lang.reflect.InvocationTargetException

...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
2017-02-06 19:31:51,538 ERROR [Driver] yarn.ApplicationMaster (Logging.scala:logError(74)) - User application exited with status 1

 

Given that "hbase-site.xml" is mentioned in the error stack, I'm also pasting that file just below:

 

<?xml version="1.0" encoding="UTF-8"?>

<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://xxx01.yyy.it:8020/hbase</value>
  </property>
  <property>
    <name>hbase.replication</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.client.write.buffer</name>
    <value>2097152</value>
  </property>
  <property>
    <name>hbase.client.pause</name>
    <value>100</value>
  </property>
  <property>
    <name>hbase.client.retries.number</name>
    <value>35</value>
  </property>
  <property>
    <name>hbase.client.scanner.caching</name>
    <value>100</value>
  </property>
  <property>
    <name>hbase.client.keyvalue.maxsize</name>
    <value>10485760</value>
  </property>
  <property>
    <name>hbase.ipc.client.allowsInterrupt</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.client.primaryCallTimeout.get</name>
    <value>10</value>
  </property>
  <property>
    <name>hbase.client.primaryCallTimeout.multiget</name>
    <value>10</value>
  </property>
  <property>
    <name>hbase.regionserver.thrift.http</name>
    <value>false</value>
  </property>
  <property>
    <name>hbase.thrift.support.proxyuser</name>
    <value>false</value>
  </property>
  <property>
    <name>hbase.rpc.timeout</name>
    <value>60000</value>
  </property>
  <property>
    <name>hbase.snapshot.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.snapshot.master.timeoutMillis</name>
    <value>60000</value>
  </property>
  <property>
    <name>hbase.snapshot.region.timeout</name>
    <value>60000</value>
  </property>
  <property>
    <name>hbase.snapshot.master.timeout.millis</name>
    <value>60000</value>
  </property>
  <property>
    <name>hbase.security.authentication</name>
    <value>simple</value>
  </property>
  <property>
    <name>hbase.rpc.protection</name>
    <value>authentication</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
  </property>
  <property>
    <name>zookeeper.znode.rootserver</name>
    <value>root-region-server</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>xxx02.yyy.it,xxx01.yyy.it,xxx03.yyy.it</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.rest.ssl.enabled</name>
    <value>false</value>
  </property>
</configuration>

 

Thanks for any help... Meanwhile, I'll go on testing things, but maybe I'll try to find a workaround and do something completely different. This is starting to be a bit too much for my skills / patience :)
