Created 02-05-2017 02:58 AM
Hello, I am on CDH 5.7
I have an HBase table. To access it from Impala/Hive, I created a Hive external table pointing to it via HBaseStorageHandler, with the following statement:
CREATE EXTERNAL TABLE hbase_utenti (key string, val1 string, ... , valX string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:a,cf1:b, ... ,cfX:X")
TBLPROPERTIES ("hbase.table.name" = "utenti");
This WORKS smoothly: I can query the table from both Impala and Hive.
Now I have the following PySpark script that I want to run via a Spark Action (just an excerpt):
## IMPORT FUNCTIONS
import sys
import getopt
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.types import *

## CREATE SQLCONTEXT
sconf = SparkConf().setAppName("ChurnVoip") \
    .setMaster("yarn-cluster") \
    .set("spark.driver.memory", "3g") \
    .set("spark.executor.memory", "1g") \
    .set("spark.executor.cores", 2)
sc = SparkContext(conf=sconf)
sqlContext = HiveContext(sc)

## CREATE MAIN DATAFRAME
hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")

## SAVE HBASE TABLE AS "REAL" HIVE TABLE
hbase_utenti_filtered_DF = hbase_utenti_DF \
    .select(hbase_utenti_DF["USERID"].alias("HHUTE_USERID"),
            hbase_utenti_DF["RIVENDITORE"].alias("HHUTE_RIVENDITORE"),
            hbase_utenti_DF["DATA_COLL"].alias("HHUTE_DATA_COLL"),
            hbase_utenti_DF["CREDITO"].alias("HHUTE_CREDITO"),
            hbase_utenti_DF["VIRTUAL"].alias("HHUTE_VIRTUAL"),
            hbase_utenti_DF["POST"].alias("HHUTE_POST")) \
    .filter("((HHUTE_RIVENDITORE = 1) and (HHUTE_DATA_COLL='2016-12-30T00:00:00Z'))") \
    .saveAsTable('msgnet.analisi_voipout_utenti_credito_2016_rif')
Please note that if I run the above Python script via "spark-submit", or if I start the PySpark shell and paste the code into it, IT WORKS smoothly.
The problem arises when I try to schedule an Oozie Spark Action to do the job. In that case, I get the following exception:
...
2017-02-05 10:47:52,106 ERROR [Thread-8] hive.log (MetaStoreUtils.java:getDeserializer(387)) - error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
...
  hbase_utenti_DF = sqlContext.table("msgnet.hbase_utenti")
  File "/hdp01/yarn/nm/usercache/msgnet/appcache/application_1486053545979_0053/container_1486053545979_0053_02_000001/pyspark.zip/pyspark/sql/context.py", line 593, in table
---
py4j.protocol.Py4JJavaError: An error occurred while calling o56.table.
: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler
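To compare what the driver actually sees in the two launch modes (spark-submit vs. Oozie), one thing I can do is dump the JVM classpath from inside the script, right after the SparkContext is created. A minimal debugging sketch (a plain py4j call into the driver JVM; nothing here is specific to my job):

## Hedged debugging sketch: print the classpath of the driver JVM so the
## spark-submit run can be diffed against the Oozie-launched run.
## Assumes `sc` is the SparkContext already created in the script above.
driver_classpath = sc._jvm.java.lang.System.getProperty("java.class.path")
for entry in driver_classpath.split(":"):
    print(entry)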
Now, I'm pretty sure the whole problem boils down to the required jars not being included. But I have tried repeatedly to add paths and variables to my Oozie Spark Action, with no success. What I have tried includes the following:
- Adding the following "Hadoop Properties" to my Workflow Settings (gear icon at the top right of the workflow in Hue; see also the sketch just after this list):
spark.executor.extraClassPath   /opt/cloudera/parcels/CDH/lib/hive/lib/*
spark.executor.extraClassPath   /opt/cloudera/parcels/CDH/lib/hbase/lib/*
- Adding the following to my Spark Action's options list:
--files hdfs:///user/<XXX>/hive-site.xml, hdfs:///user/<XXX>/hive-hbase-handler-1.1.0-cdh5.7.0.jar, hdfs:///user/<XXX>/hbase-client-1.2.0-cdh5.7.0.jar, hdfs:///user/<XXX>/hbase-server-1.2.0-cdh5.7.0.jar, hdfs:///user/<XXX>/hbase-hadoop-compat-1.2.0-cdh5.7.0.jar, hdfs:///user/<XXX>/hbase-hadoop2-compat-1.2.0-cdh5.7.0.jar, hdfs:///user/<XXX>/metrics-core-2.2.0.jar
--driver-class-path /etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/*
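Regarding the two spark.executor.extraClassPath entries above: both lines use the same property name, so if duplicate keys collide, only one of the two values may actually be applied. A minimal sketch of setting the combined classpath from inside the script instead (the parcel paths are the ones from my cluster; I have not verified that this alone is sufficient):

## Hedged sketch: put both library directories in a single
## spark.executor.extraClassPath value. Executor settings made on the
## SparkConf before the SparkContext is created are picked up when the
## executors launch; the equivalent driver-side setting would arrive too
## late at this point, so it is omitted here.
from pyspark import SparkConf, SparkContext

sconf = SparkConf().setAppName("ChurnVoip") \
    .setMaster("yarn-cluster") \
    .set("spark.executor.extraClassPath",
         "/opt/cloudera/parcels/CDH/lib/hive/lib/*:"
         "/opt/cloudera/parcels/CDH/lib/hbase/lib/*")
sc = SparkContext(conf=sconf)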
Can anybody please shed some light on what I'm missing here? I'm really out of ideas after a lot of trial and error.
Thanks for any insight!
Created on 02-05-2017 04:02 PM - edited 02-05-2017 04:05 PM
Update: by adding the following comma-separated list of jars to the Spark Action's "Options List", I've been able to get past the Java exceptions:
--jars hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hive-hbase-handler-1.1.0-cdh5.7.0.jar,hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hbase-server-1.2.0-cdh5.7.0.jar,hdfs:///user/xxx/ETL/SCRIPTS/SPARK/hbase-client-1.2.0-cdh5.7.0.jar
Now I'm facing another problem. The workflow still fails, and in the Spark Action's log I see the following error:
Traceback (most recent call last):
  File "AnalisiVoipOut1.py", line 42, in <module>
...
pyspark.sql.utils.AnalysisException: u'path hdfs://myhostname.mydomain.it:8020/user/hive/warehouse/xxx.db/analisi_voipout_utenti_credito_2016_rif already exists.;'
2017-02-05 22:46:51,405 ERROR [Driver] yarn.ApplicationMaster (Logging.scala:logError(74)) - User application exited with status 1
Now, I can confirm that even if I change the table's name on every run (so THAT PATH DOES NOT EXIST), I still get this error every time I run the workflow.
I also thought that, since I was using 4 executors, they might for some strange reason be conflicting with each other while creating the output directory on HDFS. So I also made several attempts with just one executor, but with no success.
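One more variation I plan to try, in case the "already exists" path is left behind by an earlier failed run: making the write mode explicit through the DataFrameWriter API instead of the plain saveAsTable() call. A minimal sketch (column list abbreviated; I have not confirmed this changes anything under Oozie):

## Hedged sketch: keep the select/filter result as a DataFrame (instead of
## chaining saveAsTable directly) and write it with an explicit overwrite
## mode, so a directory left behind by a previous failed run cannot block
## the job. `hbase_utenti_DF` is the DataFrame from the script above.
filtered_df = hbase_utenti_DF \
    .select(hbase_utenti_DF["USERID"].alias("HHUTE_USERID"),
            hbase_utenti_DF["RIVENDITORE"].alias("HHUTE_RIVENDITORE"),
            hbase_utenti_DF["DATA_COLL"].alias("HHUTE_DATA_COLL")) \
    .filter("HHUTE_RIVENDITORE = 1 and HHUTE_DATA_COLL = '2016-12-30T00:00:00Z'")

filtered_df.write \
    .mode("overwrite") \
    .saveAsTable("msgnet.analisi_voipout_utenti_credito_2016_rif")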
This "HBaseStorageHandler" thing is really puzzling me, can anybody help me in getting past this last error?
Thanks
Created 02-05-2017 09:45 PM
Created 02-06-2017 10:58 AM
Hi, thanks for your reply. I have tried creating the table first, and indeed the overall behaviour changed. Now I get a different exception, which I'll paste just below.
The strange thing is that the class referenced in the exception, "ClientBackoffPolicyFactory", IS present: its jar is among those passed via "--jars", as detailed in my earlier posts above.
Here is the main excerpt from the error stack:
2017-02-06 19:31:51,426 WARN [Thread-8] ipc.RpcControllerFactory (RpcControllerFactory.java:instantiate(78)) - Cannot load configured "hbase.rpc.controllerfactory.class" (org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory) from hbase-site.xml, falling back to use default RpcControllerFactory
2017-02-06 19:31:51,431 ERROR [Thread-8] datasources.InsertIntoHadoopFsRelation (Logging.scala:logError(95)) - Aborting job.
java.io.IOException: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.reflect.InvocationTargetException
...
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy
...
2017-02-06 19:31:51,538 ERROR [Driver] yarn.ApplicationMaster (Logging.scala:logError(74)) - User application exited with status 1
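Before digging into the config, one small check I can run from the PySpark shell is whether the driver-side JVM can actually load the class the job says is missing. A minimal diagnostic sketch (`sc` is the SparkContext; the class name is copied from the stack trace above):

## Hedged diagnostic sketch: ask the driver JVM (via py4j) whether it can load
## the class the stack trace complains about. If this fails in the
## Oozie-launched driver but succeeds under spark-submit, the jar is reaching
## one classpath but not the other.
missing = "org.apache.hadoop.hbase.client.backoff.ClientBackoffPolicyFactory$NoBackoffPolicy"
try:
    sc._jvm.java.lang.Class.forName(missing)
    print("driver JVM can load: " + missing)
except Exception as err:
    print("driver JVM can NOT load: " + missing)
    print(err)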
Given that "hbase-site.xml" is mentioned in the error stack, I'm also pasting that file just below:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property><name>hbase.rootdir</name><value>hdfs://xxx01.yyy.it:8020/hbase</value></property>
  <property><name>hbase.replication</name><value>true</value></property>
  <property><name>hbase.client.write.buffer</name><value>2097152</value></property>
  <property><name>hbase.client.pause</name><value>100</value></property>
  <property><name>hbase.client.retries.number</name><value>35</value></property>
  <property><name>hbase.client.scanner.caching</name><value>100</value></property>
  <property><name>hbase.client.keyvalue.maxsize</name><value>10485760</value></property>
  <property><name>hbase.ipc.client.allowsInterrupt</name><value>true</value></property>
  <property><name>hbase.client.primaryCallTimeout.get</name><value>10</value></property>
  <property><name>hbase.client.primaryCallTimeout.multiget</name><value>10</value></property>
  <property><name>hbase.regionserver.thrift.http</name><value>false</value></property>
  <property><name>hbase.thrift.support.proxyuser</name><value>false</value></property>
  <property><name>hbase.rpc.timeout</name><value>60000</value></property>
  <property><name>hbase.snapshot.enabled</name><value>true</value></property>
  <property><name>hbase.snapshot.master.timeoutMillis</name><value>60000</value></property>
  <property><name>hbase.snapshot.region.timeout</name><value>60000</value></property>
  <property><name>hbase.snapshot.master.timeout.millis</name><value>60000</value></property>
  <property><name>hbase.security.authentication</name><value>simple</value></property>
  <property><name>hbase.rpc.protection</name><value>authentication</value></property>
  <property><name>zookeeper.session.timeout</name><value>60000</value></property>
  <property><name>zookeeper.znode.parent</name><value>/hbase</value></property>
  <property><name>zookeeper.znode.rootserver</name><value>root-region-server</value></property>
  <property><name>hbase.zookeeper.quorum</name><value>xxx02.yyy.it,xxx01.yyy.it,xxx03.yyy.it</value></property>
  <property><name>hbase.zookeeper.property.clientPort</name><value>2181</value></property>
  <property><name>hbase.rest.ssl.enabled</name><value>false</value></property>
</configuration>
Thanks for any help... Meanwhile, I'll go on testing things, but maybe I'll try to find a workaround and do something completely different. This is starting to be a bit too much for my skills / patience :)