
Spark HBase Connector (SHC) job fails to connect to ZooKeeper, causing connection failure to HBase

Contributor
spark-shell.txt

Hello,

I am trying to run a basic piece of code using the SHC connector. It is a connector apparently provided by Hortonworks (it is in their GitHub, at least) that conveniently allows inserting and querying data in HBase. The code, reworked from the project's example, builds a DataFrame of fake data and tries to insert it via the connector, as sketched below.
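For reference, a minimal sketch of that kind of write path, adapted from the SHC examples (the table name, catalog fields and fake data here are placeholders, not the exact code from the attached spark-shell.txt):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
import sqlContext.implicits._   // sqlContext is predefined in spark-shell

// Placeholder catalog: a "person" table with the row key plus one "info" column family
def catalog = s"""{
  |"table":{"namespace":"default", "name":"person"},
  |"rowkey":"key",
  |"columns":{
  |"id":{"cf":"rowkey", "col":"key", "type":"string"},
  |"name":{"cf":"info", "col":"name", "type":"string"}
  |}
|}""".stripMargin

case class Person(id: String, name: String)

// Build a DataFrame of fake data and write it through the connector;
// "newTable" -> "5" asks SHC to create the table with 5 regions if it does not exist.
val df = sc.parallelize(1 to 10).map(i => Person(s"row$i", s"name$i")).toDF()
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()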

This is the hbase configuration:

screenshot-from-2017-01-31-13-47-21.png

The code is run in a spark-shell launched with the following command line:

spark-shell --master yarn \
            --deploy-mode client \
            --name "hive2hbase" \
            --repositories "http://repo.hortonworks.com/content/groups/public/" \
            --packages "com.hortonworks:shc-core:1.0.1-1.6-s_2.10" \
            --files "/usr/hdp/current/hbase-client/conf/hbase-site.xml,/usr/hdp/current/hive-client/conf/hive-site.xml" \
            --jars /usr/hdp/current/phoenix-client/phoenix-server.jar \
            --driver-memory 1G \
            --executor-memory 1500m \
            --num-executors 8

The spark-shell log tells me that it correctly loads the hbase-site.xml and hive-site.xml files. I also checked that the ZooKeeper quorum configuration is correct in the HBase configuration. However, the ZooKeeper clients fail to connect because they try quorum=localhost:2181 instead of the addresses of one of the three ZooKeeper nodes.

As a consequence, it also fails to give me the HBase connection that is needed.

Note: I already tried deleting the HBase-related znode (/hbase-unsecure) from the ZooKeeper command line and restarting ZooKeeper so that it would rebuild it, but this did not work either.

Thanks for any help that may be provided

1 ACCEPTED SOLUTION

Contributor

Samuel,

Just for the sake of narrowing down the issue, can you add the hbase-site.xml and hive-site.xml to SPARK_CLASSPATH and retry?
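For example, something along these lines before launching the shell (a sketch only; the conf paths are taken from the HDP layout in the original command and may differ on your cluster):

# Put the HBase and Hive client conf directories on Spark's classpath
export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/conf:/usr/hdp/current/hive-client/conf:$SPARK_CLASSPATH
spark-shell --master yarn --deploy-mode client ...   # rest of the options as in the original command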

View solution in original post

15 REPLIES

Contributor

Samuel,

Have you tried explicitly exporting the hbase-site.xml to SPARK_CLASSPATH?

Your logs show that the HBase base znode is /hbase, whereas hbase-site.xml shows that the base znode is /hbase-unsecure. This indicates that the Spark HBase connector is not looking at the correct hbase-site.xml.
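A quick way to check, from inside the running spark-shell, which configuration the driver actually picks up (a diagnostic sketch only; it relies on nothing beyond the standard HBase client API that SHC already pulls in):

scala> getClass.getClassLoader.getResource("hbase-site.xml")   // null means the file is not on the driver classpath

scala> import org.apache.hadoop.hbase.HBaseConfiguration
scala> val hconf = HBaseConfiguration.create()
scala> hconf.get("hbase.zookeeper.quorum")    // should list the three ZooKeeper hosts, not the default localhost
scala> hconf.get("zookeeper.znode.parent")    // should be /hbase-unsecure, not the default /hbase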

Contributor

I am going to do it. A production problem kept me away from this issue until now. It is not closed yet, and I will try your advice. Thanks.

Contributor

Thanks for your advice, it seems this is indeed the problem.

As a test I ran the SHC connector example with --master yarn-cluster and then with --master yarn-client, and that confirmed it: the quorum is found in the first case and not in the second. So the Spark driver does not have the file on its classpath when it runs in client mode.
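A variant of the SPARK_CLASSPATH approach suggested above: in client mode the conf directories can also be put on the driver's classpath directly at launch with --driver-class-path (a sketch; the paths are assumed from the HDP layout above):

spark-shell --master yarn --deploy-mode client \
            --driver-class-path /usr/hdp/current/hbase-client/conf:/usr/hdp/current/hive-client/conf \
            ...   # remaining options (--packages, --files, --jars, memory settings) as before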

Rising Star

Hi Samuel,

You probably need to copy hbase-site.xml to the /etc/spark/conf folder.

Contributor

No, this is absolutely not the problem. It does not guarantee in any way that the Spark job will take it into account. See the answer to @anatva for a proper answer to this.

Furthermore, my post indicates that the --files option is already used with the correct files passed.

New Contributor

This is the code:

ubuntu@ip-10-0-2-24:~$ spark-shell --packages com.hortonworks:shc-core:1.1.0-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/

scala> import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.{SQLContext, _}

scala> import org.apache.spark.sql.execution.datasources.hbase._

import org.apache.spark.sql.execution.datasources.hbase._

scala> import org.apache.spark.{SparkConf, SparkContext}

import org.apache.spark.{SparkConf, SparkContext}

scala> import spark.sqlContext.implicits._

import spark.sqlContext.implicits._

scala> def catalog = s"""{
       |"table":{"namespace":"default", "name":"Contacts"},
       |"rowkey":"key",
       |"columns":{
       |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
       |"officeAddress":{"cf":"Office", "col":"Address", "type":"string"},
       |"officePhone":{"cf":"Office", "col":"Phone", "type":"string"},
       |"personalName":{"cf":"Personal", "col":"Name", "type":"string"},
       |"personalPhone":{"cf":"Personal", "col":"Phone", "type":"string"}
       |}
       |}""".stripMargin

catalog: String

scala> def withCatalog(cat: String): DataFrame = {
         spark.sqlContext
           .read
           .options(Map(HBaseTableCatalog.tableCatalog->cat))
           .format("org.apache.spark.sql.execution.datasources.hbase")
           .load()
       }

withCatalog: (cat: String)org.apache.spark.sql.DataFrame

scala> val df = withCatalog(catalog)
df: org.apache.spark.sql.DataFrame = [rowkey: string, officeAddress: string ... 3 more fields]

This is the error:

scala> df.show

java.lang.RuntimeException: java.lang.NullPointerException
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
  at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
  at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
  at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
  at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
  at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
  at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
  at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
  at org.apache.hadoop.hbase.client.MetaScanner.listTableRegionLocations(MetaScanner.java:343)
  at org.apache.hadoop.hbase.client.HRegionLocator.listRegionLocations(HRegionLocator.java:142)
  at org.apache.hadoop.hbase.client.HRegionLocator.getStartEndKeys(HRegionLocator.java:118)
  at org.apache.spark.sql.execution.datasources.hbase.RegionResource$anonfun$1.apply(HBaseResources.scala:109)
  at org.apache.spark.sql.execution.datasources.hbase.RegionResource$anonfun$1.apply(HBaseResources.scala:108)
  at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:77)
  at org.apache.spark.sql.execution.datasources.hbase.RegionResource.releaseOnException(HBaseResources.scala:88)
  at org.apache.spark.sql.execution.datasources.hbase.RegionResource.<init>(HBaseResources.scala:108)
  at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:61)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:314)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$collectFromPlan(Dataset.scala:2861)
  at org.apache.spark.sql.Dataset$anonfun$head$1.apply(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset$anonfun$head$1.apply(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset$anonfun$55.apply(Dataset.scala:2842)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:637)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:596)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:605)
  ... 54 elided
Caused by: java.lang.NullPointerException
  at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:395)
  at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:553)
  at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
  at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1185)
  at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1152)
  at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
  at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
  at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
  ... 103 more