<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Problems using hbase-spark on CDH in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/359158#M238036</link>
    <description>&lt;P&gt;Actually, I figured it out. First, go to /etc/spark/conf.cloudera.spark_on_yarn/classpath.txt and delete the last line (which contains the path to hbase-class.jar). Then download hbase-spark-1.0.0.7.2.15.0-147.jar, and when you run spark-shell, add --jars pathToYourDownloadedJar. Finally, add .option("hbase.spark.pushdown.columnfilter", false) before loading the data, like this:&lt;/P&gt;&lt;P&gt;val sql = spark.sqlContext&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;val df = sql.read.format("org.apache.hadoop.hbase.spark").option("hbase.columns.mapping", "name STRING :key, email STRING c:email, " + "birthDate STRING p:birthDate, height FLOAT p:height").option("hbase.table", "person").option("hbase.spark.use.hbasecontext", false).option("hbase.spark.pushdown.columnfilter", false).load()&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;df.createOrReplaceTempView("personView")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;val results = sql.sql("SELECT * FROM personView where name = 'alice'")&lt;/P&gt;&lt;P&gt;results.show()&lt;/P&gt;</description>
    <pubDate>Fri, 09 Dec 2022 04:15:56 GMT</pubDate>
    <dc:creator>quangbilly79</dc:creator>
    <dc:date>2022-12-09T04:15:56Z</dc:date>
    <item>
      <title>Problems using hbase-spark on CDH</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/356333#M237257</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to use hbase-spark in order to query HBase with spark-sql, but I am stuck with one of these exceptions:&lt;/P&gt;&lt;P&gt;java.lang.NullPointerException&lt;/P&gt;&lt;P&gt;or&lt;/P&gt;&lt;P&gt;java.lang.NoSuchMethodError: org.apache.hadoop.hbase.util.ByteStringer.wrap([B)Lcom/google/protobuf/ByteString;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Contents:&lt;/P&gt;&lt;P&gt;1- Details on the platform&lt;/P&gt;&lt;P&gt;2- Details on the problem&lt;/P&gt;&lt;P&gt;3- Description of the attached content&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1- A Cloudera cluster running CDH-6.3.4&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2- I run a (POC) Java application with spark-submit in yarn cluster mode. The application sequentially does the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Create an HBase table and populate it using the Java API&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Use a SparkSession to read the HBase table (using&amp;nbsp;.format("org.apache.hadoop.hbase.spark") )&lt;/P&gt;&lt;P&gt;&amp;nbsp;- Perform some queries on the Dataframe.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For now, I have only had partial success on the last step. 
For example, I can show the contents of the dataframe with:&lt;/P&gt;&lt;PRE&gt;Dataset&amp;lt;Row&amp;gt; sqlDF1 = sqlContext.sql("SELECT * FROM census2");&lt;BR /&gt;sqlDF1.show(100, false);&lt;/PRE&gt;&lt;P&gt;But the following code fails with one of the two exceptions listed at the start of the post:&lt;/P&gt;&lt;PRE&gt;Dataset&amp;lt;Row&amp;gt; sqlDF2 = sqlContext.sql("SELECT * FROM census2 WHERE ID1 LIKE '____|001_|%'");&lt;BR /&gt;sqlDF2.show(100, false);&lt;/PRE&gt;&lt;P&gt;Concerning this exception:&lt;/P&gt;&lt;P&gt;java.lang.NoSuchMethodError: org.apache.hadoop.hbase.util.ByteStringer.wrap([B)Lcom/google/protobuf/ByteString;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I see that the class is provided by the hbase-protocol project, as can be seen here:&amp;nbsp;&lt;A href="https://github.com/apache/hbase/blob/rel/2.1.0/hbase-protocol/src/main/java/org/apache/hadoop/hbase/util/ByteStringer.java" target="_blank" rel="noopener"&gt;hbase/ByteStringer.java at rel/2.1.0 · apache/hbase · GitHub&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I included the jar with the --jars option of spark-submit, and it is also present in the uber-jar that is launched. 
So I don't see why I get this error.&lt;/P&gt;&lt;P&gt;I tried both the hbase-spark 2.1.0-cdh6.3.4 Maven Central dependency and an hbase-spark library that I compiled myself, but neither helped.&lt;/P&gt;&lt;P&gt;I also tried to add this to my SparkSession:&lt;/P&gt;&lt;PRE&gt;//                .config("spark.jars", "hbase-spark-1.0.0.jar:hbase-protocol-2.1.0.jar")&lt;/PRE&gt;&lt;P&gt;But then I get a NullPointerException and cannot even print the dataframes.&lt;/P&gt;&lt;P&gt;I also tried to add:&lt;/P&gt;&lt;PRE&gt;.option("hbase.spark.use.hbasecontext", false)&lt;/PRE&gt;&lt;P&gt;when reading the dataframe (as I found someone suggesting that), but it did not help either.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;3- Description of the attached content&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;- Main.java.txt =&amp;gt; the code of the sample application&lt;/P&gt;&lt;P&gt;- launcher.sh.txt =&amp;gt; the bash code used to launch the application&lt;/P&gt;&lt;P&gt;- jars_and_classpaths.txt =&amp;gt; the jars passed to the --jars option, as well as the Java client classpath&lt;/P&gt;&lt;P&gt;- mvn_dependency_tree.txt =&amp;gt; the output of the command mvn dependency:tree&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am stuck here; could someone help me?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks a lot&lt;/P&gt;</description>
      <pubDate>Thu, 27 Oct 2022 16:42:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/356333#M237257</guid>
      <dc:creator>Jean-Luc</dc:creator>
      <dc:date>2022-10-27T16:42:47Z</dc:date>
    </item>
    <item>
      <title>Re: Problems using hbase-spark on CDH</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/356347#M237260</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/101433"&gt;@Jean-Luc&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You can try the following example code&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/rangareddy/ranga_spark_experiments/tree/master/spark_hbase_cdh_integration" target="_blank"&gt;https://github.com/rangareddy/ranga_spark_experiments/tree/master/spark_hbase_cdh_integration&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2022 03:21:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/356347#M237260</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2022-10-28T03:21:06Z</dc:date>
    </item>
    <item>
      <title>Re: Problems using hbase-spark on CDH</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/356367#M237268</link>
      <description>&lt;P&gt;Hi RangaReddy,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I looked into your solution; however, the only actions you perform to test the hbase-spark interaction are these:&lt;/P&gt;&lt;PRE&gt;employeeDf.printSchema()&lt;BR /&gt;employeeDf.show(truncate=false)&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But I already succeed at both of these actions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What I want to do is perform "advanced" SQL operations on the dataframe, namely filtering on the row keys and qualifier values.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example:&amp;nbsp;sqlContext.sql("SELECT * FROM census2 WHERE ID1 LIKE '____|001_|%'");&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Did you try that in your experiment?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I found this on the master branch of hbase-connectors/spark (&lt;A href="https://github.com/apache/hbase-connectors/tree/master/spark" target="_blank" rel="noopener"&gt;hbase-connectors/spark at master · apache/hbase-connectors · GitHub&lt;/A&gt;):&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Server-side&lt;/STRONG&gt;&amp;nbsp;(HBase region servers) configuration:&lt;/P&gt;&lt;P&gt;The following jars need to be in the CLASSPATH of the HBase region servers:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;scala-library, hbase-spark, and hbase-spark-protocol-shaded.&lt;/LI&gt;&lt;LI&gt;The server-side configuration is needed for column filter pushdown&lt;/LI&gt;&lt;LI&gt;if you cannot perform the server-side configuration, consider using&amp;nbsp;.option("hbase.spark.pushdown.columnfilter", false)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So the --jars option of spark-submit does make the jars accessible to the Spark driver and executors, but when a qualifier filter is applied, Spark must be delegating some work to the HBase region servers, so the jars need to be in the classpath of the region servers' Java processes too?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2022 08:24:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/356367#M237268</guid>
      <dc:creator>Jean-Luc</dc:creator>
      <dc:date>2022-10-28T08:24:52Z</dc:date>
    </item>
    <item>
      <title>Re: Problems using hbase-spark on CDH</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/359111#M238020</link>
      <description>&lt;P&gt;Did you solve this?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Dec 2022 08:12:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/359111#M238020</guid>
      <dc:creator>quangbilly79</dc:creator>
      <dc:date>2022-12-08T08:12:09Z</dc:date>
    </item>
    <item>
      <title>Re: Problems using hbase-spark on CDH</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/359158#M238036</link>
      <description>&lt;P&gt;Actually, I figured it out. First, go to /etc/spark/conf.cloudera.spark_on_yarn/classpath.txt and delete the last line (which contains the path to hbase-class.jar). Then download hbase-spark-1.0.0.7.2.15.0-147.jar, and when you run spark-shell, add --jars pathToYourDownloadedJar. Finally, add .option("hbase.spark.pushdown.columnfilter", false) before loading the data, like this:&lt;/P&gt;&lt;P&gt;val sql = spark.sqlContext&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;val df = sql.read.format("org.apache.hadoop.hbase.spark").option("hbase.columns.mapping", "name STRING :key, email STRING c:email, " + "birthDate STRING p:birthDate, height FLOAT p:height").option("hbase.table", "person").option("hbase.spark.use.hbasecontext", false).option("hbase.spark.pushdown.columnfilter", false).load()&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;df.createOrReplaceTempView("personView")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;val results = sql.sql("SELECT * FROM personView where name = 'alice'")&lt;/P&gt;&lt;P&gt;results.show()&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2022 04:15:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/359158#M238036</guid>
      <dc:creator>quangbilly79</dc:creator>
      <dc:date>2022-12-09T04:15:56Z</dc:date>
    </item>
    <item>
      <title>Re: Problems using hbase-spark on CDH</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/359159#M238037</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/102287"&gt;@quangbilly79&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You have used CDP&amp;nbsp;&lt;SPAN&gt;hbase-spark-1.0.0.7.2.15.0-147.jar instead of CDH. There is no guarantee it will work latest jar in CDH. Luckily for you it is worked.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2022 04:30:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Problems-using-hbase-spark-on-CDH/m-p/359159#M238037</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2022-12-09T04:30:39Z</dc:date>
    </item>
  </channel>
</rss>

