RDD-based Spark-HBase Connector in HDP 2.6


Expert Contributor

I just found this article in the HDP documentation:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_spark-component-guide/content/connectors...

It describes two different connectors for reading and writing HBase data from Spark. The Hortonworks Spark-HBase Connector works only with a fixed schema, while the RDD-based Spark-HBase Connector should be the right choice when working with RDDs on varying data whose schema is undefined at ingestion time.
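For context, a minimal sketch of how the RDD-based connector is typically used, via the `HBaseContext` API from the `hbase-spark` module (table, column family, and column names here are hypothetical; this is a sketch, not a tested program):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object BulkPutExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-spark-sketch"))
    val hbaseConf = HBaseConfiguration.create()
    val hbaseContext = new HBaseContext(sc, hbaseConf)

    // No fixed schema is required: each record is mapped to a Put on the fly.
    val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
    hbaseContext.bulkPut[(String, String)](
      rdd,
      TableName.valueOf("my_table"), // hypothetical table name
      { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        put
      }
    )
  }
}
```

This is what makes the RDD-based connector attractive for schema-less ingestion: the record-to-Put mapping function is arbitrary code, so it can handle data whose structure varies from record to record.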

However, since I'm running the current HDP 2.6.3 installation on my cluster, HBase 1.1.2 is installed. When I searched for the RDD-based connector, I only found it for HBase version 2.0.0:

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-spark</artifactId>
  <version>2.0.0-alpha4</version>
</dependency>

Now my question: is there a version of this RDD-based connector for HBase 1.1.2 as well? And if not: why is this connector listed in the HDP 2.6.2 / 2.6.3 documentation?

UPDATE:

I just downloaded the matching HBase repository from the Hortonworks GitHub here: https://github.com/hortonworks/hbase-release/archive/HDP-2.6.3.0-235-tag.zip and tried to change the Spark version in hbase-spark/pom.xml. This doesn't work either; I'm getting this failure when running mvn clean install -DskipTests:

[INFO] ------------------------------------------------------------------------
[INFO] Building Apache HBase - Spark 1.1.2.2.6.3.0-235
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hbase-spark ---
[INFO] Deleting /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/target
[INFO]
[INFO] --- build-helper-maven-plugin:1.9.1:add-source (add-source) @ hbase-spark ---
[INFO] Source directory: /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/scala added.
[INFO]
[INFO] --- build-helper-maven-plugin:1.9.1:add-test-source (add-test-source) @ hbase-spark ---
[INFO] Test Source directory: /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/test/scala added.
[INFO]
[INFO] --- maven-enforcer-plugin:1.0.1:enforce (enforce) @ hbase-spark ---
[INFO]
[INFO] --- maven-enforcer-plugin:1.0.1:enforce (banned-jsr305) @ hbase-spark ---
[INFO]
[INFO] --- buildnumber-maven-plugin:1.3:create-timestamp (default) @ hbase-spark ---
[INFO]
[INFO] --- jacoco-maven-plugin:0.6.2.201302030002:prepare-agent (prepare-agent) @ hbase-spark ---
[INFO] Skipping JaCoCo execution
[INFO] argLine set to
[INFO]
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ hbase-spark ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ hbase-spark ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.0:add-source (scala-compile-first) @ hbase-spark ---
[INFO]
[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ hbase-spark ---
[WARNING]  Expected all dependencies to require Scala version: 2.10.5
[WARNING]  org.apache.hbase:hbase-spark:1.1.2.2.6.3.0-235 requires scala version: 2.10.5
[WARNING]  org.apache.spark:spark-streaming_2.10:1.6.3 requires scala version: 2.10.5
[WARNING]  org.apache.spark:spark-streaming_2.10:1.6.3 requires scala version: 2.10.5
[WARNING]  org.scalatest:scalatest_2.10:2.2.4 requires scala version: 2.10.4
[WARNING] Multiple versions of scala libraries detected!
[INFO] /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/java:-1: info: compiling
[INFO] /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/scala:-1: info: compiling
[INFO] Compiling 52 source files to /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/target/classes at 1513759118588
[WARNING] /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/DefaultSource.scala:26: warning: imported `Logging' is permanently hidden by definition of object Logging in package spark
[WARNING] import org.apache.hadoop.hbase.spark.Logging
[WARNING]                                      ^
[WARNING] /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseConnectionCache.scala:28: warning: imported `Logging' is permanently hidden by definition of object Logging in package spark
[WARNING] import org.apache.hadoop.hbase.spark.Logging
[WARNING]                                      ^
[WARNING] /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/HBaseContext.scala:41: warning: imported `Logging' is permanently hidden by definition of object Logging in package spark
[WARNING] import org.apache.hadoop.hbase.spark.Logging
[WARNING]                                      ^
[ERROR] /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/JavaHBaseContext.scala:113: error: type mismatch;
[ERROR]  found   : Iterable[R]
[ERROR]  required: java.util.Iterator[?]
[ERROR]         asScalaIterator(iter)
[ERROR]                         ^
[ERROR] /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/scala/org/apache/spark/sql/datasources/hbase/DataTypeParserWrapper.scala:20: error: object parser is not a member of package org.apache.spark.sql.catalyst
[ERROR] import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
[ERROR]                                      ^
[ERROR] /home/dmueller/hbase-release-HDP-2.6.3.0-235-tag/hbase-spark/src/main/scala/org/apache/spark/sql/datasources/hbase/DataTypeParserWrapper.scala:34: error: not found: value CatalystSqlParser
[ERROR]   def parse(dataTypeString: String): DataType = CatalystSqlParser.parseDataType(dataTypeString)
[ERROR]                                                 ^
[WARNING] three warnings found
[ERROR] three errors found
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache HBase ...................................... SUCCESS [2.159s]
[INFO] Apache HBase - Checkstyle ......................... SUCCESS [0.449s]
[INFO] Apache HBase - Resource Bundle .................... SUCCESS [0.216s]
[INFO] Apache HBase - Annotations ........................ SUCCESS [0.752s]
[INFO] Apache HBase - Protocol ........................... SUCCESS [11.209s]
[INFO] Apache HBase - Common ............................. SUCCESS [4.791s]
[INFO] Apache HBase - Procedure .......................... SUCCESS [0.856s]
[INFO] Apache HBase - Client ............................. SUCCESS [3.999s]
[INFO] Apache HBase - Hadoop Compatibility ............... SUCCESS [0.302s]
[INFO] Apache HBase - Hadoop Two Compatibility ........... SUCCESS [0.771s]
[INFO] Apache HBase - Prefix Tree ........................ SUCCESS [0.763s]
[INFO] Apache HBase - Server ............................. SUCCESS [18.055s]
[INFO] Apache HBase - Testing Util ....................... SUCCESS [0.807s]
[INFO] Apache HBase - Thrift ............................. SUCCESS [4.088s]
[INFO] Apache HBase - Rest ............................... SUCCESS [2.092s]
[INFO] Apache HBase - RSGroup ............................ SUCCESS [0.822s]
[INFO] Apache HBase - Shell .............................. SUCCESS [0.612s]
[INFO] Apache HBase - Integration Tests .................. SUCCESS [1.487s]
[INFO] Apache HBase - Examples ........................... SUCCESS [0.834s]
[INFO] Apache HBase - Spark .............................. FAILURE [8.847s]
[INFO] Apache HBase - Assembly ........................... SKIPPED
[INFO] Apache HBase - Shaded ............................. SKIPPED
[INFO] Apache HBase - Shaded - Client .................... SKIPPED
[INFO] Apache HBase - Shaded - Server .................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:04.951s
[INFO] Finished at: Wed Dec 20 09:38:45 CET 2017
[INFO] Final Memory: 198M/3389M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project hbase-spark: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hbase-spark

I already tried to build HBase with different Spark and Scala versions; it only works with these settings (in ./hbase-spark/pom.xml):

<properties>
        <spark.version>2.1.1</spark.version>
        <scala.version>2.11.8</scala.version>
        <scala.binary.version>2.11</scala.binary.version>
        <surefire.skipSecondPart>true</surefire.skipSecondPart>
        <top.dir>${project.basedir}/..</top.dir>
        <avro.version>1.7.6</avro.version>
        <avro.mapred.classifier></avro.mapred.classifier>
    </properties>

Re: RDD-based Spark-HBase Connector in HDP 2.6

@Daniel Müller: It is available in the Hortonworks repo.

Add the Hortonworks repo to your pom.xml: http://repo.hortonworks.com/content/repositories/releases

And set the version in your dependency to 1.1.2.2.6.3.0-235 (a combination of the HBase and HDP versions).

A reference pom is at: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_spark-component-guide/content/using-spar...
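Putting the repository and dependency together, the pom.xml entries would look roughly like this (the URL and version string are taken from the advice above; treat this as a sketch, not a verified build file):

```xml
<repositories>
  <repository>
    <id>hortonworks-releases</id>
    <url>http://repo.hortonworks.com/content/repositories/releases</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>1.1.2.2.6.3.0-235</version>
  </dependency>
</dependencies>
```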

Re: RDD-based Spark-HBase Connector in HDP 2.6

Expert Contributor

Perfect, thank you so much for your answer! Now I'm getting an exception, caused by the Scala and Spark versions I guess.

The pom.xml in http://repo.hortonworks.com/content/repositories/releases/org/apache/hbase/hbase-spark/1.1.2.2.6.3.0... sets the wrong Spark and Scala versions:

<spark.version>2.1.1</spark.version>
<scala.version>2.11.8</scala.version>

How can I get the JAR for my Spark (1.6.3) and Scala (2.10) versions? Thank you!

Re: RDD-based Spark-HBase Connector in HDP 2.6

@Daniel Müller The Spark version was upgraded to 2.x per this JIRA: https://issues.apache.org/jira/browse/HBASE-16179

I suggest using Spark 2.x rather than 1.6.

If you really have a hard requirement on Spark 1.6, you may need to change the version and build hbase-spark manually.

Re: RDD-based Spark-HBase Connector in HDP 2.6

Expert Contributor

@Sandeep Nemuri I need to use Spark 1.6. I updated my question, could you please have a look? Thank you!

Re: RDD-based Spark-HBase Connector in HDP 2.6

New Contributor

I had the same problem.

- Download the HBase code base from https://github.com/apache/hbase

- Build the whole project using the command mvn clean install -DskipTests

- Update the pom.xml file of the hbase-spark module as below and build only that module. The JAR built from this module can be used with Spark 1.6.3 and Scala 2.10.

<spark.version>1.6.3</spark.version>
<scala.version>2.10.5</scala.version>
<scala.binary.version>2.10</scala.binary.version>
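The steps above can be sketched as shell commands (the `-pl` module-only invocation is an assumption about how you scope the second build; adjust paths to your checkout):

```shell
# Clone and build the whole HBase project first, skipping tests
git clone https://github.com/apache/hbase.git
cd hbase
mvn clean install -DskipTests

# After editing hbase-spark/pom.xml (spark.version 1.6.3, Scala 2.10),
# rebuild only the hbase-spark module
mvn clean install -DskipTests -pl hbase-spark
```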

Re: RDD-based Spark-HBase Connector in HDP 2.6

Expert Contributor

@Kalmesh Sambrani Thank you for the fast answer. Which HBase version should I download? I just tried the current master branch, but it doesn't seem to work for my Spark app. When I look into the HBase 1.1 branch, there's no "hbase-spark" directory!

Re: RDD-based Spark-HBase Connector in HDP 2.6

New Contributor

You can download from the master branch (which has version 3.0.0-SNAPSHOT). The hbase-spark JAR built with the above-mentioned pom changes worked fine for me on Spark 1.6.3 and HBase versions 1.2.6 as well as 1.1.2.

As mentioned in my answer above, first build the whole HBase project as-is. Then update the pom of the hbase-spark module alone and build only that module. Let me know if you face any issues while building.

Re: RDD-based Spark-HBase Connector in HDP 2.6

Expert Contributor

@Kalmesh Sambrani I updated my question, could you please have a look? Thank you!
