
Read HBase Table by using Spark/Scala

Contributor

Please show me the code to read an HBase table in two ways:

a) in the Spark/Scala REPL, and

b) in a Scala project with build.sbt.

Let's say the table DDL and DML are as follows (from Tutorialspoint):

create 'emp', 'personal data', 'professional data'

put 'emp','1','personal data:name','raju'

put 'emp','1','personal data:city','hyderabad'

put 'emp','1','professional data:designation','manager'

put 'emp','1','professional data:salary','50000'

I have no problem doing this in Java (with Maven); however, I am having difficulty doing it in Spark/Scala because the required objects cannot be found. Can anybody shed some light on it?

09/07/16:

I read " SPARK-ON-HBASE: DATAFRAME BASED HBASE CONNECTOR" ( Github ) and saw the parameters for running spark-shell. In addition, I referred the example Scala code in 80.3 of "Apache HBase Reference Guide", I was able to solve it. Please see attached if you want to have a quick start. Also, I am anxious to try HBase Connector with HDP 2.5. Thank you.scala-hortonworks-community.zip

6 REPLIES

Super Guru

@Charles Chen

Well, I cannot write the code for you, but have you looked at this? Everything you are looking for is available at the link below.

https://hbase.apache.org/book.html#spark

Contributor

Yes, I tried it (section 80, Scala), but in vain. It's a good book, though. Thanks.

Contributor

I found this link: How to run spark job to interact with secured HBase cluster (https://community.hortonworks.com/articles/48988/how-to-run-spark-job-to-interact-with-secured-hbas.html), followed the instructions to set up and run the smoketest, and got the error: Exception in thread "main" java.io.FileNotFoundException: File file:/usr/hdp/current/hbase-client/lib/guava*.jar does not exist. I found the command and example for my original question, but it needs some final touches. Can anybody shed some light on it?

I checked my VM (HDP_2.4_vmware_v3), and the jar file /usr/hdp/current/hbase-client/lib/guava-12.0.1.jar is there.

./bin/spark-submit --class org.apache.spark.examples.HBaseTest --master yarn-cluster --num-executors 2 --driver-memory 512m --executor-memory 512m --executor-cores 1 --jars  /usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-server.jar,/usr/hdp/current/hbase-client/lib/guava*.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar,/usr/hdp/current/hbase-client/lib/htrace-core*.jar  --files conf/hbase-site.xml ./lib/spark-examples*.jar ambarismoketest 
16/08/07 16:22:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/07 16:22:13 INFO TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
16/08/07 16:22:13 INFO RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/192.168.132.140:8050
16/08/07 16:22:14 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/08/07 16:22:15 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/08/07 16:22:15 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2250 MB per container)
16/08/07 16:22:15 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/08/07 16:22:15 INFO Client: Setting up container launch context for our AM
16/08/07 16:22:15 INFO Client: Setting up the launch environment for our AM container
16/08/07 16:22:15 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/08/07 16:22:15 INFO Client: Preparing resources for our AM container
16/08/07 16:22:15 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/08/07 16:22:15 INFO Client: Source and destination file systems are the same. Not copying hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/08/07 16:22:15 INFO Client: Uploading resource file:/usr/hdp/2.4.0.0-169/spark/lib/spark-examples-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar -> hdfs://sandbox.hortonworks.com:8020/user/root/.sparkStaging/application_1470585857897_0001/spark-examples-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar
16/08/07 16:22:18 INFO Client: Uploading resource file:/usr/hdp/current/hbase-client/lib/hbase-client.jar -> hdfs://sandbox.hortonworks.com:8020/user/root/.sparkStaging/application_1470585857897_0001/hbase-client.jar
16/08/07 16:22:18 INFO Client: Uploading resource file:/usr/hdp/current/hbase-client/lib/hbase-common.jar -> hdfs://sandbox.hortonworks.com:8020/user/root/.sparkStaging/application_1470585857897_0001/hbase-common.jar
16/08/07 16:22:18 INFO Client: Uploading resource file:/usr/hdp/current/hbase-client/lib/hbase-server.jar -> hdfs://sandbox.hortonworks.com:8020/user/root/.sparkStaging/application_1470585857897_0001/hbase-server.jar
16/08/07 16:22:18 INFO Client: Uploading resource file:/usr/hdp/current/hbase-client/lib/guava*.jar -> hdfs://sandbox.hortonworks.com:8020/user/root/.sparkStaging/application_1470585857897_0001/guava*.jar
16/08/07 16:22:18 INFO Client: Deleting staging directory .sparkStaging/application_1470585857897_0001
Exception in thread "main" java.io.FileNotFoundException: File file:/usr/hdp/current/hbase-client/lib/guava*.jar does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
        at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:317)
        at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$distribute$1(Client.scala:407)
        at org.apache.spark.deploy.yarn.Client$anonfun$prepareLocalResources$6$anonfun$apply$3.apply(Client.scala:471)
        at org.apache.spark.deploy.yarn.Client$anonfun$prepareLocalResources$6$anonfun$apply$3.apply(Client.scala:470)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at org.apache.spark.deploy.yarn.Client$anonfun$prepareLocalResources$6.apply(Client.scala:470)
        at org.apache.spark.deploy.yarn.Client$anonfun$prepareLocalResources$6.apply(Client.scala:468)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:468)
        at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:722)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:142)
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1065)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1125)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
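
(Note: a likely cause, not confirmed in this thread, is that the shell never expands the guava*.jar and htrace-core*.jar wildcards because they sit inside a single comma-separated --jars value, so the YARN client tries to upload a file literally named guava*.jar. Spelling out the actual file names instead, e.g. the /usr/hdp/current/hbase-client/lib/guava-12.0.1.jar path mentioned above, and likewise the full name of the htrace-core jar, should get past this FileNotFoundException.)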


@Charles Chen Please see this blog on how to use HBase with Spark on HDP 2.4.x+.

Contributor

Thank you.

New Contributor

Hi, this is my Scala code to read HBase tables. It works with the latest HBase version, 1.1.2.2.6.4.0-91 (HDP 2.6.4, Ambari 2.6.1).

The key parameter is:

conf.set("zookeeper.znode.parent", "/hbase-unsecure")

because under the default znode parent (/hbase) ZooKeeper does not hold the HBase master details; on an unsecured HDP cluster they live under /hbase-unsecure.

Check:

# /usr/hdp/2.6.4.0-91/zookeeper/bin/zkCli.sh -server <server>.hortonworks.com:2181

# ls /hbase-unsecure/master returns [] (the znode exists; the master address is stored in its data, not in child nodes)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Connection
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.TableName

object Hbase {
  val conf: Configuration = HBaseConfiguration.create()
  def main(args: Array[String]): Unit = {
    //conf.set("hbase.master", "<server>.hortonworks.com" + ":" + "60000")
    conf.setInt("timeout", 120000)
    conf.set("hbase.zookeeper.quorum", "<server>.hortonworks.com")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure") // IMPORTANT: the default is /hbase
    conf.setInt("hbase.client.scanner.caching", 10000)
    val connection: Connection = ConnectionFactory.createConnection(conf)
    // Handle to an existing table (not used further in this listing).
    val table = connection.getTable(TableName.valueOf("trading"))
    print("connection created")
    val admin = connection.getAdmin
    // List the tables.
    val listtables = admin.listTables()
    listtables.foreach(println)
    connection.close()
  }
}

Result:

'trading', {NAME => 'ca', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
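
To read actual rows rather than just listing the table descriptors, a possible extension of the listing above (a sketch that reuses the same connection, not part of the original code) is a Scan over the 'trading' table and its 'ca' column family:

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Reuses `connection` and the TableName import from the listing above.
val trading = connection.getTable(TableName.valueOf("trading"))
val scanner = trading.getScanner(new Scan().addFamily(Bytes.toBytes("ca")))
try {
  for (result <- scanner.iterator().asScala) {
    val row = Bytes.toString(result.getRow)
    // Print every cell in the 'ca' family for this row.
    result.rawCells().foreach { cell =>
      val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
      val value = Bytes.toString(CellUtil.cloneValue(cell))
      println(s"$row  ca:$qualifier = $value")
    }
  }
} finally {
  scanner.close()
  trading.close()
}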