Created 06-29-2017 02:00 PM
Hi,
I have an HBase table with some records that don't have a fixed set of columns - and I want to read/process those records with Spark. At the time of reading the data from HBase it is not possible to list/name the columns that I need.
Currently I read the data as following:
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/usr/hdp/current/hbase-client/conf/hbase-site.xml"))
conf.set("hbase.zookeeper.quorum", "xxx")
conf.set(TableInputFormat.INPUT_TABLE, "TEST_TABLE")

val hBaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

hBaseRDD.count()
This works - but it is super slow! On a test HBase table with just 16 million records (each having only one column), the count takes 4.5 minutes. (The test table is spread across only 2 HBase regions, so the read is handled by just 2 Spark executors.)
Do you have an idea/suggestion to accelerate the reading of the data?
Unfortunately I have not found an alternative to sc.newAPIHadoopRDD() for reading the data, as all the HBase-Spark connectors I came across seem to require the "schema" to be provided beforehand.
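For reference, one knob that often dominates TableInputFormat scan time is the client scanner caching (rows returned per scanner RPC). The sketch below attaches a tuned Scan to the configuration from the snippet above using HBase's standard MapReduce utilities; the caching value of 500 is only an illustrative choice, not a recommendation from this thread:

```scala
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}

// Assumes `conf` is the HBaseConfiguration built in the snippet above.
val scan = new Scan()
scan.setCaching(500)           // rows fetched per scanner RPC; the default is low,
                               // so a full count pays many round trips
scan.setCacheBlocks(false)     // a one-off full scan should not evict the
                               // region servers' block cache
scan.setFilter(new FirstKeyOnlyFilter()) // for a pure count, ship only one
                                         // cell per row over the wire

// TableInputFormat picks up a serialized Scan from the configuration.
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
```

With the Scan in place, the existing sc.newAPIHadoopRDD(...) call is unchanged; only the amount of data moved per RPC differs.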
Thanks for your help in advance!
Daniel
Created 06-29-2017 06:24 PM
You may want to look at the Spark HBase connector, which is based on DataFrames and should give better performance.
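Assuming the connector meant here is the Hortonworks Spark-HBase connector (shc), a read looks roughly like the sketch below. The catalog JSON, the column family cf, and the column names are purely illustrative placeholders, not taken from this thread:

```scala
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Hypothetical catalog mapping HBase columns to DataFrame fields.
val catalog = s"""{
  "table":   {"namespace": "default", "name": "TEST_TABLE"},
  "rowkey":  "key",
  "columns": {
    "rowkey": {"cf": "rowkey", "col": "key",  "type": "string"},
    "col1":   {"cf": "cf",     "col": "col1", "type": "string"}
  }
}"""

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.count()
```

The connector can push filters and column pruning down to HBase, which is where the performance gain over a raw full scan typically comes from.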
Created 06-30-2017 06:02 AM
I already came across this connector - but it requires you to define the columns you want to read (right?), which is not an option in my case: the column names contain product IDs, for example, so there is a huge number of distinct values and they also change over time...
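For what it's worth, even without naming columns up front, the Result objects coming out of the newAPIHadoopRDD read can be unpacked cell by cell. A minimal sketch using HBase's standard CellUtil helpers, assuming hBaseRDD is the RDD from the first post:

```scala
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.util.Bytes

// Flatten each row into (rowKey, qualifier, value) triples, so dynamic
// column names (e.g. product IDs) never need to be declared in advance.
val cells = hBaseRDD.flatMap { case (_, result) =>
  val rowKey = Bytes.toString(result.getRow)
  result.rawCells().map { cell =>
    (rowKey,
     Bytes.toString(CellUtil.cloneQualifier(cell)),
     Bytes.toString(CellUtil.cloneValue(cell)))
  }
}
```

This keeps the schema-free reading path while still letting downstream Spark code work with named qualifiers per cell.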
Created 02-27-2018 10:03 AM