Efficiently read HBase records without schema in Spark?
Labels:
- Apache HBase
- Apache Spark
Created ‎06-29-2017 02:00 PM
Hi,
I have an HBase table whose records don't have a fixed set of columns, and I want to read/process those records with Spark. At the time of reading the data from HBase it is not possible to list/name the columns that I need.
Currently I read the data as follows:
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.{HBaseAdmin, Put, HTable, Result}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/usr/hdp/current/hbase-client/conf/hbase-site.xml"))
conf.set("hbase.zookeeper.quorum", "xxx")
conf.set(TableInputFormat.INPUT_TABLE, "TEST_TABLE")

val hBaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.count()
This works, but it is very slow: on a test HBase table with just 16 million records (each having only one column) it takes 4.5 minutes to execute the count. (The test table is spread across 2 HBase regions, so the data is read by only 2 Spark executors.)
Do you have any ideas/suggestions for speeding up reading the data?
Unfortunately I have not found an alternative to reading the data with sc.newAPIHadoopRDD(), as all HBase-Spark connectors I came across seem to require the "schema" to be provided beforehand.
Thanks in advance for your help!
Daniel
Created ‎06-29-2017 06:24 PM
You may want to look at the Spark HBase connector, which is DataFrame-based and should give better performance.
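For illustration, a minimal sketch of how such a DataFrame-based connector (e.g. the Hortonworks shc connector) is typically used. The catalog below is only an example: the table, column family, and column names are placeholders, and the exact package name can differ between connector versions.

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Example catalog: maps HBase cells to DataFrame columns.
// Table/column names here are illustrative, not taken from the question above.
val catalog =
  s"""{
     |  "table": {"namespace": "default", "name": "TEST_TABLE"},
     |  "rowkey": "key",
     |  "columns": {
     |    "rowkey": {"cf": "rowkey", "col": "key",  "type": "string"},
     |    "col1":   {"cf": "cf1",    "col": "col1", "type": "string"}
     |  }
     |}""".stripMargin

// Read the table as a DataFrame; column pruning and filters can be pushed down.
val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

df.count()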
Created ‎06-30-2017 06:02 AM
I already came across this connector, but it requires you to define the columns you want to read (right?), which is not an option in my case: the column names contain product IDs, for example, so there is a huge number of distinct values and they also change over time...
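(For completeness, a sketch of how such dynamic columns could still be processed with the existing newAPIHadoopRDD approach. The column family name "cf" below is an assumption; each Result exposes all cells of a family as a qualifier-to-value map, so the column names don't need to be known up front.)

import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// Assumption: all data lives in a single column family, here called "cf".
val rows = hBaseRDD.map { case (_, result) =>
  val rowKey = Bytes.toString(result.getRow)
  // getFamilyMap returns all qualifier -> value pairs of this row for that family
  val cells = result.getFamilyMap(Bytes.toBytes("cf")).asScala.map {
    case (qualifier, value) => Bytes.toString(qualifier) -> Bytes.toString(value)
  }
  (rowKey, cells.toMap)
}

rows.take(5).foreach(println)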
Created ‎02-27-2018 10:03 AM
Apply a FilterList with a KeyOnlyFilter and a FirstKeyOnlyFilter to the Scan, so that only row keys (no values) are shipped to Spark, then pass the Scan to TableInputFormat:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{FilterList, FirstKeyOnlyFilter, KeyOnlyFilter}

val scan = new Scan()  // don't call this "sc", that would shadow the SparkContext
scan.setFilter(new FilterList(new KeyOnlyFilter(), new FirstKeyOnlyFilter()))

val conf = HBaseConfiguration.create()
conf.addResource(new Path("/usr/hdp/current/hbase-client/conf/hbase-site.xml"))
conf.set("hbase.zookeeper.quorum", "xxx")
conf.set(TableInputFormat.INPUT_TABLE, "TEST_TABLE")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))  // helper sketched below

val hBaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.count()
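If convertScanToString is not available as a public helper in your HBase version (TableMapReduceUtil.convertScanToString is not public in every release), a minimal sketch of it, assuming an HBase 1.x client as shipped with HDP at the time:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.Base64

// Serializes the Scan into the Base64 string that TableInputFormat.SCAN expects.
def convertScanToString(scan: Scan): String =
  Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray)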
