Support Questions

Pattern · ‎09-18-2022

We need to load a big part of data from HBase using Spark.

Then we put it into Kafka and read by consumer. But consumer is too slow

At the same time Kafka memory is not enough to keep all scan result.

Our key contain `...yyyy.MM.dd`, and now we load 30 days in one Spark job, using operator filter.But we cant split job to many jobs, (30 jobs filtering each day), cause then each job will have to scan all HBase, and it will make summary scan to slow.

Now we launch Spark job with 100 threads, but we cant make speed slower by set less threads (for example 7 threads). Cause Kafka is used by third hands developers, that make Kafka sometimes too busy to keep any data. So, we need to control HBase scan speed, checking all time is there a memory in Kafka to store our data

We try to save scan result before load to Kafka into some place, for example in ORC files in hdfs, but scan result make many little files, it is problem to group them by memory (or there is a way, if you know please tell me how?), and store into hdfs little files bad. And merging such a files is very expensive operation and spend a lot of time that will make total time too slow

Sugess solutions:

1) Maybe it is possible to store scan result in hdfs by spark, by set some special flag in `filter` operator and then run 30 spark jobs to select data from saved result and put each result to Kafka when it possible

2) Maybe there is some existed mechanism in spark to stop and continue launched jobs

3) Maybe there is some existed mechanism in spark to separate result by batches (without control to stop and continue loading)

4) Maybe there is some existed mechanism in spark to separate result by batches (with control to stop and continue loading by external condition)

5) Maybe when Kafka will throw an exception (that there is no place to store data), there is some backpressure mechanism in spark that will stop scan for some time if there some exceptions appear in execution (but i guess that there is will be limited retry of restarting to execute operator, is it possible to set restart operation forever, if it is a real solution?). But better to keep some free place in Kafka, and not to wait untill it will be overloaded

6) Do using PageFilter in HBase (but i guess that it is hard to realize), or other solutions variants? And i guess that there is too many objects in memory to use PageFilter

P.S

This https://github.com/hortonworks-spark/shc/issues/108 will not help, we already use filter

Any ideas would be helpful

Cloudera Community

Support Questions

HBase batch loading with speed control cause of slow consumer