
What is the best way to read data from massive hbase table with spark 2? Can we optimize this process?

New Contributor

We have to read a large dataset from an HBase table and then deduplicate it against a small CSV file. It seems we are not using the optimal method to read it. Can anyone help?
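
For context, a minimal sketch of one common way to do this in Spark 2 (Scala): scan the HBase table via TableInputFormat and broadcast-join the small CSV. The table name, column family, columns, and path below are placeholders, and the left_anti join assumes the CSV lists keys that should be dropped; adjust to your schema.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("hbase-dedup").getOrCreate()
import spark.implicits._

// Configure the scan; restricting SCAN_COLUMNS to the columns you actually
// need is often the biggest win on a massive table.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")          // placeholder table name
hbaseConf.set(TableInputFormat.SCAN_COLUMNS, "cf:key cf:value")  // placeholder columns

val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Pull the needed cells out of each HBase Result and move to a DataFrame.
val hbaseDf = hbaseRdd.map { case (_, row) =>
  val key   = Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("key")))
  val value = Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value")))
  (key, value)
}.toDF("key", "value")

// The CSV is small, so broadcast it and avoid shuffling the big HBase side.
val csvDf = spark.read.option("header", "true").csv("/path/to/small.csv")
val cleaned = hbaseDf.join(broadcast(csvDf), Seq("key"), "left_anti")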

1 REPLY

Re: What is the best way to read data from massive hbase table with spark 2? Can we optimize this process?

Super Collaborator

I guess you would like to reduce the execution time of your job? Can you provide some more details about it? Are you using the hbase-connector? What does deduplication mean here: does it happen on all attributes, or just on some of them? In my experience it is typically faster to determine duplicates by first calculating a hash and comparing hashes, instead of comparing the attributes one by one.
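
For illustration, a rough sketch of that hash idea in Spark (Scala), reusing the hypothetical hbaseDf from the question above; sha2 and concat_ws are standard Spark SQL functions, and the column list stands in for whatever attributes define a duplicate in your data:

import org.apache.spark.sql.functions.{col, concat_ws, sha2}

// Build one hash over every attribute that defines a duplicate, then
// deduplicate on that single column instead of comparing attributes one by one.
// The separator avoids collisions like "ab"+"c" vs "a"+"bc"; note that
// concat_ws skips nulls, so coalesce them to a marker first if null vs
// empty string matters for your definition of a duplicate.
val withHash = hbaseDf.withColumn(
  "row_hash",
  sha2(concat_ws("\u0001", col("key"), col("value")), 256))

val deduped = withHash.dropDuplicates("row_hash").drop("row_hash")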
Would it be an option to store the records already deduplicated in HBase and just add columns or versions, or is changing how HBase is fed outside your control anyway?