Support Questions


What is the best way to read data from a massive HBase table with Spark 2? Can we optimize this process?

We have to read a large dataset from an HBase table and then deduplicate it against a small CSV file. It seems we are not using the optimal method to read it. Can anyone help?


Super Collaborator

I guess you would like to reduce the execution time of your job? Can you provide some more details on it? Are you using the hbase-connector? What does deduplication mean in your case: will it happen on all attributes, or just on some of them? In my experience it is typically faster to determine duplicates by first calculating a hash per record and comparing the hashes, instead of comparing the attributes one by one.
Would it be an option to store the records already deduplicated in HBase and just add columns or versions, or is changing the HBase 'feeding' outside your control anyway?
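To illustrate the hash-comparison idea, here is a minimal sketch in plain Python (not Spark; on a cluster the same digest could be computed per record in a map step). The `row_hash` and `deduplicate` helpers and the sample records are hypothetical, chosen just to show the technique:

```python
import hashlib

def row_hash(row, keys):
    """Hash the selected attributes of a record so duplicate checks
    compare one digest instead of every attribute."""
    payload = "\x1f".join(str(row[k]) for k in keys)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def deduplicate(rows, keys):
    """Keep the first record seen for each distinct hash of `keys`."""
    seen = set()
    out = []
    for row in rows:
        h = row_hash(row, keys)
        if h not in seen:
            seen.add(h)
            out.append(row)
    return out

# Hypothetical sample data standing in for the HBase rows / CSV records.
records = [
    {"id": "1", "name": "alice", "city": "berlin"},
    {"id": "2", "name": "alice", "city": "berlin"},  # duplicate on name+city
    {"id": "3", "name": "bob",   "city": "munich"},
]
unique = deduplicate(records, keys=["name", "city"])
print(len(unique))  # 2
```

Deduplicating only on the attributes that matter (here `name` and `city`) also answers the "all attributes or just some" question: the hash fixes which columns define a duplicate.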
