We have to read a large dataset from an HBase table and then deduplicate it against a small CSV file. It seems we are not using the optimal method to read. Can anyone help?
I guess you'd like to reduce the execution time of your job? Can you provide some more details about it? Are you using the hbase-connector? What does deduplication mean here — does it happen on all attributes, or just on some of them? In my experience it is typically faster to determine duplicates by first calculating a hash per record and comparing the hashes, instead of comparing the attributes one by one.
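To illustrate the hash-comparison idea, here is a minimal sketch in plain Python (since your exact stack isn't confirmed): the small CSV side is reduced to a set of digests, and each row from the large scan is kept only if its digest isn't already known. The row layout, attribute names, and sample data are assumptions for the example.

```python
import csv
import hashlib
import io

def row_hash(row, keys):
    """Hash the selected attributes so a duplicate check compares one
    digest instead of each attribute one by one."""
    # Join with a unit separator so ("ab","c") and ("a","bc") don't collide.
    joined = "\x1f".join(str(row[k]) for k in keys)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Small CSV side: load once, keep only the hashes in memory.
csv_text = "id,name\n1,alice\n2,bob\n"  # stand-in for your CSV file
known = {row_hash(r, ["id", "name"]) for r in csv.DictReader(io.StringIO(csv_text))}

# Large side: stream rows (e.g. from an HBase scan) and drop known duplicates.
scanned = [{"id": "1", "name": "alice"}, {"id": "3", "name": "carol"}]
deduped = [r for r in scanned if row_hash(r, ["id", "name"]) not in known]
# deduped now contains only the row with id 3
```

In a distributed job the same shape applies: broadcast the small set of hashes to the workers and filter the large scan against it, rather than shipping the full CSV attributes around.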
Would it be an option to store the records already deduplicated in HBase and just add columns or versions, or is changing the HBase 'feeding' side outside your control anyway?