I have read in many places that HBase does not perform well for joins but performs well for random reads and writes. My question is: would it still perform well for a bulk scan of an HBase table by full rowkey? For example, reading 30% of the table, where the requested rowkeys are random and spread across the whole table rather than confined to a few regions.
Consider an HBase table whose regions are evenly distributed across many region servers. If an external table is created over this table in Hive, and that external table is joined with a Hive managed table on the HBase rowkey, would the huge number of rowkey lookups against HBase during the join become a performance bottleneck?
If so, could you please explain why?
I can't speak to the performance hit in this specific scenario, but if other workloads are hitting HBase at the same time, it could still be an issue. There is a great feature that lets you map a Hive schema to an HBase snapshot, which promises much better performance than querying HBase directly. Please take a look at https://community.hortonworks.com/content/kbentry/14806/working-with-hbase-and-hive-wip.html for an example. Essentially, you map a Hive external table to an HBase snapshot, run your analysis, and then remove the snapshot. This bypasses the HBase RegionServers altogether and uses MapReduce instead.
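A rough sketch of that workflow is below. It is only an illustration under assumptions, not taken from the linked article: the HBase table `orders`, the snapshot `orders_snap`, the column family `cf`, the Hive tables `hbase_orders` and `customers`, and the restore directory are all hypothetical names. The two `hive.hbase.snapshot.*` properties are the Hive settings that point the HBase storage handler at a snapshot instead of the live table.

```sql
-- Step 1 (run in the HBase shell, not in Hive): take a snapshot of the table.
--   snapshot 'orders', 'orders_snap'

-- Step 2 (Hive): external table mapped to the HBase table.
-- Table and column names here are hypothetical.
CREATE EXTERNAL TABLE hbase_orders (rowkey STRING, amount STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:amount')
TBLPROPERTIES ('hbase.table.name' = 'orders');

-- Step 3 (Hive): read from the snapshot rather than the live table.
-- The restore directory is an assumed scratch path.
SET hive.hbase.snapshot.name=orders_snap;
SET hive.hbase.snapshot.restoredir=/tmp/hbase_snapshot_restore;

-- Step 4 (Hive): join on the rowkey against a managed table.
SELECT c.customer_id, o.amount
FROM customers c
JOIN hbase_orders o ON c.order_key = o.rowkey;

-- Step 5 (HBase shell): remove the snapshot when the analysis is done.
--   delete_snapshot 'orders_snap'
```

Keep in mind that a snapshot is a point-in-time view, so writes made to the table after step 1 will not show up in the join.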