Support Questions

wfloyd · ‎02-09-2016

Customer has "Cluster A" (a 20-node standard Hadoop cluster: HDFS, YARN, Hive, etc., but no HBase). Customer is adding "Cluster B" (6 nodes dedicated to HBase). Cluster A and Cluster B are on neighboring racks in the same datacenter, same VLAN, etc.

Is it technically safe/possible to install the RegionServers in "Cluster B", but point them to the HDFS instance in "Cluster A"?

If this is possible, what compromises would we make in terms of HBase performance? Would certain SCANs be slower because the RegionServers load remote HFiles into memory? Would writes be slower because no DataNode service runs alongside the HBase servers in Cluster B?

Thanks!

elserj · ‎02-10-2016

As long as the networks are routable, running HBase without co-locating it with HDFS should be functional. (ZooKeeper gets scary; consider all of the following to apply only to HDFS and HBase.)

I believe writes will be slower only by whatever the underlying network connection adds. Each write will still be bound by the sync of the slowest of the three DataNodes hosting the replicas for the block, so I don't think you'll pay a much larger penalty here.
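A rough sketch of that reasoning, as a toy model rather than a measurement: if a write is acknowledged only once the slowest replica has synced, then moving the RegionServer off the DataNode hosts adds network time on top of a cost that was already dominated by the slowest DataNode. The function name, the sync times, and the 0.3 ms network figure are all made-up assumptions for illustration.

```python
def write_latency_ms(replica_sync_ms, extra_network_ms=0.0):
    """Toy model: the write completes when the slowest replica has synced.

    replica_sync_ms: hypothetical sync times for the DataNodes holding the block.
    extra_network_ms: hypothetical added network cost when no replica is local.
    """
    return max(replica_sync_ms) + extra_network_ms


# Illustrative numbers only, not taken from any benchmark.
sync_times = [2.0, 3.5, 5.0]

colocated = write_latency_ms(sync_times)
remote = write_latency_ms(sync_times, extra_network_ms=0.3)

print(f"co-located: {colocated:.1f} ms, remote: {remote:.1f} ms")
```

Under these assumptions the slowest replica sets the floor in both layouts, which is why the write penalty should stay small on a fast same-datacenter network.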

Reads, however, will likely be noticeably slower. HBase largely expects to take advantage of a feature in HDFS known as "Short Circuit Reads". This feature takes advantage of the local resources that the DataNode and the RegionServer share, using a shared memory segment and a Unix domain socket. This avoids a TCP socket when the local RegionServer can read data from the local DataNode. I'm sure there are performance numbers floating around out there (I don't recall specifics), but this is a noticeable performance gain when short-circuit reads can be used (and it is why many metrics consider the locality of a RegionServer to the data of the Regions it is hosting).
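For anyone checking their own setup, here is a minimal sketch that inspects an `hdfs-site.xml` for the two client settings short-circuit reads rely on, `dfs.client.read.shortcircuit` and `dfs.domain.socket.path`. The sample XML, the socket path value, and the helper names are assumptions for illustration; point it at your real config instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample config; the socket path is only an example value.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict of name -> value from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def short_circuit_configured(props):
    """True if short-circuit reads are switched on and a domain socket path is set."""
    enabled = props.get("dfs.client.read.shortcircuit", "false").lower() == "true"
    has_socket = bool(props.get("dfs.domain.socket.path"))
    return enabled and has_socket


props = read_properties(SAMPLE_HDFS_SITE)
print("short-circuit reads configured:", short_circuit_configured(props))
```

Note that the settings only help when the RegionServer and a DataNode holding the block run on the same host. In the Cluster A / Cluster B layout described above, every read would go over TCP regardless of what the config says.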

nsabharwal · ‎02-09-2016

@Wes Floyd Great question! @Enis @Josh Elser

cnauroth · ‎02-10-2016

Apache JIRA HDFS-347 contains some benchmarks related to HDFS short-circuit read. There is a lot of commentary on that issue, so it would take some effort to scan through and find the relevant comments about the benchmarks.

wfloyd · ‎02-12-2016

Very helpful, guys. Appreciated!
