Do HBase and HDFS need to be co-located on the same machines? If so, how much?

Expert Contributor

Customer has "Cluster A" (a 20-node standard Hadoop cluster: HDFS, YARN, Hive, etc., but no HBase). Customer is adding "Cluster B" (6 nodes dedicated to HBase). Cluster A and Cluster B are on neighboring racks in the same datacenter, same VLAN, etc.

Is it technically safe/possible to install the RegionServers in "Cluster B", but point them to the HDFS instance in "Cluster A"?

If this is possible, what compromises would we make in terms of HBase performance? Would certain SCANs be slower because the RegionServers load remote HFiles into memory? Would writes be slower because no DataNode service runs alongside the HBase servers in Cluster B?

Thanks!

1 ACCEPTED SOLUTION

Super Guru

As long as the networks are routable, running HBase without co-locating it with HDFS should work (ZooKeeper gets scary; consider everything below to apply only to HDFS and HBase).

I believe writes will only be slower by whatever the underlying network connection adds. Each write is still bound by the sync of the slowest of the three DataNodes hosting the block's replicas, so I don't think you'll pay a much larger penalty here.
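To make that reasoning concrete, here is a rough toy model in Python. It is only a sketch: the function name and all the numbers are made up for illustration, and it ignores everything about the real HDFS write pipeline except the two terms mentioned above (a network cost plus the slowest replica's sync).

```python
def write_ack_latency_ms(network_ms, replica_sync_ms):
    """Toy model: network cost to reach the pipeline, plus the sync time
    of the slowest DataNode holding a replica of the block."""
    return network_ms + max(replica_sync_ms)


# Illustrative numbers only, not measurements.
sync_times = [2.0, 3.5, 2.5]  # three DataNodes hosting the replicas

colocated = write_ack_latency_ms(0.1, sync_times)  # hypothetical same-rack cost
remote = write_ack_latency_ms(0.5, sync_times)     # hypothetical cross-rack cost

print(f"co-located: {colocated:.2f} ms, remote: {remote:.2f} ms")
```

In this model the slowest sync dominates both cases, and the gap between them is just the difference in network cost.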

Reads, however, will likely be noticeably slower. HBase largely expects to take advantage of an HDFS feature known as "short-circuit reads". This feature uses resources that a DataNode and a RegionServer on the same host can share, namely a shared memory segment and a Unix domain socket, so the RegionServer can read data from its local DataNode without going through a TCP socket. I'm sure there are performance numbers floating around (I don't recall specifics), but the gain is noticeable when short-circuit reads can be used, which is why many metrics track how local a RegionServer is to the data of the Regions it hosts.
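As a rough illustration of the locality idea, the sketch below computes the fraction of a Region's HFile bytes that have a replica on a given host. This is a simplified model, not HBase's actual code; the function name, host names, and block sizes are all hypothetical.

```python
def locality_index(blocks, host):
    """Fraction of bytes whose block has a replica on `host`.

    `blocks` is a list of (size_in_bytes, replica_hosts) pairs.
    """
    total = sum(size for size, _ in blocks)
    if total == 0:
        return 0.0
    local = sum(size for size, hosts in blocks if host in hosts)
    return local / total


# Hypothetical block layout: every replica lives on a Cluster A DataNode.
blocks = [
    (128, {"a-dn1", "a-dn2", "a-dn3"}),
    (128, {"a-dn2", "a-dn4", "a-dn5"}),
    (64, {"a-dn1", "a-dn3", "a-dn6"}),
]

colocated = locality_index(blocks, "a-dn1")  # RegionServer on a DataNode host
separate = locality_index(blocks, "b-rs1")   # RegionServer on a Cluster B host

print(colocated, separate)
```

A RegionServer in Cluster B never shares a host with a DataNode, so in this model its locality is always zero and none of its reads are eligible for short-circuiting.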

4 REPLIES

Master Mentor

@Wes Floyd Great question! @Enis @Josh Elser


Apache JIRA HDFS-347 contains some benchmarks related to HDFS short-circuit read. There is a lot of commentary on that issue, so it would take some effort to scan through and find the relevant comments about the benchmarks.

Expert Contributor

Very helpful guys. Appreciated!