Do HBase and HDFS need to be co-located on the same machines? If so, how much?

Customer has "Cluster A" (a 20-node standard Hadoop cluster: HDFS, YARN, Hive, etc., but no HBase). Customer is adding "Cluster B" (6 nodes dedicated to HBase). Cluster A and Cluster B are on neighboring racks in the same datacenter, same VLAN, etc.

Is it technically safe/possible to install the RegionServers in "Cluster B", but point them to the HDFS instance in "Cluster A"?

If this is possible, what compromises would we make in terms of HBase performance? Would certain SCANs be slower because the RegionServers load remote HFiles into memory? Would writes be slower because no DataNode service runs alongside the HBase servers in Cluster B?

Thanks!

1 ACCEPTED SOLUTION

Re: Do HBase and HDFS need to be co-located on the same machines? If so, how much?

As long as the networks are routable, running HBase without co-locating it with HDFS should work (ZooKeeper gets scary; consider everything below to apply only to HDFS and HBase).

I believe writes will be slower only to the extent that the underlying network connection adds latency. Each write is still bound by the sync of the slowest of the three DataNodes hosting the replicas for the block, so I don't think you'll pay a much larger penalty here.
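To make that reasoning concrete, here is a toy model in Python. It is only a sketch of the argument above, not how HDFS computes anything: the function name, the parameter names, and all the numbers are hypothetical and chosen for illustration.

```python
def estimated_write_latency_ms(replica_sync_ms, extra_network_rtt_ms=0.0):
    """Toy model: a write is done once the slowest replica has synced,
    plus whatever extra round-trip cost a remote RegionServer adds."""
    if not replica_sync_ms:
        raise ValueError("need at least one replica sync time")
    return max(replica_sync_ms) + extra_network_rtt_ms


# Hypothetical sync times for the three replicas (not measurements).
sync_times = [2.0, 3.5, 2.8]

colocated = estimated_write_latency_ms(sync_times)
remote = estimated_write_latency_ms(sync_times, extra_network_rtt_ms=0.3)

print(f"co-located: {colocated:.1f} ms, remote: {remote:.1f} ms")
```

In this model the slowest replica dominates in both cases, and the remote setup only adds the extra hop between the racks.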

Reads, however, will likely be noticeably slower. HBase largely expects to take advantage of an HDFS feature known as "Short Circuit Reads". This feature uses the local resources that the DataNode and the RegionServer share: a shared memory segment and a Unix domain socket. That avoids a TCP socket when the RegionServer can read data from a DataNode on the same host. I'm sure there are performance numbers floating around out there (I don't recall specifics), but the gain is noticeable when short-circuit reads can be used (which is why many metrics track how local a RegionServer is to the data of the Regions it hosts).
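As a rough illustration of the locality idea, the sketch below computes the fraction of a RegionServer's data that has a replica on its own host. This is a simplified stand-in, not HBase's actual metric code; the host names, block sizes, and the `locality_index` helper are all hypothetical.

```python
def locality_index(blocks, regionserver_host):
    """Fraction of bytes with a replica on the RegionServer's own host.

    `blocks` is a list of (size_in_bytes, replica_hosts) tuples.
    Only local bytes are candidates for short-circuit reads.
    """
    total = sum(size for size, _ in blocks)
    if total == 0:
        return 0.0
    local = sum(size for size, hosts in blocks if regionserver_host in hosts)
    return local / total


# Hypothetical block layout: every replica lives on a Cluster A DataNode.
blocks = [
    (128, {"a-dn1", "a-dn2", "a-dn3"}),
    (128, {"a-dn2", "a-dn4", "a-dn5"}),
    (64, {"a-dn1", "a-dn3", "a-dn6"}),
]

colocated = locality_index(blocks, "a-dn1")  # RegionServer on a DataNode host
dedicated = locality_index(blocks, "b-rs1")  # Cluster B host, no DataNode

print(colocated, dedicated)
```

With no DataNode running in Cluster B, no block can ever be local to a RegionServer there, so under these assumptions every read goes over TCP to Cluster A.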

Re: Do HBase and HDFS need to be co-located on the same machines? If so, how much?

@Wes Floyd Great question! @Enis @Josh Elser

Re: Do HBase and HDFS need to be co-located on the same machines? If so, how much?

Apache JIRA HDFS-347 contains some benchmarks for HDFS short-circuit reads. There is a lot of commentary on that issue, so it will take some effort to scan through and find the comments that cover the benchmarks.

Re: Do HBase and HDFS need to be co-located on the same machines? If so, how much?

Very helpful guys. Appreciated!