
Sharding in HDFS

Rising Star

Hi,

Is it possible to achieve sharding in HDFS, even though HDFS has its own replication mechanism? What if I want to use a sharding mechanism in HDFS? If it's possible, how can one achieve it?


4 REPLIES

Contributor

I would answer this question by asking what you are trying to achieve. Sharding (as I understand it) is used in traditional databases to do some of the distributed work that Hadoop does, but in a different way.

The "horizontal partitioning" in shards sounds similar to column-oriented storage. See ORC files in Hive.

The "distributing tables across servers to spread the load" part of sharding is what HDFS does natively.

If you are trying to do in Hadoop what you do in a relational database, then I would advise that you take a deeper look at the way that Hadoop works.

It is also possible that I've misunderstood your question, and what you are trying to achieve.

Super Guru

@Justin Watkins

I would not associate sharding with traditional RDBMS databases, where it is mostly an exception, but with NoSQL databases like MongoDB, where it is mostly the rule.

@Viraj Vekaria

What are you trying to achieve?

Rising Star

@Constantin Stanca: I am trying to find out whether sharding is done by HDFS automatically. If not, how can we achieve it? But I think, as per Justin's answer, it's done natively.

So basically I just want sharding, in whatever way it can be done!

Super Guru

@Viraj Vekaria

Yes. HDFS does automated sharding (it splits files into blocks and distributes them across DataNodes), and you have no direct control over it. One rarely thinks about sharding a file system like HDFS; the term usually applies to an actual database such as an RDBMS, MongoDB, or HBase. It may be semantics, but asking to "use" sharding in HDFS implies manual control, and in HDFS it is done automatically. At most, you can change the global replication factor or the replication factor of an individual file, but you cannot control which blocks are placed on which nodes, so there is no data-locality control.
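
As a minimal sketch of the little control you do have, the Hadoop FileSystem API lets you set the default replication factor for files your client writes and change the replication factor of an existing file (the path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and dfs.replication from core-site.xml / hdfs-site.xml
        Configuration conf = new Configuration();

        // Override the default replication factor for files created by this client
        conf.setInt("dfs.replication", 2);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Change the replication factor of an existing file
            Path file = new Path("/data/example.csv");
            boolean scheduled = fs.setReplication(file, (short) 2);
            System.out.println("Re-replication scheduled: " + scheduled);
        }
    }
}

Note that this only controls how many replicas exist, not which DataNodes receive them; block placement remains the NameNode's decision.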

Since @Justin Watkins mentioned traditional RDBMS, I also mentioned MongoDB, and you asked about HDFS, I will summarize how these three approach scalability, with HBase added to the mix.

Traditional RDBMSs often run into scalability and data-replication bottlenecks when handling large data sets. There are creative ways to build master-slave setups to gain some scalability and performance, but all of them come from your own design, not from out-of-the-box sharding.

MongoDB sharding can be applied to distribute data across multiple systems for horizontal scalability as needed.

Like MongoDB, Hadoop's HBase database achieves horizontal scalability through sharding: tables are split into regions that are distributed across RegionServers. Storage is handled by HDFS, while HBase adds an optional data structure on top that organizes data into column families (versus the fixed two-dimensional rows-and-columns layout of an RDBMS). Data can then be indexed (for example with Solr), queried with Hive, or processed by analytics or batch jobs from the Hadoop ecosystem or your choice of business intelligence platform.
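
If you want sharding you can actually steer, HBase exposes it directly: a table can be pre-split into regions (shards) by row-key ranges at creation time. Below is a minimal sketch using the HBase 2.x Java client; the table name, column family, and split keys are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for ZooKeeper quorum etc.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Table "events" with a single column family "d"
            TableDescriptorBuilder table =
                TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"));

            // Pre-split into 4 regions (shards) by row-key ranges
            byte[][] splitKeys = {
                Bytes.toBytes("25"), Bytes.toBytes("50"), Bytes.toBytes("75")
            };
            admin.createTable(table.build(), splitKeys);
        }
    }
}

HBase then distributes these regions across RegionServers and splits them further as they grow, while the underlying files still live in HDFS.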

If any of the responses is helpful, please don't forget to vote and accept the best answer. Thanks.