Amount of data storage : HDFS vs NoSQL

Rising Star

Several sources on the internet explain that HDFS is built to handle larger amounts of data than NoSQL solutions (Cassandra, for example). In general, once we go beyond 1 TB, we are told to start thinking Hadoop (HDFS) rather than NoSQL.

Besides the architecture, the fact that HDFS is batch-oriented while most NoSQL stores (e.g. Cassandra) perform random I/O, and the schema design differences, why can't NoSQL solutions such as Cassandra handle the same amount of data as HDFS?

Why can't we use those solutions as a data lake? Why do we use them only as hot storage in a big data architecture?

Thanks a lot.

tazimehdi.com
1 ACCEPTED SOLUTION

Expert Contributor

@Mehdi TAZI

As @Arpit Agarwal mentioned, this is not related to the CAP theorem. HDFS and Cassandra expose different kinds of interfaces, so an apples-to-apples comparison is not possible. From the papers and benchmarking results that I have seen, Cassandra deployments are often restricted to fewer than 1,000 nodes.

References: Planet Cassandra http://www.planetcassandra.org/nosql-performance-benchmarks/

Netflix Engineering Blog:

http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

http://techblog.netflix.com/2014/07/revisiting-1-million-writes-per-second.html

It is typical to see HDFS clusters with far more than 1,000 nodes, so the scale at which HDFS operates is very different from Cassandra's. Keep in mind, though, that 300-odd nodes dedicated to NoSQL storage can still store a large amount of data. Where HDFS shines is in the diverse set of applications that you can run on it. Cassandra addresses a very focused scenario, whereas HDFS is very general purpose. You can run a range of applications on it, including HBase, which provides functionality similar to Cassandra's.

So if you are an enterprise, you will often have needs that can only be addressed by different tools, and HDFS gives you access to a set of tools that operate on your data.

At this point in time, we have no data that says whether Cassandra can or cannot handle the same amount of data as HDFS. I think the only data point is that Cassandra benchmarks are typically run with a much smaller number of nodes.
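To put the node counts above in perspective, here is a rough back-of-the-envelope sketch in Python. The per-node disk size (4 TB) and the replication factor (3) are assumptions chosen purely for illustration; they do not come from the benchmarks linked above, and the function name is hypothetical.

```python
def usable_capacity_tb(nodes: int, disk_tb_per_node: float, replication_factor: int) -> float:
    """Approximate usable capacity in TB when every piece of data is
    stored `replication_factor` times across the cluster."""
    if nodes < 0 or disk_tb_per_node < 0:
        raise ValueError("nodes and disk_tb_per_node must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    raw_tb = nodes * disk_tb_per_node
    return raw_tb / replication_factor


if __name__ == "__main__":
    # Assumed values, for illustration only: 4 TB of disk per node, 3 replicas.
    for nodes in (300, 1000, 4000):
        capacity = usable_capacity_tb(nodes, disk_tb_per_node=4.0, replication_factor=3)
        print(f"{nodes} nodes -> roughly {capacity:.0f} TB usable")
```

Under these assumptions the arithmetic is the same for either system: capacity grows with the number of nodes, so the open question is how many nodes each system is run with in practice, not a per-node storage limit.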


9 REPLIES

Master Mentor
@Mehdi TAZI

"Why can't NoSQL solutions, Cassandra for example, handle the same amount of data as HDFS?"

You can find a good explanation here

Rising Star

Thanks a lot. I had already seen this post, and it still doesn't answer why Cassandra can't manage the same amount of data as Hadoop does.

NB: the accepted answer there is not completely true; Cassandra doesn't run over HDFS.


Master Mentor

@Mehdi TAZI Agreed on the Cassandra file system. It's CFS.

I wouldn't compare Cassandra with HDFS. HDFS is a storage layer and Cassandra is a NoSQL database.

Master Mentor

@Mehdi TAZI Hope it was helpful.


@Mehdi TAZI HDFS is not NoSQL. NoSQL solutions place schemas (albeit flexible and loose schemas) on the data and are considered alternatives to traditional relational systems. HDFS is scalable, redundant storage and assumes no structure on the data.

Many NoSQL solutions in fact use HDFS for their storage. The point is that when you land your data (pdf, txt, json, xml...) in HDFS, you have the flexibility to operate on that data with any tool you choose. In many cases, the tools you can use to analyze data structured in a NoSQL solution are limited.

If you want to dig further, I suggest reading up on the CAP Theorem. All database systems must adhere to the CAP Theorem. Because HDFS is storage, it doesn't have this limitation.

Rising Star

Hello! Thanks a lot for your answer. I did read about the CAP theorem, but I still can't see why Cassandra can't handle the same amount of data as Hadoop does.



Cassandra uses a filesystem similar to HDFS, so yes, Cassandra, like HBase, can scale. The difference is that Cassandra is a solution, while Hadoop, and HDFS in particular, is a platform. Use Cassandra for specific use cases and access patterns, but use HDFS as your data lake.


That sounds wrong. The CAP theorem is an assertion about tradeoffs in all distributed systems and is equally applicable to HDFS. We do make tradeoffs within HDFS to prioritize consistency.
