Created 01-19-2016 01:02 PM
Several sources on the internet explain that HDFS is built to handle larger amounts of data than NoSQL solutions (Cassandra, for example); in general, once we go beyond 1 TB we should start thinking Hadoop (HDFS) rather than NoSQL.
Besides the architecture, the fact that HDFS is oriented toward batch access while most NoSQL stores (e.g. Cassandra) are oriented toward random I/O, and the schema-design differences, why can't NoSQL solutions such as Cassandra handle the same amount of data as HDFS?
Why can't we use those solutions as a data lake? Why do we only use them as hot-storage solutions in a big data architecture?
Thanks a lot.
Created 01-21-2016 12:20 AM
As @Arpit Agarwal mentioned, this is not related to the CAP theorem. HDFS and Cassandra expose different kinds of interfaces, so an apples-to-apples comparison is not possible. From the papers and benchmarking results I have seen, Cassandra clusters are often restricted to fewer than 1,000 nodes.
References:
Planet Cassandra: http://www.planetcassandra.org/nosql-performance-benchmarks/
Netflix Engineering Blog :
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
http://techblog.netflix.com/2014/07/revisiting-1-million-writes-per-second.html
It is typical to see HDFS clusters with sizes far beyond 1,000 nodes, so the scale at which HDFS operates is very different from Cassandra's. Keep in mind that 300-odd nodes dedicated to NoSQL storage can still store large amounts of data. Where HDFS shines, however, is the diverse set of applications you can run on it. Cassandra addresses a very focused scenario, whereas HDFS is very general purpose: you can run a whole set of applications on it, including HBase, which provides functionality comparable to Cassandra's.
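To make the "different kinds of interfaces" point concrete, here is a minimal toy sketch (plain Python, no real HDFS or Cassandra client; the class names are invented for illustration). HDFS exposes a filesystem-style API of append-only files that batch jobs scan whole, while Cassandra exposes row-level reads and writes keyed by a partition key:

```python
# Toy illustration only: neither class talks to a real cluster.
# ToyHdfs and ToyCassandra are invented names for this sketch.

class ToyHdfs:
    """Filesystem-style interface: append-only files, read as whole streams."""
    def __init__(self):
        self._files = {}

    def append(self, path, data):
        # Files are append-only; there are no in-place row updates.
        self._files.setdefault(path, []).append(data)

    def read(self, path):
        # A batch job typically scans the entire file front to back.
        return b"".join(self._files[path])


class ToyCassandra:
    """Row-store interface: random reads/writes by partition key."""
    def __init__(self):
        self._rows = {}

    def insert(self, key, row):
        self._rows[key] = row       # upsert of a single row

    def get(self, key):
        return self._rows.get(key)  # point lookup, no full scan needed


hdfs = ToyHdfs()
hdfs.append("/logs/2016-01-19", b"event1\n")
hdfs.append("/logs/2016-01-19", b"event2\n")

cass = ToyCassandra()
cass.insert("user:42", {"name": "mehdi"})
```

The first model favors large sequential scans across many nodes (the data-lake pattern); the second favors low-latency point access (the hot-storage pattern), which is part of why benchmarking one against the other is not meaningful.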
So if you are an enterprise, it is often the case that you have needs that can only be addressed by different tools, and HDFS provides access to a set of tools that operate on your data.
At this point in time we have no data saying Cassandra can or cannot handle the same amount of data as HDFS; I think the only data point is that Cassandra benchmarks are typically run with a much smaller number of nodes.
Created 01-19-2016 01:16 PM
Why can't NoSQL solutions, Cassandra for example, handle the same amount of data as HDFS?
You can find a good explanation here.
Created 01-19-2016 01:41 PM
Thanks a lot. I had already seen that post, and there is still no answer to why Cassandra can't manage the same amount of data as Hadoop does.
NB: the accepted answer is not completely true; Cassandra doesn't run over HDFS.
Created 01-19-2016 02:19 PM
@Mehdi TAZI Agreed on the Cassandra file system: it's CFS.
I wouldn't compare Cassandra with HDFS. HDFS is a storage layer and Cassandra is a NoSQL database.
Created 01-19-2016 05:44 PM
@Mehdi TAZI Hope it was helpful.
Created 01-19-2016 01:16 PM
@Mehdi TAZI HDFS is not NoSQL. NoSQL solutions place schemas (albeit flexible and loose schemas) on the data and are considered alternatives to traditional relational systems. HDFS is scalable, redundant storage and assumes no structure on the data.
Many NoSQL solutions in fact use HDFS for their storage. The point is that when you land your data (PDF, txt, JSON, XML, ...) in HDFS, you have the flexibility to operate on that data with any tool you choose. In many cases, the tools you can use to analyze data structured in a NoSQL solution are limited.
If you want to dig further, I suggest reading up on the CAP Theorem. All database systems must adhere to the CAP Theorem. Because HDFS is storage, it doesn't have this limitation.
Created 01-19-2016 01:39 PM
Hello! Thanks a lot for your answer. I did read about the CAP theorem, but I still can't see why Cassandra can't handle the same amount of data as Hadoop does.
Created 01-19-2016 02:09 PM
Cassandra uses a filesystem similar to HDFS, so yes, Cassandra, like HBase, can scale. The difference is that Cassandra is a solution while Hadoop, and HDFS in particular, is a platform. Use Cassandra for specific use cases and access patterns, but use HDFS as your data lake.
Created 01-20-2016 11:53 PM
That sounds wrong. The CAP theorem is an assertion about tradeoffs in all distributed systems and is equally applicable to HDFS. We do make tradeoffs within HDFS to prioritize consistency.
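As a toy sketch of the tradeoff being described (the names and behavior are invented for illustration, not any real system's API): during a network partition, a replicated store must either refuse writes to stay consistent, which is the side HDFS prioritizes, or accept them and let replicas diverge:

```python
# Toy two-replica store; a "partitioned" flag simulates a network split.
# Invented for illustration; not how HDFS or Cassandra actually behave.

class Replica:
    def __init__(self):
        self.data = {}

class ToyStore:
    def __init__(self, mode):
        self.mode = mode            # "CP" (consistency) or "AP" (availability)
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False

    def write(self, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write (sacrifice availability).
                raise RuntimeError("unavailable during partition")
            # Availability first: apply to the reachable replica only.
            self.a.data[key] = value
            return
        # Healthy network: replicate to both.
        self.a.data[key] = value
        self.b.data[key] = value

    def consistent(self):
        return self.a.data == self.b.data


cp = ToyStore("CP")
ap = ToyStore("AP")
for store in (cp, ap):
    store.write("k", 1)
    store.partitioned = True

ap.write("k", 2)           # accepted, but the replicas now disagree
try:
    cp.write("k", 2)       # rejected to preserve consistency
    cp_rejected = False
except RuntimeError:
    cp_rejected = True
```

Neither choice lets a distributed system escape the theorem; each system just picks which property to give up when the network misbehaves.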