Created 04-17-2018 09:51 AM
Hello All,
My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.
Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have SQL-style interfaces. However, Spark has its own SQL. Why would one use Cassandra or Hive instead of Spark's native SQL, assuming this is a brand-new project with no existing installation?
Help me out.
Thanks
Created 04-17-2018 10:25 AM
Let me try to explain, though not all the details will be included:
Cassandra is a BigTable-style database that can use multiple nodes for storage and processing, and its files can be stored in HDFS or other filesystems on the nodes. Cassandra comes with a query interface that has a SQL style (but does not cover the full SQL standard). Cassandra is not part of Hadoop, but it can be used together with Hadoop and HDFS.
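As a rough sketch of what that connectivity looks like from the Spark side, here is a minimal PySpark example reading a Cassandra table through the DataStax spark-cassandra-connector. The host, keyspace, and table names are made up for illustration, and the connector package has to be available to Spark:

    from pyspark.sql import SparkSession

    # Assumes the spark-cassandra-connector is on the Spark classpath and a
    # Cassandra cluster is reachable at the configured connection host.
    spark = (SparkSession.builder
             .appName("cassandra-read-example")
             .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
             .getOrCreate())

    # 'my_keyspace' and 'users' are hypothetical names for this sketch.
    users = (spark.read
             .format("org.apache.spark.sql.cassandra")
             .options(keyspace="my_keyspace", table="users")
             .load())

    users.show()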
Hive itself is a layer that provides SQL-style access to flat files in HDFS, to other databases connected to Hive via JDBC, or to internal Hive tables (also stored in HDFS). Hive is part of Hadoop and is typically described as Hadoop's data warehouse (DWH) layer, as it brings SQL query capabilities and a structured view on top of the files.
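From Spark, those Hive tables can be queried directly. A minimal sketch, assuming a configured Hive metastore and a hypothetical 'sales' table:

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark use the Hive metastore, so existing
    # Hive tables can be queried with Spark SQL.
    spark = (SparkSession.builder
             .appName("hive-example")
             .enableHiveSupport()
             .getOrCreate())

    # 'sales' is a hypothetical Hive table stored in HDFS.
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()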
Spark is a parallel processing framework that is now also part of Hadoop, but was originally developed independently. It can access HDFS files, streaming data, external SQL databases, and much more. It is typically considered an alternative to the MapReduce framework, which was the basis of the first Hadoop versions (together with HDFS) and is still part of Hadoop. For processing the data it offers several options for writing code, e.g. in Python or R, or even SQL. Which one fits depends, of course, on the data you are processing; image data, for example, is not well handled with Spark SQL.
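To make the "alternative to MapReduce" point concrete, here is a short word-count sketch in PySpark; the HDFS path is a placeholder:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split, col

    spark = SparkSession.builder.appName("wordcount-example").getOrCreate()

    # Hypothetical HDFS path; any text file the cluster can reach works.
    lines = spark.read.text("hdfs:///data/sample.txt")

    # The classic MapReduce word count expressed as a few DataFrame operations.
    counts = (lines
              .select(explode(split(col("value"), r"\s+")).alias("word"))
              .groupBy("word")
              .count())

    counts.show()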
Created 04-18-2018 04:10 AM
Hi @Harald Berghoff, thanks for the information.