Best possible way to connect a Cassandra cluster to Hadoop

Expert Contributor

Hi,

Can I get some expert advice on the best ways to import Cassandra tables into a Hadoop cluster? Also, which ports should be open on the Hadoop side for the connection?

Thanks in advance.

1 ACCEPTED SOLUTION

Guru

Hi @PJ, the easiest and least intrusive way is to use Hortonworks DataFlow (HDF, powered by Apache NiFi) to quickly build a data flow that queries Cassandra and sends the results to HDFS. HDF/NiFi includes Cassandra processors that make the integration simple. Take a look at this article about ingesting data into Hadoop from an RDBMS; in your case you would use the QueryCassandra processor instead: https://community.hortonworks.com/articles/87686/rdbms-to-hive-using-nifi-small-medium-tables.html

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-cassandra-nar/1.2.0/org.apach...

As always, if you find this post useful, don't forget to accept the answer.
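
For illustration, the two processors in such a flow might be configured roughly as follows. This is only a sketch, written as a Python dictionary so the settings are easy to read. The host names, keyspace, table, and HDFS paths are made-up placeholders, and the property names are as I recall them from the NiFi 1.x processor documentation, so please check them against your version.

```python
# Hypothetical Cassandra nodes (placeholders, not from the original post).
CASSANDRA_NODES = ["cass1.example.com", "cass2.example.com"]
# 9042 is Cassandra's default port for CQL client connections.
CQL_PORT = 9042


def contact_points(hosts, port):
    """Build a comma-separated host:port list from the given hosts."""
    return ",".join(f"{host}:{port}" for host in hosts)


# Property names are assumptions based on the NiFi 1.x docs; values are placeholders.
flow = {
    "QueryCassandra": {
        "Cassandra Contact Points": contact_points(CASSANDRA_NODES, CQL_PORT),
        "Keyspace": "my_keyspace",
        "CQL select query": "SELECT * FROM my_table",
        "Output Format": "Avro",
    },
    "PutHDFS": {
        "Hadoop Configuration Resources": (
            "/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml"
        ),
        "Directory": "/data/cassandra/my_table",
    },
}

# Print the settings in the order you would enter them in the NiFi UI.
for processor, properties in flow.items():
    print(processor)
    for name, value in properties.items():
        print(f"  {name} = {value}")
```

The success relationship of QueryCassandra would be connected to PutHDFS, so each result set lands as a file in the target directory.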


6 REPLIES

Expert Contributor

Hi @Sonu Sahi

Thanks for your reply. What about using Sqoop import to bring Cassandra tables into HDFS from a Hadoop client?

Guru

Hi @PJ

If you want to use Sqoop instead of HDF/NiFi to import the tables, you will need a suitable JDBC driver for Cassandra. I'm not an expert on it, but I think DataStax provides one for their Enterprise software. I've seen quite a few reports of Sqoop not working very well with Cassandra without that driver, though. I think HDF/NiFi would be the better option.
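
If you do try the Sqoop route, the job would be a generic-JDBC import along the lines of the sketch below. It only assembles and prints the command line. The `--connect`, `--driver`, `--table`, `--target-dir`, and `--num-mappers` options are standard Sqoop 1.4 import options, but the JDBC URL format and the driver class name are hypothetical placeholders; the real values depend entirely on which Cassandra JDBC driver you obtain.

```python
import shlex


def build_sqoop_import(jdbc_url, driver_class, table, target_dir, mappers=1):
    """Return the argument list for a generic-JDBC `sqoop import` job."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--driver", driver_class,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


cmd = build_sqoop_import(
    # Hypothetical URL format: check your driver's documentation.
    jdbc_url="jdbc:cassandra://cass1.example.com:9042/my_keyspace",
    # Hypothetical class name: replace with the class your driver ships.
    driver_class="com.example.cassandra.jdbc.Driver",
    table="my_table",
    target_dir="/data/cassandra/my_table",
)

# Print a shell-safe version of the command for review before running it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

A single mapper is used here so that Sqoop does not need a split column; whether parallel imports work at all will depend on the driver.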

Expert Contributor

Hi @Sonu Sahi

Could you elaborate on connectivity between the Hadoop cluster and the Cassandra cluster, especially when they are in different subnets? Which ports and nodes need access?

Thanks,

Padma.

Guru

Hi @PJ

I can't speak to the network specifics of your environment; those should come from your Hadoop and Cassandra admins. The default Cassandra port for client (CQL) connections is 9042, but check with your admin team whether your cluster uses a different one. If you are using HDF/NiFi, you specify that port in the QueryCassandra processor. The NiFi nodes need access to the Cassandra environment over that port, and they also need access to each node in the Hadoop cluster. If you are using Sqoop, connectivity must be open between the Cassandra environment and each node in the Hadoop cluster on the JDBC port that Cassandra is configured to use in your environment (Sqoop jobs can be initiated from the client node, but the connections are actually made from whichever worker nodes run the job's map tasks).

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html

https://community.hortonworks.com/questions/66961/how-sqoop-internally-works.html
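
Once the firewall rules are in place, a quick way to confirm them is to attempt a plain TCP connection from each NiFi node or Hadoop worker to each Cassandra node. The sketch below uses only the Python standard library. The host names in the usage note are placeholders, and 9042 is assumed only because it is the Cassandra default; substitute whatever port your admins confirm.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def report(hosts, port):
    """Print one line per host saying whether the port could be reached."""
    for host in hosts:
        if port_is_reachable(host, port):
            status = "open"
        else:
            status = "blocked or unreachable"
        print(f"{host}:{port} is {status}")
```

For example, calling `report(["cass1.example.com", "cass2.example.com"], 9042)` on a worker node prints one line per Cassandra node. A successful connection only shows that the port is open across the subnets; it does not prove that authentication or the driver will work.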

Hi @Sonu Sahi, I've added NiFi as a service in my HDP 2.6, but I'm having difficulty connecting it to Cassandra, which is installed on my Linux host. Do I have to install Cassandra in my sandbox too?

Can you please take a look here for more details: https://community.hortonworks.com/questions/103622/how-to-use-querycassandra.html