Created 05-15-2017 06:15 PM
Hi,
Can I get some expert advice on the best ways to import Cassandra tables into a Hadoop cluster? Also, which ports need to be open on the Hadoop side for the connection?
Thanks in advance.
Created 05-16-2017 06:53 PM
Hi @PJ, the easiest and least intrusive way is to use Hortonworks DataFlow (HDF, powered by Apache NiFi) to quickly build a data flow that queries Cassandra and sends the results to HDFS. HDF/NiFi includes Cassandra processors that make the integration simple. Take a look at this article about ingesting data into Hadoop from an RDBMS; in your case you would use the QueryCassandra processor instead: https://community.hortonworks.com/articles/87686/rdbms-to-hive-using-nifi-small-medium-tables.html
As always, if you find this post useful, don't forget to accept the answer.
Created 05-16-2017 09:27 PM
Hi @Sonu Sahi
Thanks for your reply. What about using Sqoop import to bring Cassandra tables into HDFS from a Hadoop client?
Created 05-16-2017 10:04 PM
Hi @PJ
If you wanted to use Sqoop instead of HDF/NiFi to import tables, you would need a suitable JDBC driver for Cassandra. I'm not an expert on it, but I think DataStax provides one for their Enterprise software. I've seen quite a few reports of Sqoop not working very well against Cassandra without that JDBC driver, though. I think HDF/NiFi would be the better option.
Created 05-18-2017 09:07 PM
Hi @Sonu Sahi
Can you give more detail on connectivity between the Hadoop cluster and the Cassandra cluster, especially when they are in different subnets? Which ports and nodes need access?
Thanks,
Padma.
Created 05-18-2017 09:43 PM
Hi @PJ
I can't speak to the network setup specifics in your environment, obviously; that should come from your Hadoop and Cassandra admins. I think the default Cassandra port is 9042 (the CQL native transport port), but you can check that with your admin team. If you are using HDF/NiFi, you would specify that port in the QueryCassandra processor. The NiFi nodes need access to the Cassandra environment over that port, and they also need access to each node in the Hadoop cluster. If you are using Sqoop, connectivity must be open between the Cassandra environment and each node in the Hadoop cluster on the JDBC port that Cassandra is configured to use in your environment (Sqoop jobs can be initiated from the client node, but the connections are actually opened from worker nodes in the cluster).
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
https://community.hortonworks.com/questions/66961/how-sqoop-internally-works.html
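Before building the flow, it can help to confirm that the port is actually reachable across the subnets. The following is only a minimal sketch using the Python standard library, meant to be run from each NiFi node (or each Hadoop worker node, for Sqoop). The port 9042 is the assumed default mentioned above, and the host names in the usage comment are hypothetical placeholders; substitute whatever your admins give you.

```python
import socket

# Assumed default CQL native transport port; confirm with your Cassandra admins.
CASSANDRA_CQL_PORT = 9042


def can_connect(host, port=CASSANDRA_CQL_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_hosts(hosts, port=CASSANDRA_CQL_PORT, timeout=3.0):
    """Map each host name to whether the port is reachable from this machine."""
    return {host: can_connect(host, port, timeout) for host in hosts}


# Example usage (hypothetical host names):
# for host, ok in check_hosts(["cassandra-node1.example.com",
#                              "cassandra-node2.example.com"]).items():
#     print(host, "reachable" if ok else "NOT reachable")
```

A successful TCP connection only shows that the firewall and routing allow traffic on that port; it does not verify Cassandra authentication or that the keyspace is readable.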
Created 05-19-2017 09:33 PM
Hi @Sonu Sahi, I've added NiFi as a service in my HDP 2.6, but I'm having some difficulty connecting it to Cassandra, which is installed on my Linux host. I'm wondering whether I have to install Cassandra in my sandbox too?
Can you please take a look here for more details: https://community.hortonworks.com/questions/103622/how-to-use-querycassandra.html