Ambari/Spark/Hadoop Cluster and Elasticsearch Integration

Contributor

I have a Hadoop/Spark cluster set up via Ambari (HDP 2.6.2.0). Now that the cluster is running, I want to feed some data into it. We have an on-premise Elasticsearch cluster (version 5.6), and I want to set up the ES-Hadoop connector (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/doc-sections.html) that Elastic provides, so I can dump some data from Elasticsearch to HDFS.

I grabbed the ZIP file with the JARs and followed the directions in a blog post from CERN:

https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%8...

So far, this seems reasonable, but I have some questions:

1. We have SSL/TLS enabled on our Elasticsearch cluster, so when I run the example query from the blog post, I get an error, as expected. What do I need to do on the Hadoop/Spark side and on the Elasticsearch side to make this communication work?

2. I read that I need to add those JARs to the Spark classpath. Is there a rule of thumb as to where I should put them on my cluster? I assume one of my Spark client nodes, but I am not sure. Also, once I put them there, is there a way to add them to the classpath so that all of my nodes and client nodes have the same classpath? Maybe Ambari provides something for that?

Basically, what I am looking for is to be able to run a query against Elasticsearch from Spark that triggers a job telling Elasticsearch to push "X" amount of data to my HDFS; a sketch of what I am aiming for is below. Based on what I can read on the Elastic site, this is how I think it should work, but the documentation is lacking and has confused both me and my Elastic team. Can someone provide some clear directions, or some clarity around what I need to do to set this up?
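To make the goal concrete, here is a sketch of what I imagine the job looking like, pieced together from the es-hadoop configuration docs. Every host, path, and index name here is a placeholder, and I do not know whether the SSL and classpath parts are right - that is exactly what I am asking:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("es-to-hdfs")
        # Question 2: is this the right way to get the connector jar onto
        # the classpath of every node?
        .config("spark.jars", "/path/to/elasticsearch-hadoop-5.6.x.jar")
        .getOrCreate()
    )

    df = (
        spark.read.format("org.elasticsearch.spark.sql")
        .option("es.nodes", "es-node-1.example.com")
        .option("es.port", "9200")
        # Question 1: what else does SSL/TLS need on both sides?
        .option("es.net.ssl", "true")
        .load("my-index/my-type")
    )

    # Push "X" amount of data to HDFS:
    df.write.parquet("hdfs:///data/elasticsearch/my-index/")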

1 ACCEPTED SOLUTION

Contributor

All, just an update: the ES-Hadoop connector, as its name suggests, mostly benefits Elasticsearch rather than Spark or Hadoop. It lets me connect to the Elasticsearch cluster from spark-shell or PySpark, which is great for ad-hoc queries; for long-term data movement, though, use Apache NiFi. If you are interested, the setup is covered in the Stack Overflow thread where I got some great help:
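For anyone who lands here later, this is roughly what such an ad-hoc query looks like from PySpark. It is a minimal sketch: the hosts, index/type, and query are placeholders, and it assumes the connector jar is already on the classpath (for example via pyspark --jars):

    # Run inside pyspark, where the SparkSession `spark` already exists.
    df = (
        spark.read.format("org.elasticsearch.spark.sql")
        .option("es.nodes", "es-node-1.example.com")
        .option("es.port", "9200")
        .option("es.net.ssl", "true")
        # Push the filter down to Elasticsearch rather than pulling everything:
        .option("es.query", '{"query": {"match": {"status": "error"}}}')
        .load("my-index/my-type")
    )
    df.show()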

https://stackoverflow.com/questions/47399391/using-nifi-to-pull-elasticsearch-indexes?noredirect=1#c...

One issue I ran into: we have SSL set up on Elasticsearch, and even though I was referencing the certificate (I had to convert it from PEM to JKS, since Hadoop/Spark only understand JKS), it wasn't working. After working with Elasticsearch support, they had me add the certificate to the cacerts truststore in my Java installation, and everything worked after that. When running a job across the cluster, I had to do this on each box that runs Spark/Hadoop; in stand-alone mode, the single box was enough. Either way, this can save you a lot of trouble: just add your Elasticsearch certificate to cacerts using keytool.
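For reference, the import we ran on each box looked something like this (the alias and certificate path are examples, and changeit is the JDK's default cacerts password unless yours has been changed):

    keytool -importcert -trustcacerts \
        -alias elasticsearch-ca \
        -file /tmp/elasticsearch-ca.pem \
        -keystore $JAVA_HOME/jre/lib/security/cacerts \
        -storepass changeit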

