
Spark connecting two Hadoop clusters

Explorer

Hi,

I want to source data from two Hadoop clusters and join them in Spark. Is this possible, as shown below?

//data from cluster 1
val erorDF = spark.read.json("hdfs://master:8020/user/ubuntu/error.json")
erorDF.registerTempTable("erorDFTBL")

//data from cluster 2
val erorDF2 = spark.read.json("hdfs://master2:8020/user/ubuntu/error.json")
erorDF2.registerTempTable("erorDFTBL2")
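
And then join the two tables along these lines (the join key "id" below is just a placeholder, since the JSON schema isn't shown):

//hypothetical join across the two temp tables; "id" is an assumed column
val joinedDF = spark.sql("SELECT a.*, b.* FROM erorDFTBL a JOIN erorDFTBL2 b ON a.id = b.id")
joinedDF.show()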


6 REPLIES

Accepted solution

Sure! I just did it (with PySpark in Zeppelin, though).

For my test, I spun up two instances of the HDP sandbox on Azure and put a file into HDFS on each cluster. The code snippet reads each file and counts its lines individually, then concatenates the two data sets and counts the lines of the union.
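
Roughly, the steps look like this (sketched here in Scala to match the rest of the thread; the sandbox hostnames and file paths are assumed placeholders):

//read the test file from each cluster (hostnames and paths are placeholders)
val lines1 = sc.textFile("hdfs://sandbox1:8020/tmp/test.txt")
val lines2 = sc.textFile("hdfs://sandbox2:8020/tmp/test.txt")

//count the lines in each file individually
println(lines1.count())
println(lines2.count())

//concatenate the two data sets and count the lines of the union
println(lines1.union(lines2).count())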

New Contributor

Please let me know if it is possible to access a Hive table present across multiple clusters (on a Hortonworks on-premises cluster).

Explorer

Thank you, Becker. Is there any setup I need to do in Zeppelin? I am running my Zeppelin on cluster 1.


No additional setup is required - the Spark libraries are automatically imported and the Spark context is provided implicitly by Zeppelin. For any additional dependencies that your project needs, use %dep - see the documentation at https://zeppelin.apache.org/docs/latest/interpreter/spark.html.
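
For example, loading an extra library from Maven looks like this (the coordinates below are only an illustration; note that %dep must run before the Spark interpreter starts):

%dep
z.load("org.apache.commons:commons-csv:1.5")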

Explorer

Tested the code below in AWS. Looks good. Thank you.

//read error JSON file from cluster 1
val erorDF = spark.read.json("hdfs://master:8020/user/ubuntu/error.json")
erorDF.registerTempTable("erorDFTBL")

//read file from cluster 2
val erorDF2 = spark.read.json("hdfs://master2:8020/user/ubuntu/errors")
erorDF2.registerTempTable("erorDFTBL2")

Community Manager

@Chandraprabu As this is an older post, we recommend starting a new thread. A new thread will give you the opportunity to provide details specific to your environment, which could help others give a more accurate answer to your question.


Cy Jervis, Manager, Community Program
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.