How to connect to my Hadoop/Spark cluster (HDP) in the cloud from my Windows Eclipse environment
- Labels: Apache Spark
Created ‎07-07-2018 08:38 AM
Hello,
I have set up my HDP cluster on Azure VMs with 3 nodes: one master and 2 slaves.
I connect to the master using the PuTTY client, open spark-shell, and run Spark jobs; everything works fine.
Now I have set up the Eclipse Scala IDE to develop Spark applications, but I would like to connect directly from the Scala IDE to the cluster HDFS where my data is stored and run the program from there.
I have set up the configuration below:
val conf = new SparkConf() .setAppName("SparkApp") .setMaster("yarn-client") .set("spark.hadoop.fs.defaultFS", "hdfs://13.76.44.223") .set("spark.hadoop.dfs.nameservices", "13.76.44.223:8020") .set("spark.hadoop.yarn.resourcemanager.hostname", "13.76.44.223") .set("spark.hadoop.yarn.resourcemanager.address", "13.76.44.223:8050").set("spark.driver.host","127.0.0.1") //this my local ip .set("spark.local.ip", "13.76.44.223") //cdh vmnat ip .set("spark.yarn.jar", "hdfs://13.76.44.223:8020/usr/hdp/2.6.5.0-292/spark2/jars/*.jar") .set("mapreduce.app-submission.cross-platform", "true")
val sc = new SparkContext(conf) val file = sc.textFile("/user/hdfs/file.txt")
val words = file.flatMap { line => line.split(" ") }
val wordsmap = words.map { word => (word,1) }
val wordcount = wordsmap.reduceByKey((x,y)=> x+y)
wordcount.collect.foreach(println)
If I run the above program from Eclipse I get:
"Error initializing SparkContext. org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hdfs/.sparkStaging/application_1530938503330_0168/__spark_conf__.zip could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation."
Note: I have opened all the necessary ports used in the configuration above.
Can anyone please help me? Am I missing anything?
Created ‎07-07-2018 11:15 AM
@mehul godhaniya You need to copy the configuration files hdfs-site.xml, core-site.xml, yarn-site.xml, and mapred-site.xml from the Azure VMs to your machine and place them in the Eclipse project's resources directory so that they are added to the classpath. From the classpath those files will be read automatically as the application's configuration.
Once that is done you can simplify your code, especially the SparkConf creation, and avoid any missing configuration:
val conf = new SparkConf() .setAppName("SparkApp").setMaster("yarn-client") val sc = new SparkContext(conf) val file = sc.textFile("/user/hdfs/file.txt") val words = file.flatMap { line => line.split(" ") } val wordsmap = words.map { word => (word,1) } val wordcount = wordsmap.reduceByKey((x,y)=> x+y) wordcount.collect.foreach(println)
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
Created ‎07-09-2018 12:46 PM
@mehul godhaniya The error indicates Spark is not able to write __spark_conf__.zip to HDFS. Have you checked that your datanodes are up and running and that you can access HDFS correctly from your machine?
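As a rough sketch of such a check (the object name HdfsAccessCheck is made up here), you can use the Hadoop FileSystem API from the same Eclipse project: listing a directory only talks to the namenode, while reading a file also needs a working connection to a datanode, which is exactly the part that fails in this error:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object HdfsAccessCheck {
  def main(args: Array[String]): Unit = {
    // Uses core-site.xml / hdfs-site.xml from the classpath
    val fs = FileSystem.get(new Configuration())

    // Step 1: listing a directory only requires the namenode to be reachable
    fs.listStatus(new Path("/user/hdfs")).foreach(status => println(status.getPath))

    // Step 2: reading file contents requires connecting to a datanode as well
    val in = fs.open(new Path("/user/hdfs/file.txt"))
    try {
      IOUtils.copyBytes(in, System.out, 4096, false)
    } finally {
      in.close()
      fs.close()
    }
  }
}

If step 1 works but step 2 fails, the problem is the connectivity between your machine and the datanodes rather than the namenode.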
Created ‎07-08-2018 06:41 AM
Hello @Felix Albani, thanks for your response. I have downloaded all the files from my VMs and added them to the resources folder in my project. However, I still get the error "org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hdfs/.sparkStaging/application_1531024403580_0103/__spark_conf__.zip could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation."
System.setProperty("SPARK_YARN_MODE", "true")
System.setProperty("HADOOP_USER_NAME", "hdfs") //using cloudera user
val conf = new SparkConf() .setAppName("SparkApp") .
setMaster("yarn-client")
.set("spark.yarn.jars", "hdfs://104.215.158.249:8020/usr/hdp/2.6.5.0-292/spark2/jars/*.jar") ///user/hdfs/file.txt
val sc = new SparkContext(conf)
val file = sc.textFile("/user/hdfs/file.txt")
val words = file.flatMap { line => line.split(" ") }
val wordsmap = words.map { word => (word,1) }
val wordcount = wordsmap.reduceByKey((x,y)=> x+y)
wordcount.collect.foreach(println)
sc.stop()
This is the modified code.
Any help would be appreciated.
Created ‎07-14-2018 08:44 AM
@Felix Albani Thanks for the answer. My datanodes are up and running, and everything works fine when I run commands from the cluster itself.
However, the problem occurs when I try to use Eclipse on my local Windows machine and connect to the cluster with Spark. My cluster is in Azure with 3 VMs.
Below is my understanding:
1. All the VMs have a public and an internal IP.
2. The namenode connection succeeds over the public IP.
3. However, the datanodes cannot be reached over their public IPs, because when the namenode provides the list of nodes to write data to, it returns their internal IP addresses.
Below is the exception:
Excluding datanode DatanodeInfoWithStorage[10.0.0.8:50010,DS-fa8a8432-25c6-47af-9ffb-cd8aba0ccc77,DISK]
The IP address 10.0.0.8 is the internal IP address of the VM.
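One common client-side mitigation for this public/internal IP split (it is not confirmed anywhere in this thread, so treat it purely as an assumption to test) is to tell the HDFS client to connect to datanodes by hostname instead of by the internal IPs the namenode reports, and to make those hostnames resolve to the public IPs on the Windows machine, e.g. via the local hosts file. With Spark this can be passed through the spark.hadoop. prefix, as a sketch:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SparkApp")
  .setMaster("yarn-client")
  // Assumption: ask the HDFS client to use datanode hostnames rather than the
  // internal 10.0.0.x addresses returned by the namenode; each hostname must
  // resolve to the corresponding public IP on this Windows machine, and the
  // datanode transfer port (50010 in the exception above) must be reachable.
  .set("spark.hadoop.dfs.client.use.datanode.hostname", "true")

The same effect could be achieved by setting dfs.client.use.datanode.hostname in the hdfs-site.xml copied into the project's resources directory.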
