Created 07-07-2018 08:38 AM
Hello,
I have set up an HDP cluster on Azure VMs with 3 nodes: one master and two slaves.
I connect to the master with the PuTTY client and open spark-shell; Spark jobs run fine there.
Now I have set up Eclipse Scala IDE to develop a Spark application, and I would like to connect directly from the IDE to the cluster's HDFS, where my data is stored, and run the program.
I have set up the configuration below:
val conf = new SparkConf()
  .setAppName("SparkApp")
  .setMaster("yarn-client")
  .set("spark.hadoop.fs.defaultFS", "hdfs://13.76.44.223")
  .set("spark.hadoop.dfs.nameservices", "13.76.44.223:8020")
  .set("spark.hadoop.yarn.resourcemanager.hostname", "13.76.44.223")
  .set("spark.hadoop.yarn.resourcemanager.address", "13.76.44.223:8050")
  .set("spark.driver.host", "127.0.0.1") // this is my local ip
  .set("spark.local.ip", "13.76.44.223") // cdh vmnat ip
  .set("spark.yarn.jar", "hdfs://13.76.44.223:8020/usr/hdp/2.6.5.0-292/spark2/jars/*.jar")
  .set("mapreduce.app-submission.cross-platform", "true")
val sc = new SparkContext(conf)
val file = sc.textFile("/user/hdfs/file.txt")
val words = file.flatMap { line => line.split(" ") }
val wordsmap = words.map { word => (word,1) }
val wordcount = wordsmap.reduceByKey((x,y)=> x+y)
wordcount.collect.foreach(println)
If I run the above program from Eclipse, I get:
1. "Error initializing SparkContext. org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hdfs/.sparkStaging/application_1530938503330_0168/__spark_conf__.zip could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation. "
2. "org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hdfs/.sparkStaging/application_1530938503330_0168/__spark_conf__.zip could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation."
Note: I have opened all the necessary ports used in the above config.
Can anyone please help me figure out whether I am missing anything?
Created 07-07-2018 11:15 AM
@mehul godhaniya You need to copy the configuration files hdfs-site.xml, core-site.xml, yarn-site.xml, and mapred-site.xml from the Azure VMs to your machine and place them in the Eclipse project's resource directory so that they are added to the classpath. The application will then automatically read those files from the classpath as its configuration.
Once that is done you can simplify your code, especially the SparkConf creation, and avoid any missing configuration:
val conf = new SparkConf().setAppName("SparkApp").setMaster("yarn-client")
val sc = new SparkContext(conf)
val file = sc.textFile("/user/hdfs/file.txt")
val words = file.flatMap { line => line.split(" ") }
val wordsmap = words.map { word => (word,1) }
val wordcount = wordsmap.reduceByKey((x,y)=> x+y)
wordcount.collect.foreach(println)
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
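Before rerunning the job, it can help to confirm that the copied files really ended up on the classpath. The following is a minimal sketch using only the JDK class loader; the object name `ClasspathConfigCheck` and the helper `missing` are hypothetical names chosen for illustration, and the file list is the one given in the answer above.

```scala
// Hypothetical helper: reports which Hadoop client config files are NOT visible on the classpath.
object ClasspathConfigCheck {
  // File names taken from the answer above.
  val expected: Seq[String] =
    Seq("core-site.xml", "hdfs-site.xml", "yarn-site.xml", "mapred-site.xml")

  // Returns the names for which the class loader finds no resource.
  def missing(names: Seq[String], loader: ClassLoader): Seq[String] =
    names.filter(name => loader.getResource(name) == null)

  def main(args: Array[String]): Unit = {
    val absent = missing(expected, getClass.getClassLoader)
    if (absent.isEmpty) println("All expected configuration files are on the classpath.")
    else println(s"Missing from classpath: ${absent.mkString(", ")}")
  }
}
```

If any file is reported missing, the resource directory is probably not configured as a source folder in the Eclipse project, so the application would fall back to default settings.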
Created 07-09-2018 12:46 PM
@mehul godhaniya The error indicates that Spark is not able to write __spark_conf__.zip to HDFS. Have you checked that your datanodes are up and running and that you can access HDFS correctly from your machine?
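One rough way to test access from the local machine is a plain TCP connection attempt to each endpoint the client has to reach. This is only a sketch: `PortCheck` and `canConnect` are hypothetical names, the addresses and ports are the ones quoted elsewhere in this thread (8020 and 8050 on the public IP, 50010 on the datanode address from the exception), and a successful connect only shows the port is reachable, not that HDFS writes will succeed.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper: checks whether a TCP connection to host:port can be opened.
object PortCheck {
  def canConnect(host: String, port: Int, timeoutMs: Int = 3000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      // Covers refused connections, timeouts, and unresolved hosts.
      case _: IOException => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Endpoints as quoted in this thread; adjust for your own cluster.
    val targets = Seq(
      ("13.76.44.223", 8020), // namenode RPC
      ("13.76.44.223", 8050), // YARN resource manager
      ("10.0.0.8", 50010)     // datanode data transfer (address from the exception)
    )
    targets.foreach { case (host, port) =>
      val status = if (canConnect(host, port)) "reachable" else "NOT reachable"
      println(s"$host:$port is $status")
    }
  }
}
```

If the namenode port is reachable but the datanode address is not, the client can create the file entry but cannot stream any blocks, which would be consistent with the "could only be replicated to 0 nodes" message.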
Created 07-08-2018 06:41 AM
Hello @Felix Albani, thanks for your response. I have downloaded all the files from my VMs and added them to the resources folder in my project. However, I still get the error "org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hdfs/.sparkStaging/application_1531024403580_0103/__spark_conf__.zip could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation."
System.setProperty("SPARK_YARN_MODE", "true")
System.setProperty("HADOOP_USER_NAME", "hdfs") //using cloudera user
val conf = new SparkConf()
  .setAppName("SparkApp")
  .setMaster("yarn-client")
  .set("spark.yarn.jars", "hdfs://104.215.158.249:8020/usr/hdp/2.6.5.0-292/spark2/jars/*.jar") // /user/hdfs/file.txt
val sc = new SparkContext(conf)
val file = sc.textFile("/user/hdfs/file.txt")
val words = file.flatMap { line => line.split(" ") }
val wordsmap = words.map { word => (word,1) }
val wordcount = wordsmap.reduceByKey((x,y)=> x+y)
wordcount.collect.foreach(println)
sc.stop()
This is the modified code.
Any help would be appreciated.
Created 07-14-2018 08:44 AM
@Felix Albani Thanks for the answer. My datanodes are up and running, and everything works fine when I run any command from within the cluster.
However, the problem occurs when I use Eclipse on my local Windows machine and connect to the cluster using Spark. My cluster is in Azure with 3 VMs.
Below is my understanding:
1. All the VMs have a public and an internal IP.
2. The namenode is reached successfully via its public IP.
3. However, the datanodes cannot be reached via a public IP, because when the namenode provides the list of datanode addresses to write data to, it provides their internal IP addresses.
Below is the exception:
Excluding datanode DatanodeInfoWithStorage[10.0.0.8:50010,DS-fa8a8432-25c6-47af-9ffb-cd8aba0ccc77,DISK]
The IP address 10.0.0.8 is the internal IP address of the VM.
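One client-side setting that is commonly suggested for this situation is the HDFS property `dfs.client.use.datanode.hostname`, which asks the client to connect to datanodes by hostname instead of the IP address returned by the namenode. The sketch below passes it through Spark's `spark.hadoop.` prefix, as the original configuration in this thread already does for other Hadoop properties. It rests on an assumption the thread does not confirm: that the datanode hostnames resolve, from the local Windows machine, to addresses that are reachable from outside Azure (for example via entries in the local hosts file), and that the datanode port is open on those addresses.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkAppWithDatanodeHostnames {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkApp")
      .setMaster("yarn-client")
      // Assumption: datanode hostnames resolve to publicly reachable addresses
      // from this machine; otherwise this setting alone does not help.
      .set("spark.hadoop.dfs.client.use.datanode.hostname", "true")

    val sc = new SparkContext(conf)
    try {
      val file = sc.textFile("/user/hdfs/file.txt")
      val wordcount = file
        .flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      wordcount.collect.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The same property could instead be placed in the hdfs-site.xml copied into the project's resource directory, which keeps the code free of cluster-specific settings.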