Created on 05-30-2016 10:51 AM - edited 09-16-2022 03:22 AM
I have a VMware Cloudera image (cdh-5.4.2.0) running CentOS 6, and I am using OS X as my host machine. I have modified my /etc/hosts file as follows:
MacBook-Pro-Retina-de-Alonso:bin aironman$ cat /etc/hosts
127.0.0.1       localhost
127.0.0.1       my-cassandra-node-001
255.255.255.255 broadcasthost
::1             localhost
192.168.30.137  quickstart.cloudera quickstart
You can see that I can reach the VMware machine from the host machine:
$:bin aironman$ ping quickstart.cloudera
PING quickstart.cloudera (192.168.30.137): 56 data bytes
64 bytes from 192.168.30.137: icmp_seq=0 ttl=64 time=0.293 ms
64 bytes from 192.168.30.137: icmp_seq=1 ttl=64 time=0.273 ms
64 bytes from 192.168.30.137: icmp_seq=2 ttl=64 time=0.207 ms
64 bytes from 192.168.30.137: icmp_seq=3 ttl=64 time=0.240 ms
64 bytes from 192.168.30.137: icmp_seq=4 ttl=64 time=0.402 ms
^C
--- quickstart.cloudera ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.207/0.283/0.402/0.066 ms
And I can reach port 8020 on that machine:
$:bin aironman$ telnet quickstart.cloudera 8020
Trying 192.168.30.137...
Connected to quickstart.cloudera.
Escape character is '^]'.
I can run an ls command in the VMware machine:
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
-rw-r--r--   1 cloudera cloudera   16906296 2016-05-30 11:29 /user/cloudera/ratings.csv
I can read its content:
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
568454
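For completeness, the same check can be done from plain application code rather than the CLI. A minimal sketch, assuming the stock Hadoop client libraries are on the classpath (the object name is mine):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical standalone check that the NameNode is reachable from app code,
// using the same address as the Spark job below.
object HdfsCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")
    val fs = FileSystem.get(conf)
    val path = new Path("/user/cloudera/ratings.csv")
    println(s"exists=${fs.exists(path)}, size=${fs.getFileStatus(path).getLen} bytes")
  }
}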
The code is quite simple; it just tries to map the file's content:
val ratingFile = "hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"

case class AmazonRating(userId: String, productId: String, rating: Double)

val NumRecommendations = 10
val MinRecommendationsPerUser = 10
val MaxRecommendationsPerUser = 20
val MyUsername = "myself"
val NumPartitions = 20

println("Using this ratingFile: " + ratingFile)

// first create an RDD out of the rating file
val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}

// only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products
val trainingRatings = rawTrainingRatings
  .groupBy(_.userId)
  .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
  .flatMap(_._2)
  .repartition(NumPartitions)
  .cache()

println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")
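One difference worth keeping in mind between the two environments: in the spark-shell, sc is created for you with the shell's configuration, while a packaged app has to build its own SparkContext. A minimal sketch of the driver setup for the packaged version (the app name is an assumption; the master URL mirrors the one the worker is started with below):

import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell `sc` already exists; in an sbt-pack'ed app it must be created,
// and a wrong or unset master here would behave differently from the shell.
val conf = new SparkConf()
  .setAppName("AmazonRatingsJob")                 // hypothetical name
  .setMaster("spark://quickstart.cloudera:7077")  // same master the worker points at
val sc = new SparkContext(conf)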
I am getting this message:
Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0 ratings out of 568454
whereas if I run the exact same code in the spark-shell, I get this message:
Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454
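To narrow down why the programmatic run keeps 0 ratings while the shell keeps 73279, a hedged diagnostic (reusing the variable names from the code above) would be to print how many users fall inside the [Min, Max) window in each environment:

// Count users whose number of ratings falls inside the filter window,
// to see whether the groupBy/filter step is where the two runs diverge.
val usersInWindow = rawTrainingRatings
  .groupBy(_.userId)
  .map { case (_, ratings) => ratings.size }
  .filter(n => MinRecommendationsPerUser <= n && n < MaxRecommendationsPerUser)
  .count()
println(s"Users with [$MinRecommendationsPerUser, $MaxRecommendationsPerUser) ratings: $usersInWindow")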
Why does it work fine within the spark-shell but not when run programmatically against the VMware image?
Thank you for reading this far.
Created 06-03-2016 02:40 AM
Hi Sean,
doing this causes the spark-worker not to run and Hue to stop working properly; I think they internally need to use quickstart.cloudera.
If I restart the VMware image and redo what I did in /etc/init.d/cloudera-quickstart-init with the call to cloudera-quickstart-ip, I can log in to Hue again, and the spark-history-server runs properly.
I have noticed something when running ps xa | grep spark:
[cloudera@quickstart ~]$ ps xa | grep spark
6330 ?      Sl   0:03 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/parquet/lib/*:/usr/lib/avro/lib/* -Dspark.deploy.defaultCores=4 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
6499 ?      Sl   0:04 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/parquet/lib/*:/usr/lib/avro/lib/* -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker spark://quickstart.cloudera:7077
6674 ?      Sl   0:05 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/parquet/lib/*:/usr/lib/avro/lib/* -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.history.HistoryServer
6915 pts/0  R+   0:00 grep spark
As you can see, the spark-master runs with 4 cores (-Dspark.deploy.defaultCores=4) and no cores explicitly dedicated to the worker. Is that normal?
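From what I have read, -Dspark.deploy.defaultCores=4 is not a reservation for the master daemon itself: on a standalone master it is the default cap on cores handed to each application that does not set spark.cores.max. A hedged sketch of how an application would override that default (the value is just an example):

import org.apache.spark.SparkConf

// spark.deploy.defaultCores only applies when the application sets no cap;
// setting spark.cores.max explicitly takes precedence on a standalone master.
val conf = new SparkConf()
  .setMaster("spark://quickstart.cloudera:7077")
  .set("spark.cores.max", "2")  // example value, not a recommendation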
[cloudera@quickstart ~]$ sudo service spark-worker status
Spark worker is running                                    [  OK  ]
[cloudera@quickstart ~]$ sudo service spark-master status
Spark master is running                                    [  OK  ]
[cloudera@quickstart ~]$ sudo service spark-history-server status
Spark history-server is running                            [  OK  ]
As you can see, it looks normal, but examining http://quickstart.cloudera:18080/ (the spark-master web UI), I can see this:
URL: spark://192.168.30.137:7077
REST URL: spark://192.168.30.137:6066 (cluster mode)
Alive Workers: 0
Cores in use: 0 Total, 0 Used
Memory in use: 0.0 B Total, 0.0 B Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Zero cores in use out of zero in total, and no memory! That's strange, because the spark-master is launched with 4 cores (-Dspark.deploy.defaultCores=4) and 1 GB (-Xms1g -Xmx1g -XX:MaxPermSize=256m).
Then there is this output from the spark-worker:
ID: worker-20160603111341-192.168.30.137-7078
Master URL:
Cores: 4 (0 Used)
Memory: 6.7 GB (0.0 B Used)
The Master URL is not set, and the worker advertises 4 cores and 6.7 GB, even though the spark-worker is running with this setup:
-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker spark://quickstart.cloudera:7077
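Since the worker shows an empty Master URL while the master shows 0 alive workers, one hedged way to watch registration from code is to poll the master web UI's JSON status feed (assuming the same port 18080 as the UI above, and that the standalone master's /json endpoint is enabled, as it normally is):

import scala.io.Source

// Fetch the standalone master's status as JSON; an empty "workers" array
// means no worker ever managed to register with this master.
val json = Source.fromURL("http://quickstart.cloudera:18080/json").mkString
println(json)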
What do you think? What can I do in order to continue developing my project? That is what I want: to use this VMware image for development, loading a tiny file (only 16 MB) from HDFS.
What annoys me most is that the code works perfectly in the spark-shell of the virtual image, but when I try to run it programmatically, building the Unix launcher with the sbt-pack command, it does not work.
Regards
Alonso
Created 06-03-2016 08:32 AM
It looks like the networking issue is resolved with the changes to the hosts files. For the remaining issues you may have better luck posting in the Spark forum specifically (http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/bd-p/Spark) - I suspect outside of that forum there won't be many readers familiar with the trickier parts of Spark configuration and sbt-pack in particular.