Support Questions

I cannot programmatically access a file within a CDH image running in VMware


I have a VMware Cloudera image, cdh-, running CentOS 6, and I am using OS X as my host machine. I have modified my /etc/hosts file with a line like this:

MacBook-Pro-Retina-de-Alonso:bin aironman$ cat /etc/hosts
 localhost my-cassandra-node-001 broadcasthost
::1 localhost quickstart.cloudera quickstart

You can see that I can reach the VMware machine from the host machine:

$:bin aironman$ ping quickstart.cloudera
PING quickstart.cloudera ( 56 data bytes
64 bytes from icmp_seq=0 ttl=64 time=0.293 ms
64 bytes from icmp_seq=1 ttl=64 time=0.273 ms
64 bytes from icmp_seq=2 ttl=64 time=0.207 ms
64 bytes from icmp_seq=3 ttl=64 time=0.240 ms
64 bytes from icmp_seq=4 ttl=64 time=0.402 ms
--- quickstart.cloudera ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.207/0.283/0.402/0.066 ms

And I can reach port 8020 on that machine:

$:bin aironman$ telnet quickstart.cloudera 8020
Connected to quickstart.cloudera.
Escape character is '^]'.

I can run an ls command in the VMware machine:

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
-rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 /user/cloudera/ratings.csv

I can read its contents:

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l

The code is quite simple; it just tries to map its contents:


val ratingFile="hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"

case class AmazonRating(userId: String, productId: String, rating: Double)

val NumRecommendations = 10
val MinRecommendationsPerUser = 10
val MaxRecommendationsPerUser = 20
val MyUsername = "myself"
val NumPartitions = 20

println("Using this ratingFile: " + ratingFile)
// first create an RDD out of the rating file
val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}

// only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products
val trainingRatings = rawTrainingRatings
  .groupBy(_.userId)
  .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
  .flatMap(_._2)
  .repartition(NumPartitions)
  .cache()

println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")
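For context, when this code runs outside the spark-shell (e.g. from an sbt-pack launcher), sc is not provided automatically and has to be built by the application itself. A minimal sketch of that setup, assuming the standalone master URL spark://quickstart.cloudera:7077 and the Spark 1.6-era API used by this CDH version, would look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RatingsApp {
  def main(args: Array[String]): Unit = {
    // In the spark-shell, "sc" already exists; a packaged app must create it.
    // The master URL and HDFS URI below are assumptions matching the VM defaults.
    val conf = new SparkConf()
      .setAppName("RatingsApp")
      .setMaster("spark://quickstart.cloudera:7077")
    val sc = new SparkContext(conf)

    val ratingFile = "hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"

    // Sanity check: confirm the packaged app can actually read the file
    // before applying any filtering logic.
    val count = sc.textFile(ratingFile).count()
    println(s"Read $count lines from $ratingFile")

    sc.stop()
  }
}
```

If the packaged app cannot reach the master or the workers, actions like count() will hang or return nothing, which is one common reason the same code behaves differently in the shell and in a standalone build.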

I am getting this message:


Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0 ratings out of 568454

because if I run the exact same code in the spark-shell, I get this message:

Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454

Why does it work fine within the spark-shell but not when run programmatically against the VMware image?

Thank you for reading this far.


Master Collaborator
Also, note that there's a script that tries to detect a public IP and set
up the hosts file for you on boot. If you're going to edit it manually, you
probably want to comment out the line in
/etc/init.d/cloudera-quickstart-init that calls
/usr/bin/cloudera-quickstart-ip. I don't remember which version that was
added in. It might have been 5.5 - so if your VM doesn't have
/usr/bin/cloudera-quickstart-ip you can ignore this post and safely edit
the hosts file anyway.

Hi Sean, that change does not have any effect: spark-worker doesn't run, and even if I try to restart it manually with sudo service spark-worker restart, it shows as failed as soon as I check its status.

Also, Hue does not work; I think this happens because it internally uses quickstart.cloudera to talk to other components...

I am starting to think that this VMware image is useless for developing anything related to Spark; I cannot run anything...

Hi Sean

this is how it looks /etc/hosts by default, when the image is restarted:

[cloudera@quickstart ~]$ cat /etc/hosts
 quickstart.cloudera quickstart localhost localhost.domain


Hi Sean,


doing this causes spark-worker not to run, and Hue stops working properly; I think it internally needs to use quickstart.cloudera.

If I restart the VMware image and revert what I did in /etc/init.d/cloudera-quickstart-init (restoring the call to cloudera-quickstart-ip), I can log in to Hue again, and the spark-history-server runs properly:


I have noticed the following when running ps xa | grep spark:


[cloudera@quickstart ~]$ ps xa | grep spark
 6330 ?        Sl     0:03 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.deploy.defaultCores=4 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
 6499 ?        Sl     0:04 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker spark://quickstart.cloudera:7077
 6674 ?        Sl     0:05 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.history.HistoryServer
 6915 pts/0    R+     0:00 grep spark

As you can see, spark-master runs with 4 cores (-Dspark.deploy.defaultCores=4) and no cores dedicated to the worker; is that normal?


[cloudera@quickstart ~]$ sudo service spark-worker status
Spark worker is running                                    [  OK  ]
[cloudera@quickstart ~]$ sudo service spark-master status
Spark master is running                                    [  OK  ]
[cloudera@quickstart ~]$ sudo service spark-history-server status
Spark history-server is running                            [  OK  ]

As you can see, it looks normal, but examining http://quickstart.cloudera:18080/, the spark-master UI, I can see that:


    URL: spark://
    REST URL: spark:// (cluster mode)
    Alive Workers: 0
    Cores in use: 0 Total, 0 Used
    Memory in use: 0.0 B Total, 0.0 B Used
    Applications: 0 Running, 0 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

Zero cores in use out of zero in total, and no memory! That's strange, because spark-master is started with -Dspark.deploy.defaultCores=4 and 1 GB of heap (-Xms1g -Xmx1g -XX:MaxPermSize=256m).


Then, you can see this output from spark-worker:


    ID: worker-20160603111341-
    Master URL:
    Cores: 4 (0 Used)
    Memory: 6.7 GB (0.0 B Used)

The master URL is not set up, and the worker reports 4 cores and 6.7 GB, even though spark-worker is running with this setup:


-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker spark://quickstart.cloudera:7077
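On the cores question above: spark.deploy.defaultCores is only the default cap for applications that do not set spark.cores.max themselves; it is not a reservation by the master. The worker advertising 4 cores / 6.7 GB is simply what it offers to applications (the -Xms1g/-Xmx1g flags are the worker daemon's own JVM heap, not executor memory). A hedged sketch of requesting resources explicitly from the application side, using standard Spark 1.x properties with illustrative values:

```scala
import org.apache.spark.SparkConf

// spark.deploy.defaultCores only caps apps that do not set spark.cores.max;
// an application can ask the standalone cluster for resources explicitly.
val conf = new SparkConf()
  .setAppName("RatingsApp")
  .set("spark.cores.max", "2")        // total cores to take from the cluster
  .set("spark.executor.memory", "1g") // heap per executor, within the worker's 6.7 GB
```

With nothing set, an application on this cluster would be capped at the master's defaultCores value of 4.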

What do you think? What can I do to continue developing my project? That is what I want: to use this VMware image to develop the project, using HDFS to load a tiny file of only 16 MB.


What annoys me most is that the code works perfectly in the spark-shell of the virtual image, but when I try to run it programmatically, building the Unix launcher with the sbt-pack command, it does not work.






It looks like the networking issue is resolved with the changes to the hosts files. For the remaining issues you may have better luck posting in the Spark forum specifically ( - I suspect that outside of that forum there won't be many readers familiar with the trickier parts of Spark configuration, and sbt-pack in particular.


Thank you, Sean, but the link that you provided is returning this message:

The core node you are trying to access was not found, it may have been deleted. Please refresh your original page and try the operation again.

EDIT: I have now noticed that there is an extra ")".