Support Questions

I cannot programmatically access a file within a CDH image running in VMware


I have a VMware Cloudera image, cdh-, running CentOS 6, and I am using OS X as my host machine. I have modified my /etc/hosts file with a line like this:

MacBook-Pro-Retina-de-Alonso:bin aironman$ cat /etc/hosts
 localhost my-cassandra-node-001 broadcasthost
::1 localhost quickstart.cloudera quickstart

You can see that I can reach the VMware machine from the host machine:

$:bin aironman$ ping quickstart.cloudera
PING quickstart.cloudera ( 56 data bytes
64 bytes from icmp_seq=0 ttl=64 time=0.293 ms
64 bytes from icmp_seq=1 ttl=64 time=0.273 ms
64 bytes from icmp_seq=2 ttl=64 time=0.207 ms
64 bytes from icmp_seq=3 ttl=64 time=0.240 ms
64 bytes from icmp_seq=4 ttl=64 time=0.402 ms
--- quickstart.cloudera ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.207/0.283/0.402/0.066 ms

And I can reach port 8020 on that machine:

$:bin aironman$ telnet quickstart.cloudera 8020
Connected to quickstart.cloudera.
Escape character is '^]'.

I can run an ls command in the VMware machine:

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
-rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 /user/cloudera/ratings.csv

I can read its contents:

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l

The code is quite simple; it just tries to map its contents:


val ratingFile="hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"

case class AmazonRating(userId: String, productId: String, rating: Double)

val NumRecommendations = 10
val MinRecommendationsPerUser = 10
val MaxRecommendationsPerUser = 20
val MyUsername = "myself"
val NumPartitions = 20

println("Using this ratingFile: " + ratingFile)
// first create an RDD out of the rating file
val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}

// only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products
val trainingRatings = rawTrainingRatings
  .groupBy(_.userId)
  .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
  .flatMap(_._2)
  .repartition(NumPartitions)
  .cache()

println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")
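For context, when this code runs outside the spark-shell (e.g. from an sbt-pack launcher), sc is not provided automatically and has to be built by the application itself. A minimal sketch of that setup, assuming the standalone master URL spark://quickstart.cloudera:7077 and the Spark 1.6-era API used by this CDH version, would look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RatingsApp {
  def main(args: Array[String]): Unit = {
    // In the spark-shell, "sc" already exists; a packaged app must create it.
    // The master URL and HDFS URI below are assumptions matching the VM defaults.
    val conf = new SparkConf()
      .setAppName("RatingsApp")
      .setMaster("spark://quickstart.cloudera:7077")
    val sc = new SparkContext(conf)

    val ratingFile = "hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv"

    // Sanity check: confirm the packaged app can actually read the file
    // before applying any filtering logic.
    val count = sc.textFile(ratingFile).count()
    println(s"Read $count lines from $ratingFile")

    sc.stop()
  }
}
```

If the packaged app cannot reach the master or the workers, actions like count() will hang or return nothing, which is one common reason the same code behaves differently in the shell and in a standalone build.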

I am getting this message:


Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 0 ratings out of 568454

because if I run the exact same code in the spark-shell, I get this message:

Parsed hdfs://quickstart.cloudera:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454

Why does it work fine within the spark-shell but not when run programmatically against the VMware image?

Thank you for reading this far.


Master Collaborator
Also, note that there's a script that tries to detect a public IP and set
up the hosts file for you on boot. If you're going to edit it manually, you
probably want to comment out the line in
/etc/init.d/cloudera-quickstart-init that calls
/usr/bin/cloudera-quickstart-ip. I don't remember which version that was
added in. It might have been 5.5 - so if your VM doesn't have
/usr/bin/cloudera-quickstart-ip you can ignore this post and safely edit
the hosts file anyway.

Hi Sean, that change does not have any effect: spark-worker doesn't run, and even if I try to restart it manually with sudo service spark-worker restart, it shows as failed as soon as I check its status.

Also, Hue does not work; I think this happens because it internally uses quickstart.cloudera to talk to other components...

I am starting to think that this VMware image is useless for developing anything related to Spark; I cannot run anything...

Hi Sean

this is how it looks /etc/hosts by default, when the image is restarted:

[cloudera@quickstart ~]$ cat /etc/hosts
 quickstart.cloudera quickstart localhost localhost.domain


Hi Sean,


doing this causes spark-worker not to run, and Hue stops working properly; I think it internally needs to use quickstart.cloudera.

If I restart the VMware image and revert what I did in /etc/init.d/cloudera-quickstart-init (restoring the call to cloudera-quickstart-ip), I can log in to Hue again, and the spark-history-server runs properly:


I have noticed the following when running ps xa | grep spark:


[cloudera@quickstart ~]$ ps xa | grep spark
 6330 ?        Sl     0:03 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.deploy.defaultCores=4 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
 6499 ?        Sl     0:04 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /usr/lib/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker spark://quickstart.cloudera:7077
 6674 ?        Sl     0:05 /usr/java/jdk1.7.0_67-cloudera/bin/java -cp /etc/spark/conf/:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar:/etc/hadoop/conf/:/usr/lib/spark/lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/* -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory -Dspark.history.ui.port=18088 -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.history.HistoryServer
 6915 pts/0    R+     0:00 grep spark

As you can see, spark-master runs with 4 cores (-Dspark.deploy.defaultCores=4) and no cores dedicated to the worker; is that normal?


[cloudera@quickstart ~]$ sudo service spark-worker status
Spark worker is running                                    [  OK  ]
[cloudera@quickstart ~]$ sudo service spark-master status
Spark master is running                                    [  OK  ]
[cloudera@quickstart ~]$ sudo service spark-history-server status
Spark history-server is running                            [  OK  ]

As you can see, it looks normal, but examining http://quickstart.cloudera:18080/, the spark-master UI, I can see that:


    URL: spark://
    REST URL: spark:// (cluster mode)
    Alive Workers: 0
    Cores in use: 0 Total, 0 Used
    Memory in use: 0.0 B Total, 0.0 B Used
    Applications: 0 Running, 0 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

Zero cores in use out of zero in total, and no memory! That's strange, because spark-master is started with -Dspark.deploy.defaultCores=4 and 1 GB of heap (-Xms1g -Xmx1g -XX:MaxPermSize=256m).


Then, you can see this output from spark-worker:


    ID: worker-20160603111341-
    Master URL:
    Cores: 4 (0 Used)
    Memory: 6.7 GB (0.0 B Used)

The master URL is not set up, and the worker reports 4 cores and 6.7 GB, even though spark-worker is running with this setup:


-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker spark://quickstart.cloudera:7077
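On the cores question above: spark.deploy.defaultCores is only the default cap for applications that do not set spark.cores.max themselves; it is not a reservation by the master. The worker advertising 4 cores / 6.7 GB is simply what it offers to applications (the -Xms1g/-Xmx1g flags are the worker daemon's own JVM heap, not executor memory). A hedged sketch of requesting resources explicitly from the application side, using standard Spark 1.x properties with illustrative values:

```scala
import org.apache.spark.SparkConf

// spark.deploy.defaultCores only caps apps that do not set spark.cores.max;
// an application can ask the standalone cluster for resources explicitly.
val conf = new SparkConf()
  .setAppName("RatingsApp")
  .set("spark.cores.max", "2")        // total cores to take from the cluster
  .set("spark.executor.memory", "1g") // heap per executor, within the worker's 6.7 GB
```

With nothing set, an application on this cluster would be capped at the master's defaultCores value of 4.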

What do you think? What can I do to continue developing my project? That is what I want: to use this VMware image to develop the project, using HDFS to load a tiny file of only 16 MB.


What annoys me most is that the code works perfectly in the spark-shell of the virtual image, but when I try to run it programmatically, building the Unix launcher with the sbt-pack command, it does not work.






It looks like the networking issue is resolved with the changes to the hosts files. For the remaining issues you may have better luck posting in the Spark forum specifically ( - I suspect that outside of that forum there won't be many readers familiar with the trickier parts of Spark configuration, and sbt-pack in particular.


Thank you, Sean, but the link that you provided is returning this message:

The core node you are trying to access was not found, it may have been deleted. Please refresh your original page and try the operation again.

EDIT: I have now noticed that there is an extra ")".