Support Questions
Find answers, ask questions, and share your expertise

About ports to write to HDFS in pseudo distributed mode using CDH 5.4.2.0 image

Explorer

Hi, I have this Scala code to write to HDFS from my laptop using this image:

 

package org.glassfish.samples

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

object AnotherApp {
  def write(uri: String, filePath: String, data: Array[Byte]): Unit = {
    //System.setProperty("HADOOP_USER_NAME", "root")
    System.setProperty("HADOOP_USER_NAME", "cloudera")
    val path = new Path(filePath)
    val conf = new Configuration()
    conf.set("fs.defaultFS", uri) // point the client at the NameNode
    val fs = FileSystem.get(conf)
    val os = fs.create(path) // creates the file entry on the NameNode
    os.write(data) // bytes are buffered client-side until the stream is flushed/closed
    fs.close()
  }

  def main(args: Array[String]): Unit = {
    //val conf = ConfigFactory.load()
    //write(conf.getString("hdfs.uri"), conf.getString("hdfs.result_path"), "Hello World".getBytes)
    write("hdfs://192.168.30.147:8020", "/tmp/helloworld.txt", "Hello World".getBytes)
  }
}

I ran it and I can see the new file /tmp/helloworld.txt, but it has no content, so I suspect it is a matter of ports.
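Note that write above never closes the output stream it creates, only the FileSystem; relying on FileSystem.close() to flush open streams is fragile, and an unflushed stream is another way to end up with an empty file. A minimal sketch of the same write with an explicit close (same imports and settings as the code above):

def writeAndClose(uri: String, filePath: String, data: Array[Byte]): Unit = {
  System.setProperty("HADOOP_USER_NAME", "cloudera")
  val conf = new Configuration()
  conf.set("fs.defaultFS", uri)
  val fs = FileSystem.get(conf)
  val os = fs.create(new Path(filePath))
  try {
    os.write(data)
  } finally {
    os.close() // flushes the client buffer and completes the block write
    fs.close()
  }
}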

 

My question is: what is the port to write to HDFS in pseudo-distributed mode? And how can I see the ports in use from Hue?

 

Thank you.

4 REPLIES

Re: About ports to write to HDFS in pseudo distributed mode using CDH 5.4.2.0 image

Master Collaborator
One thing to keep in mind is that even in pseudo-distributed mode, you're still talking to multiple independent processes. The metadata will get created with the NameNode, but the data will get written to the DataNode. You can find details on the ports for HDFS in this blog post: http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/. I'd check the logs on both your client and the daemons (both NameNode and DataNode) and see if there's any indication of failure even after the initial file is created.

One common misconfiguration in pseudo-distributed mode is the networking between your client and the servers. The server processes, since they're all on the same machine, may be configured to use '127.0.0.1' to refer to each other. This is a problem when an external client is communicating with those servers, because that IP address, when resolved on your client, no longer resolves to the correct machine - so it looks to your client like the DataNode is down.
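If that's the failure mode, one client-side setting worth trying (a sketch, not a verified fix for this particular VM) is to have the HDFS client connect to DataNodes by hostname rather than by the IP the NameNode reports, and to map that hostname on the client, e.g. in /etc/hosts:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://192.168.30.147:8020")
// Ask the client to dial DataNodes by hostname instead of the
// (possibly loopback) IP address the NameNode reports for them.
conf.setBoolean("dfs.client.use.datanode.hostname", true)

val fs = FileSystem.get(conf)
val os = fs.create(new Path("/tmp/helloworld.txt"))
os.write("Hello World".getBytes)
os.close()
fs.close()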

Hope that helps...


Re: About ports to write to HDFS in pseudo distributed mode using CDH 5.4.2.0 image

Explorer

Hi again Sean, I did not see your reply until now, when I have a similar problem trying to access a file programmatically. I have a VMware Cloudera image, CDH 5.4.2.0 running on CentOS 6, and I am using OS X as my host machine. I have modified my /etc/hosts file with a line like this:

MacBook-Pro-Retina-de-Alonso:bin aironman$ cat /etc/hosts
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
127.0.0.1 my-cassandra-node-001
255.255.255.255 broadcasthost
::1 localhost
192.168.30.137 quickstart.cloudera quickstart



You can see that I can reach the VMware machine from the host machine:

MacBook-Pro-Retina-de-Alonso:bin aironman$ ping quickstart.cloudera
PING quickstart.cloudera (192.168.30.137): 56 data bytes
64 bytes from 192.168.30.137: icmp_seq=0 ttl=64 time=0.293 ms
64 bytes from 192.168.30.137: icmp_seq=1 ttl=64 time=0.273 ms
64 bytes from 192.168.30.137: icmp_seq=2 ttl=64 time=0.207 ms
64 bytes from 192.168.30.137: icmp_seq=3 ttl=64 time=0.240 ms
64 bytes from 192.168.30.137: icmp_seq=4 ttl=64 time=0.402 ms
^C
--- quickstart.cloudera ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.207/0.283/0.402/0.066 ms



And I can reach port 8020 on that machine:

MacBook-Pro-Retina-de-Alonso:bin aironman$ telnet quickstart.cloudera 8020
Trying 192.168.30.137...
Connected to quickstart.cloudera.
Escape character is '^]'.
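A successful telnet to 8020 only proves the NameNode RPC port is reachable; the data itself travels to a DataNode on a separate port (50010 is the Hadoop 2.x default for dfs.datanode.address - an assumption worth verifying in hdfs-site.xml). A small Scala check for both ports from the client side:

import java.net.{InetSocketAddress, Socket}

// Returns true if a TCP connection to host:port succeeds within the timeout.
def portOpen(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: java.io.IOException => false
  } finally {
    socket.close()
  }
}

println(portOpen("quickstart.cloudera", 8020))  // NameNode RPC
println(portOpen("quickstart.cloudera", 50010)) // DataNode data transfer (assumed default)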



I can run an ls command in the VMware machine:

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
-rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 /user/cloudera/ratings.csv



I can read its content:

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
568454



But when I try to access the file programmatically, I get an IOException:

Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:/quickstart.cloudera:8020/ratings.csv
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:133)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
at org.apache.spark.rdd.RDD$$anonfun$groupBy$1.apply(RDD.scala:620)
at org.apache.spark.rdd.RDD$$anonfun$groupBy$1.apply(RDD.scala:620)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.groupBy(RDD.scala:619)
at example.utils.Recommender.<init>(Recommender.scala:35)
at example.spark.AmazonKafkaConnector$.main(AmazonKafkaConnectorWithMongo.scala:109)
at example.spark.AmazonKafkaConnector.main(AmazonKafkaConnectorWithMongo.scala)


The code is:

val ratingFile = "hdfs:////quickstart.cloudera:8020/ratings.csv"

val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}


I also tried to use:

val ratingFile= "hdfs:////quickstart.cloudera:8020/user/cloudera/ratings.csv"



with the same result, an IOException.

I also tried disabling my local firewall, with the same result.

How can I access HDFS running in a VMware image from an external machine (my OS X laptop)? What files do I have to modify in the pseudo-distributed image in order to get read and write access to files located in the image? Please, help.

Re: About ports to write to HDFS in pseudo distributed mode using CDH 5.4.2.0 image

Master Collaborator
My guess is the Source.fromFile API you're using does not support HDFS. It's probably just looking for files in your native filesystem. HDFS can be mounted in your native filesystem via NFS (at least it can on Linux - I don't know if these options work on Mac - Cloudera definitely doesn't support them on Mac) or FUSE. To access files directly in HDFS you'll need to use Hadoop's own APIs. There's an example here: https://tutorials.techmytalk.com/2014/08/16/hadoop-hdfs-java-api/. Here are the JavaDocs specifically for the 5.4.2 release, so you can check details of the API here: http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.4.2/api/index.html.
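Separately, the "Incomplete HDFS URI, no host" message in the stack trace points at the URI itself: with hdfs://// the authority part is empty, so no host gets parsed. A sketch of a direct read through Hadoop's FileSystem API, assuming exactly two slashes before the host and the path from the earlier ls output:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val conf = new Configuration()
// Exactly two slashes before the host: hdfs://<host>:<port>
conf.set("fs.defaultFS", "hdfs://quickstart.cloudera:8020")

val fs = FileSystem.get(conf)
val in = fs.open(new Path("/user/cloudera/ratings.csv"))
try {
  // Mirrors the earlier `hdfs dfs -cat ... | wc -l` check.
  println(Source.fromInputStream(in).getLines().size)
} finally {
  in.close()
  fs.close()
}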

Re: About ports to write to HDFS in pseudo distributed mode using CDH 5.4.2.0 image

Explorer

Hi Sean, thank you very much for the answer. The real problem is described here:


https://community.cloudera.com/t5/Apache-Hadoop-Concepts-and/I-cannot-access-programmatically-a-file...