Apache Spark: read a file from HDFS as one large string


I would like to read a large JSON file from HDFS as a single string and then apply some string manipulations.

I do not want it transformed into an RDD, which is what happens with sc.textFile.

Is there a way I can do that using Spark and Scala?

Or do I need to read the file another way, preferably without having to dig into the Hive configuration files?

Thank you

1 ACCEPTED SOLUTION

Expert Contributor

Hi,

You can do it by creating a simple connection to HDFS with the HDFS client.

For example, in Java you can do the following:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster configuration so the client knows how to reach HDFS
Configuration confFS = new Configuration();
confFS.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
confFS.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
FileSystem dfs2 = FileSystem.newInstance(confFS);

// HDFS file to read
Path pt = new Path("/your/file/to/read");

// Open the file and read it line by line
BufferedReader br = new BufferedReader(new InputStreamReader(dfs2.open(pt)));
String myLine;
while ((myLine = br.readLine()) != null) {
    System.out.println(myLine);
}
br.close();
dfs2.close();

This code creates a single connection to HDFS and reads, line by line, the file defined in the variable pt.
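
For the Spark-and-Scala side of the original question (getting the whole file as one string), a minimal sketch along the same lines could look like the following. It assumes sc is an already-created SparkContext (as in spark-shell) and /your/file/to/read is a placeholder path.

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Option 1: wholeTextFiles still returns an RDD, but each element holds an
// entire file's content as one string, so a single file can be taken directly.
val wholeFile: String = sc.wholeTextFiles("/your/file/to/read").first()._2

// Option 2: bypass Spark and read the file through the Hadoop FileSystem API,
// reusing the Hadoop configuration that the SparkContext already carries.
val fs = FileSystem.get(sc.hadoopConfiguration)
val in = fs.open(new Path("/your/file/to/read"))
val content: String =
  try Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()

Either way the entire file ends up in memory as a single string, so this is only practical while the file comfortably fits in the heap of the process doing the read.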


2 REPLIES 2

New Contributor

Hello,

I have the same problem. I read a large XML file (~1 GB) and then do some calculations. Have you found a solution?

Regards,