<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Apache spark read in a file from hdfs as one large string in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Apache-spark-read-in-a-file-from-hdfs-as-one-large-string/m-p/179317#M64795</link>
    <description>&lt;P&gt;
	Hi,&lt;/P&gt;&lt;P&gt;
	You can do it by creating a simple connection to HDFS with the HDFS client.&lt;/P&gt;&lt;P&gt;
	For example, in Java you can do the following:&lt;/P&gt;&lt;PRE&gt;import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster configuration; absolute paths must be added as Path
// objects (the String overload resolves against the classpath instead)
Configuration confFS = new Configuration();
confFS.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
confFS.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
FileSystem dfs2 = FileSystem.newInstance(confFS);

Path pt = new Path("/your/file/to/read");

// Open the file and read it line by line
BufferedReader br = new BufferedReader(new InputStreamReader(dfs2.open(pt)));
String myLine;
while ((myLine = br.readLine()) != null) {
	System.out.println(myLine);
}
br.close();
dfs2.close();
&lt;/PRE&gt;&lt;P&gt;This code creates a single connection to HDFS and reads the file defined in the variable pt.&lt;/P&gt;</description>
    <pubDate>Fri, 14 Jul 2017 16:57:03 GMT</pubDate>
    <dc:creator>msumbul1</dc:creator>
    <dc:date>2017-07-14T16:57:03Z</dc:date>
    <item>
      <title>Apache spark read in a file from hdfs as one large string</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Apache-spark-read-in-a-file-from-hdfs-as-one-large-string/m-p/179316#M64794</link>
      <description>&lt;P&gt;I would like to read a large JSON file from HDFS as a single string and then apply some string manipulations.&lt;/P&gt;&lt;P&gt;I don't want it transformed into an RDD, which is what happens with sc.textFile.&lt;/P&gt;&lt;P&gt;Is there a way I can do that using Spark and Scala?&lt;/P&gt;&lt;P&gt;Or do I need to read the file another way, preferably without having to deal with the Hive configuration files?&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
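For the Spark side of the question, `wholeTextFiles` reads each file as a single (path, content) record instead of splitting it into one record per line the way `sc.textFile` does. A minimal Java sketch of that approach; the local master setting and the HDFS path are placeholder assumptions, not from the thread:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class WholeFileExample {
    public static void main(String[] args) {
        // Placeholder config: a real job would usually get master/app name
        // from spark-submit rather than hard-coding them
        SparkConf conf = new SparkConf().setAppName("whole-file").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // wholeTextFiles yields (path, fileContent) pairs, so each file
        // arrives as one string; "/your/file.json" is a placeholder path
        String content = sc.wholeTextFiles("hdfs:///your/file.json")
                           .values()
                           .first();

        System.out.println(content.length());
        sc.close();
    }
}
```

Note that `wholeTextFiles` loads each file entirely into one executor's memory, so it suits files that fit comfortably in a single JVM heap.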
      <pubDate>Thu, 13 Jul 2017 04:10:45 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Apache-spark-read-in-a-file-from-hdfs-as-one-large-string/m-p/179316#M64794</guid>
      <dc:creator>Former Member</dc:creator>
      <dc:date>2017-07-13T04:10:45Z</dc:date>
    </item>
    <item>
      <title>Re: Apache spark read in a file from hdfs as one large string</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Apache-spark-read-in-a-file-from-hdfs-as-one-large-string/m-p/179317#M64795</link>
      <description>&lt;P&gt;
	Hi,&lt;/P&gt;&lt;P&gt;
	You can do it by creating a simple connection to HDFS with the HDFS client.&lt;/P&gt;&lt;P&gt;
	For example, in Java you can do the following:&lt;/P&gt;&lt;PRE&gt;import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster configuration; absolute paths must be added as Path
// objects (the String overload resolves against the classpath instead)
Configuration confFS = new Configuration();
confFS.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
confFS.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
FileSystem dfs2 = FileSystem.newInstance(confFS);

Path pt = new Path("/your/file/to/read");

// Open the file and read it line by line
BufferedReader br = new BufferedReader(new InputStreamReader(dfs2.open(pt)));
String myLine;
while ((myLine = br.readLine()) != null) {
	System.out.println(myLine);
}
br.close();
dfs2.close();
&lt;/PRE&gt;&lt;P&gt;This code creates a single connection to HDFS and reads the file defined in the variable pt.&lt;/P&gt;</description>
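Since the original question asks for one large string rather than printed lines, the same reader loop can append into a StringBuilder. A standalone sketch of that accumulation pattern, shown against an in-memory reader so it runs anywhere; with HDFS, the reader would wrap `dfs2.open(pt)` as in the answer above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ReadAllExample {
    // Collect every line from the reader into a single string, restoring
    // the newline separators that readLine() strips off
    static String readAll(BufferedReader br) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: new BufferedReader(new InputStreamReader(dfs2.open(pt)))
        BufferedReader br = new BufferedReader(new StringReader("line one\nline two"));
        System.out.print(readAll(br));
    }
}
```

This keeps the whole file in driver-side memory, so like any single-string read it is only appropriate when the file fits in the JVM heap.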
      <pubDate>Fri, 14 Jul 2017 16:57:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Apache-spark-read-in-a-file-from-hdfs-as-one-large-string/m-p/179317#M64795</guid>
      <dc:creator>msumbul1</dc:creator>
      <dc:date>2017-07-14T16:57:03Z</dc:date>
    </item>
    <item>
      <title>Re: Apache spark read in a file from hdfs as one large string</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Apache-spark-read-in-a-file-from-hdfs-as-one-large-string/m-p/179318#M64796</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have the same problem. I read a large XML file (~1 GB) and then do some calculations. Have you found a solution?&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;</description>
      <pubDate>Thu, 05 Apr 2018 19:00:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Apache-spark-read-in-a-file-from-hdfs-as-one-large-string/m-p/179318#M64796</guid>
      <dc:creator>yidhir_moudoub</dc:creator>
      <dc:date>2018-04-05T19:00:20Z</dc:date>
    </item>
  </channel>
</rss>