Apache Spark: read a file from HDFS as one large string


I would like to read a large JSON file from HDFS as a single string and then apply some string manipulations.

I do not want it transformed into an RDD, which is what happens with sc.textFile.

Is there a way I can do that using Spark and Scala?

Or do I need to read the file another way, preferably without having to dig into the Hive configuration files?

Thank you

1 ACCEPTED SOLUTION

Expert Contributor

Hi,

You can do it by creating a simple connection to HDFS with the HDFS client.

For example, in Java you can do the following:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster configuration so the client knows how to reach HDFS
Configuration confFS = new Configuration();
confFS.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
confFS.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
FileSystem dfs2 = FileSystem.newInstance(confFS);

// HDFS file to read
Path pt = new Path("/your/file/to/read");

// Open the file and read it line by line
BufferedReader br = new BufferedReader(new InputStreamReader(dfs2.open(pt)));
String myLine;
while ((myLine = br.readLine()) != null) {
    System.out.println(myLine);
}
br.close();
dfs2.close();

This code creates a single connection to HDFS and reads, line by line, the file defined in the variable pt.
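
For the Spark-and-Scala side of the original question (getting the whole file as one string), a minimal sketch along the same lines could look like the following. It assumes sc is an already-created SparkContext (as in spark-shell) and /your/file/to/read is a placeholder path.

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Option 1: wholeTextFiles still returns an RDD, but each element holds an
// entire file's content as one string, so a single file can be taken directly.
val wholeFile: String = sc.wholeTextFiles("/your/file/to/read").first()._2

// Option 2: bypass Spark and read the file through the Hadoop FileSystem API,
// reusing the Hadoop configuration that the SparkContext already carries.
val fs = FileSystem.get(sc.hadoopConfiguration)
val in = fs.open(new Path("/your/file/to/read"))
val content: String =
  try Source.fromInputStream(in, "UTF-8").mkString
  finally in.close()

Either way the entire file ends up in memory as a single string, so this is only practical while the file comfortably fits in the heap of the process doing the read.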


2 REPLIES 2

New Contributor

Hello,

I have the same problem. I read a large XML file (~1 GB) and then do some calculations. Have you found a solution?

Regards,