Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Apache Spark: read in a file from HDFS as one large string

Not applicable

I would like to read a large JSON file from HDFS as a single string and then apply some string manipulations.

I don't want it transformed into an RDD, which is what happens with sc.textFile.

Is there a way I can do that using Spark and Scala?

Or do I need to read the file another way, preferably without having to look at the Hive configuration files?

Thank you

1 ACCEPTED SOLUTION

Expert Contributor

Hi,

You can do this by creating a simple connection to HDFS with the HDFS client.

For example, in Java you can do the following:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster configuration and open a connection to HDFS.
// Note: addResource needs a Path for filesystem locations; a bare
// String is interpreted as a classpath resource.
Configuration confFS = new Configuration();
confFS.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
confFS.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
FileSystem dfs2 = FileSystem.newInstance(confFS);

Path pt = new Path("/your/file/to/read");

// Read the file line by line through the HDFS input stream
BufferedReader br = new BufferedReader(new InputStreamReader(dfs2.open(pt)));
String myLine;
while ((myLine = br.readLine()) != null) {
    System.out.println(myLine);
}
br.close();
dfs2.close();

This code will create a single connection to HDFS and read the file defined in the variable pt.
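Since the goal is one large string rather than line-by-line printing, the same read loop can accumulate the lines instead. A minimal sketch of that accumulation pattern, using a local file as a stand-in for the HDFS input stream (with HDFS, the BufferedReader would come from dfs2.open(pt) as in the snippet above; the file name and readAll helper here are illustrative, not part of any API):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class ReadWholeFile {
    // Accumulate every line from the reader into one string,
    // re-inserting the line breaks that readLine() strips.
    static String readAll(BufferedReader br) throws IOException {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            if (sb.length() > 0) sb.append('\n');
            sb.append(line);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an HDFS file: a small local JSON fragment
        FileWriter fw = new FileWriter("sample.json");
        fw.write("{\"a\": 1,\n\"b\": 2}");
        fw.close();

        BufferedReader br = new BufferedReader(new FileReader("sample.json"));
        String content = readAll(br);
        br.close();

        // The whole file is now a single string, ready for manipulation
        System.out.println(content.length());
    }
}
```

For very large files, keep in mind that the entire content is held in driver memory, so this only works when the file comfortably fits in the JVM heap.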


2 REPLIES


New Member

Hello,

I have the same problem. I read a large XML file (~1 GB) and then do some calculations. Have you found a solution?

Regards,