Support Questions

Apache spark read in a file from hdfs as one large string

I would like to read a large JSON file from HDFS as a single string and then apply some string manipulations.

I don't want it transformed into an RDD of lines, which is what happens with sc.textFile.

Is there a way I can do that using Spark and Scala?

Or do I need to read the file another way, preferably without having to dig into the Hive configuration files?

Thank you

1 ACCEPTED SOLUTION


Re: Apache spark read in a file from hdfs as one large string

Expert Contributor

Hi,

You can do this by creating a simple connection to HDFS with the HDFS client.

For example, in Java you can do the following:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Load the cluster configuration so the client can locate the NameNode.
// Note: addResource needs a Path for an absolute file location; a plain
// String is interpreted as a classpath resource name.
Configuration confFS = new Configuration();
confFS.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
confFS.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
FileSystem dfs2 = FileSystem.newInstance(confFS);

Path pt = new Path("/your/file/to/read");

// Open the HDFS file and read it line by line
BufferedReader br = new BufferedReader(new InputStreamReader(dfs2.open(pt)));
String myLine;
while ((myLine = br.readLine()) != null) {
	System.out.println(myLine);
}
br.close();
dfs2.close();

This code creates a single connection to HDFS and reads the file referenced by the variable pt, printing it line by line.
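Since the goal is one large string rather than printed lines, the same reader can be drained into a StringBuilder instead. A minimal sketch; the helper name readAll is illustrative, and it is written against a generic Reader so it works equally with the HDFS stream above (new InputStreamReader(dfs2.open(pt))) or any other source:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadWholeFile {
    // Drain any Reader into a single String, preserving line breaks as '\n'.
    public static String readAll(Reader source) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (!first) sb.append('\n');
                sb.append(line);
                first = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Demonstrated with an in-memory reader in place of an HDFS stream
        String whole = readAll(new StringReader("{\"a\": 1,\n \"b\": 2}"));
        System.out.println(whole.length());
    }
}
```

Keep in mind this holds the entire file in memory, so it is only suitable for files that comfortably fit in the driver's heap.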

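Worth noting: Spark itself can also hand you a whole file as one string without a separate HDFS client. wholeTextFiles reads each file under a path into a single (path, content) pair instead of splitting it into lines; in Scala this is sc.wholeTextFiles(...). A sketch using the Java API, assuming a local Spark context and reusing the placeholder path from above (this requires a running Spark environment, and like any whole-file read it loads each file fully into memory):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WholeFileRead {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("whole-file").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Each matched file becomes one (path, fullContent) pair
        JavaPairRDD<String, String> files = jsc.wholeTextFiles("/your/file/to/read");
        String content = files.values().first(); // the whole file as one string

        System.out.println(content.length());
        jsc.close();
    }
}
```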

2 REPLIES


Re: Apache spark read in a file from hdfs as one large string

New Contributor

Hello,

I have the same problem. I read a large XML file (~1 GB) and then do some calculations. Have you found a solution?

Regards,
