
Storing a DataFrame as a text file in HDFS

Expert Contributor
Hello,

I am working with Spark DataFrames and would like to know how to store the data of a DataFrame in a text file in HDFS.

I tried saveAsTextFile() but it does not work. Thank you.

8 REPLIES


Try

df.write.format("csv").save("/tmp/df.csv")

It will create a folder /tmp/df.csv in HDFS with a part-00000 file holding the CSV data.
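If you need a header row or a different delimiter, the csv writer also takes options. A minimal sketch, assuming Spark 2.x where the csv source is built in (on Spark 1.x you would need the spark-csv package and format("com.databricks.spark.csv")); the path here is just an example:

# Write CSV with a header row and tab as the field separator.
df.write.format("csv") \
    .option("header", "true") \
    .option("sep", "\t") \
    .save("/tmp/df_with_header.csv")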


Since it is a DataFrame (a columnar representation), CSV is the best option for storing it as a text file.

Any issues with csv?

@alain TSAFACK

df.write.text("path-to-output") is what you might be looking for.

Expert Contributor
Thanks, but it generates an error:

AttributeError: 'DataFrameWriter' object has no attribute 'text'
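That error usually means an older Spark version: DataFrameWriter.text() was only added in Spark 1.6, and even there it expects a single string column. On 1.6+ you could first collapse the row into one string column; a sketch, where the comma separator and output path are just examples:

from pyspark.sql.functions import concat_ws

# write.text() expects exactly one string column, so collapse each row into one.
line = concat_ws(",", *[df[c].cast("string") for c in df.columns])
df.select(line.alias("value")).write.text("/tmp/df_text")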



You can also use df.rdd.saveAsTextFile("/tmp/df.txt")

Again, this will be a folder with a part-00000 file holding lines like [abc,42]
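If the square brackets of the Row representation bother you, map each row to a plain string before saving. A sketch; adjust the separator and path to taste:

# Turn each Row into a comma-separated line instead of its [abc,42]-style repr.
df.rdd.map(lambda row: ",".join(str(c) for c in row)).saveAsTextFile("/tmp/df_plain.txt")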

Expert Contributor

Thank you.

Is it possible to replace part-00000 with a file name of my choosing, for example command.txt?


I am not sure that this is what you want. If you have more than one Spark executor, every executor will independently write its part of the data (one file per RDD partition). For example, with two executors it looks like:

hdfs dfs -ls /tmp/df.txt
Found 3 items
-rw-r--r--   3 root hdfs          0 2016-07-01 14:07 /tmp/df.txt/_SUCCESS
-rw-r--r--   3 root hdfs      83327 2016-07-01 14:07 /tmp/df.txt/part-00000
-rw-r--r--   3 root hdfs      83126 2016-07-01 14:07 /tmp/df.txt/part-00001

This is why the filename becomes a folder. When you use this folder name as input in other Hadoop tools, they will read all files below it (as if it were one file). It is all about supporting distributed computation and writes.

However, if you want to force a single "part" file, you need to make Spark write everything from a single partition:

bank.rdd.repartition(1).saveAsTextFile("/tmp/df2.txt")
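If the data is large, coalesce(1) should give the same single part file without the full shuffle that repartition(1) triggers:

bank.rdd.coalesce(1).saveAsTextFile("/tmp/df2.txt")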

It then looks like

hdfs dfs -ls /tmp/df2.txt
Found 2 items
-rw-r--r--   3 root hdfs          0 2016-07-01 16:34 /tmp/df2.txt/_SUCCESS
-rw-r--r--   3 root hdfs     166453 2016-07-01 16:34 /tmp/df2.txt/part-00000

Note the file size and compare it with the listing above.

You can then copy it to whatever file name you want:

hdfs dfs -cp /tmp/df2.txt/part-00000 /tmp/df3.txt
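If you want the merged result on the local filesystem instead, hdfs dfs -getmerge concatenates all part files in one step (note the destination is a local path, not HDFS):

hdfs dfs -getmerge /tmp/df2.txt /tmp/df3.txt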


Having described all that, I still think the proper Spark way is to use

df.write.format("csv").save("/tmp/df.csv")

or

df.repartition(1).write.format("csv").save("/tmp/df.csv")
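Putting the thread together, a minimal runnable sketch, assuming Spark 2.x; the DataFrame contents, app name, and paths here are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-to-hdfs").getOrCreate()

# Toy DataFrame standing in for the real data.
df = spark.createDataFrame([("abc", 42), ("def", 7)], ["name", "value"])

# One part file, written as CSV under the /tmp/df.csv folder in HDFS.
df.repartition(1).write.format("csv").option("header", "true").save("/tmp/df.csv")

spark.stop()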