
Storing a DataFrame as a text file in HDFS

Rising Star

I work with Spark DataFrames and I would like to know how to store the data of a DataFrame in a text file in HDFS.

I tried saveAsTextFile() but it does not work. Thank you.




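A plausible form of the write being described below, assuming PySpark 1.x with the spark-csv package (on Spark 2.x and later the built-in df.write.csv("/tmp/df.csv") behaves the same way):

# assumes the com.databricks:spark-csv package is on the classpath
df.write.format("com.databricks.spark.csv").save("/tmp/df.csv")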
It will create a folder /tmp/df.csv in HDFS with part-00000 as the CSV.

Since a DataFrame is a columnar representation, CSV is the best option for keeping it as a text file.

Any issues with CSV?

@alain TSAFACK

df.write.text("path-to-output") is what you might be looking for.

Rising Star
But it generates an error:
AttributeError: 'DataFrameWriter' object has no attribute 'text'
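Note: DataFrameWriter.text() was only added in Spark 1.6, so this error suggests an older version. On 1.6 and later, text() also requires exactly one string column; a minimal sketch of meeting that requirement, with /tmp/df_text as an illustrative path:

from pyspark.sql.functions import col, concat_ws

# text() accepts only a single string column, so collapse each row into one
df.select(concat_ws(",", *[col(c).cast("string") for c in df.columns]).alias("value")).write.text("/tmp/df_text")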

You can also use df.rdd.saveAsTextFile("/tmp/df.txt")

Again, this will be a folder containing a file part-00000 that holds lines like [abc,42]
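If the bracketed Row rendering is not what you want, a sketch of flattening each Row into a plain comma-separated line first (/tmp/df_plain.txt is an illustrative path):

# join each Row's values with commas instead of keeping the [.,.] rendering
df.rdd.map(lambda row: ",".join(str(c) for c in row)).saveAsTextFile("/tmp/df_plain.txt")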

Rising Star

Thank you.

Please, is it possible to replace part-00000 with a file name of my choosing, e.g. command.txt?

I am not sure that this is what you want. If you have more than one Spark executor, then every executor will independently write its part of the data (one file per RDD partition). For example, with two executors it looks like:

hdfs dfs -ls /tmp/df.txt
Found 3 items
-rw-r--r--   3 root hdfs          0 2016-07-01 14:07 /tmp/df.txt/_SUCCESS
-rw-r--r--   3 root hdfs      83327 2016-07-01 14:07 /tmp/df.txt/part-00000
-rw-r--r--   3 root hdfs      83126 2016-07-01 14:07 /tmp/df.txt/part-00001

This is why the "filename" is actually a folder. When you use this folder name as input in other Hadoop tools, they will read all the files below it (as if it were one file). It is all about supporting distributed computation and writes.
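For example, a sketch of reading the folder back in Spark (sc is the SparkContext; pointing textFile at the directory picks up every part file inside):

# the folder behaves as one logical dataset
lines = sc.textFile("/tmp/df.txt")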

However, if you want to force a single "part" file, you need to force Spark to write everything from a single partition.
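A sketch of one way to do that, assuming the same RDD-based write as above (coalesce(1) collapses the data into a single partition, and therefore a single part file):

# one partition in, one part-00000 out
df.rdd.coalesce(1).saveAsTextFile("/tmp/df2.txt")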


It then looks like

hdfs dfs -ls /tmp/df2.txt
Found 2 items
-rw-r--r--   3 root hdfs          0 2016-07-01 16:34 /tmp/df2.txt/_SUCCESS
-rw-r--r--   3 root hdfs     166453 2016-07-01 16:34 /tmp/df2.txt/part-00000

Note the size: part-00000 is now 166453 bytes, the sum of the two part files above (83327 + 83126).

You can then copy it to a file name of your choosing:

hdfs dfs -cp /tmp/df2.txt/part-00000 /tmp/df3.txt

Having described all that, I still think the proper Spark way is to keep the standard write and treat the output folder as the dataset.


