Created 07-01-2016 08:35 AM
I am working with Spark DataFrames and would like to know how to store the data of a DataFrame in a text file in HDFS.
I tried saveAsTextFile() but it does not work. Thank you.
Created 07-01-2016 09:08 AM
Try
df.write.format("csv").save("/tmp/df.csv")
It will create a folder /tmp/df.csv in HDFS with part-00000 as the CSV data.
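A slightly fuller sketch, assuming Spark 2.x where the csv data source is built in (on Spark 1.x you would need the external spark-csv package instead); the header option is just an example:
# Minimal sketch: write the DataFrame as CSV, one part file per partition.
# The "header" option is optional; it puts the column names in the first line.
df.write.format("csv").option("header", "true").save("/tmp/df.csv")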
Created 07-01-2016 12:03 PM
Since a DataFrame has a columnar representation, CSV is the best option for storing it as a text file.
Any issues with csv?
Created 07-01-2016 11:04 AM
df.write.text("path-to-output") is what you might be looking for.
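Note that DataFrameWriter.text was only added in Spark 1.6, and it expects the DataFrame to contain exactly one string column. A minimal sketch that first collapses all columns into one string per row (the comma separator is just an example):
from pyspark.sql.functions import concat_ws

# write.text accepts only a single string column, so join all columns
# into one comma-separated string per row before writing.
df.select(concat_ws(",", *df.columns)).write.text("path-to-output")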
Created 07-01-2016 11:12 AM
Thanks. But it generates an error: AttributeError: 'DataFrameWriter' object has no attribute 'text'
Created 07-01-2016 12:10 PM
You can also use df.rdd.saveAsTextFile("/tmp/df.txt")
Again, this will be a folder with a part-00000 file holding lines like [abc,42].
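If you want control over the line format instead of the default Row string representation, a sketch like this works (the comma join is just an example):
# Sketch: format each row yourself before saving, instead of relying on
# the default Row string representation.
df.rdd.map(lambda row: ",".join(str(v) for v in row)).saveAsTextFile("/tmp/df.txt")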
Created 07-01-2016 01:55 PM
Thank you.
Is it possible to replace part-00000 with a file name I want, for example command.txt?
Created 07-01-2016 02:40 PM
I am not sure that this is what you want. If you have more than one Spark executor, then every executor will independently write its part of the data (one file per RDD partition). For example, with two executors it looks like:
hdfs dfs -ls /tmp/df.txt
Found 3 items
-rw-r--r--   3 root hdfs       0 2016-07-01 14:07 /tmp/df.txt/_SUCCESS
-rw-r--r--   3 root hdfs   83327 2016-07-01 14:07 /tmp/df.txt/part-00000
-rw-r--r--   3 root hdfs   83126 2016-07-01 14:07 /tmp/df.txt/part-00001
This is why the given filename becomes a folder. When you use this folder name as input in other Hadoop tools, they read all the files inside it (as if they were one file). It is all about supporting distributed computation and writes.
However, if you want to force a single "part" file, you need to force Spark to write the data as a single partition:
df.rdd.repartition(1).saveAsTextFile("/tmp/df2.txt")
It then looks like
hdfs dfs -ls /tmp/df2.txt
Found 2 items
-rw-r--r--   3 root hdfs       0 2016-07-01 16:34 /tmp/df2.txt/_SUCCESS
-rw-r--r--   3 root hdfs  166453 2016-07-01 16:34 /tmp/df2.txt/part-00000
Note the file size and compare it with the listing above.
You can then copy the part file to the file name you want:
hdfs dfs -cp /tmp/df2.txt/part-00000 /tmp/df3.txt
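As a side note, coalesce(1) also yields a single part file and, unlike repartition(1), avoids a full shuffle when reducing the number of partitions:
# Sketch: coalesce(1) also produces a single part-00000 but skips the
# full shuffle that repartition(1) triggers.
df.rdd.coalesce(1).saveAsTextFile("/tmp/df2.txt")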
Created 07-01-2016 02:43 PM
Having described all that, I still think the proper Spark way is to use
df.write.format("csv").save("/tmp/df.csv")
or
df.repartition(1).write.format("csv").save("/tmp/df.csv")
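If you really do need an exact HDFS file name such as command.txt, a hedged sketch (assuming the hdfs CLI is on the PATH; the paths and names here are examples only, and the part-* glob covers the varying part-file names) is to write a single partition, copy the part file, and clean up:
import subprocess

# Sketch: write a single part file, copy it to the exact name wanted,
# then remove the temporary output folder.
df.repartition(1).write.format("csv").save("/tmp/df_tmp.csv")
subprocess.check_call(["hdfs", "dfs", "-cp", "/tmp/df_tmp.csv/part-*", "/tmp/command.txt"])
subprocess.check_call(["hdfs", "dfs", "-rm", "-r", "/tmp/df_tmp.csv"])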