Created 07-01-2016 08:35 AM
I am working with Spark DataFrames and would like to know how to store the data of a DataFrame in a text file in HDFS.
I tried saveAsTextFile() but it does not work. Thank you.
Created 07-01-2016 09:08 AM
Try
df.write.format("csv").save("/tmp/df.csv")
It will create a folder /tmp/df.csv in HDFS with part-00000 as the CSV data.
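A slightly fuller sketch, assuming Spark 2.x where the csv data source is built in (on Spark 1.x you would need the external spark-csv package instead); the header option is just an example:
# Minimal sketch: write the DataFrame as CSV, one part file per partition.
# The "header" option is optional; it puts the column names in the first line.
df.write.format("csv").option("header", "true").save("/tmp/df.csv")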
Created 07-01-2016 12:03 PM
Since a DataFrame has a columnar representation, CSV is the best option for storing it as a text file.
Any issues with csv?
Created 07-01-2016 11:04 AM
df.write.text("path-to-output") is what you might be looking for.
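Note that DataFrameWriter.text was only added in Spark 1.6, and it expects the DataFrame to contain exactly one string column. A minimal sketch that first collapses all columns into one string per row (the comma separator is just an example):
from pyspark.sql.functions import concat_ws

# write.text accepts only a single string column, so join all columns
# into one comma-separated string per row before writing.
df.select(concat_ws(",", *df.columns)).write.text("path-to-output")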
Created 07-01-2016 11:12 AM
Thanks. But it generates an error: AttributeError: 'DataFrameWriter' object has no attribute 'text'
Created 07-01-2016 12:10 PM
You can also use df.rdd.saveAsTextFile("/tmp/df.txt")
Again, this will be a folder with a part-00000 file holding lines like [abc,42].
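If you want control over the line format instead of the default Row string representation, a sketch like this works (the comma join is just an example):
# Sketch: format each row yourself before saving, instead of relying on
# the default Row string representation.
df.rdd.map(lambda row: ",".join(str(v) for v in row)).saveAsTextFile("/tmp/df.txt")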
Created 07-01-2016 01:55 PM
Thank you.
Is it possible to replace part-00000 with a file name I want, for example command.txt?
Created 07-01-2016 02:40 PM
I am not sure that this is what you want. If you have more than one Spark executor, then every executor will independently write its part of the data (one file per RDD partition). For example, with two executors it looks like:
hdfs dfs -ls /tmp/df.txt
Found 3 items
-rw-r--r--   3 root hdfs       0 2016-07-01 14:07 /tmp/df.txt/_SUCCESS
-rw-r--r--   3 root hdfs   83327 2016-07-01 14:07 /tmp/df.txt/part-00000
-rw-r--r--   3 root hdfs   83126 2016-07-01 14:07 /tmp/df.txt/part-00001
This is why the given filename becomes a folder. When you use this folder name as input in other Hadoop tools, they read all the files inside it (as if they were one file). It is all about supporting distributed computation and writes.
However, if you want to force a single "part" file, you need to force Spark to write the data as a single partition:
df.rdd.repartition(1).saveAsTextFile("/tmp/df2.txt")
It then looks like
hdfs dfs -ls /tmp/df2.txt
Found 2 items
-rw-r--r--   3 root hdfs       0 2016-07-01 16:34 /tmp/df2.txt/_SUCCESS
-rw-r--r--   3 root hdfs  166453 2016-07-01 16:34 /tmp/df2.txt/part-00000
Note the file size and compare it with the listing above.
You can then copy the part file to the file name you want:
hdfs dfs -cp /tmp/df2.txt/part-00000 /tmp/df3.txt
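As a side note, coalesce(1) also yields a single part file and, unlike repartition(1), avoids a full shuffle when reducing the number of partitions:
# Sketch: coalesce(1) also produces a single part-00000 but skips the
# full shuffle that repartition(1) triggers.
df.rdd.coalesce(1).saveAsTextFile("/tmp/df2.txt")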
Created 07-01-2016 02:43 PM
Having described all that, I still think the proper Spark way is to use
df.write.format("csv").save("/tmp/df.csv")
or
df.repartition(1).write.format("csv").save("/tmp/df.csv")
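If you really do need an exact HDFS file name such as command.txt, a hedged sketch (assuming the hdfs CLI is on the PATH; the paths and names here are examples only, and the part-* glob covers the varying part-file names) is to write a single partition, copy the part file, and clean up:
import subprocess

# Sketch: write a single part file, copy it to the exact name wanted,
# then remove the temporary output folder.
df.repartition(1).write.format("csv").save("/tmp/df_tmp.csv")
subprocess.check_call(["hdfs", "dfs", "-cp", "/tmp/df_tmp.csv/part-*", "/tmp/command.txt"])
subprocess.check_call(["hdfs", "dfs", "-rm", "-r", "/tmp/df_tmp.csv"])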