How do you write an RDD as a tab-delimited file in PySpark?
Labels: Apache Spark
Created on 07-13-2016 05:37 AM, last edited on 09-19-2022 11:45 AM by cjervis
I have an RDD I'd like to write as a tab-delimited file. I also want to write a DataFrame as tab-delimited. How do I do this?
Created 07-13-2016 08:58 PM
Is your RDD an RDD of strings?
On the second part of the question: if you are using the spark-csv package, it supports saving simple (non-nested) DataFrames. There is an option to specify the delimiter, which is ',' (comma) by default but can be changed.
e.g. .save('filename.csv', 'com.databricks.spark.csv', delimiter="DELIM")
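For PySpark specifically, a minimal sketch using the DataFrameWriter API (Spark 1.4+), assuming the spark-csv package is on the classpath; df and the output path here are placeholders, not from this thread:

# Launch with the package available, e.g.:
#   pyspark --packages com.databricks:spark-csv_2.10:1.4.0
df.write \
    .format('com.databricks.spark.csv') \
    .option('delimiter', '\t') \
    .save('/tmp/output-tsv')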
Created 07-14-2016 03:24 AM
@Binu Mathew do you have any thoughts?
Created 07-14-2016 02:23 PM
Could you provide more details on the RDD that you would like to save as tab-delimited? On the question about storing DataFrames as a tab-delimited file, below is what I have in Scala using the spark-csv package:
df.write.format("com.databricks.spark.csv").option("delimiter", "\t").save("output path")
EDIT: With an RDD of tuples, as you mentioned, you could either join the tuple fields with "\t" or use mkString if you prefer not to use an additional library. On your RDD of tuples you could do something like:
.map { x => x.productIterator.mkString("\t") }.saveAsTextFile("path-to-store")
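A rough PySpark equivalent of the Scala snippet above, assuming an RDD of tuples (rdd and the path are placeholders):

# str() guards against non-string fields; drop it if every field is already a string
rdd.map(lambda row: '\t'.join(str(f) for f in row)).saveAsTextFile('path-to-store')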
Created 07-14-2016 04:04 PM
Essentially I have a Python tuple ('a','b','c','x','y','z') where all elements are strings. I could just map each tuple into a single concatenation ('a\tb\tc\tx\ty\tz') and then saveAsTextFile(path). But I was wondering if there was a better way than using an external package that may just be encapsulating that .map(lambda x: "\t".join(x)).
Created 07-14-2016 06:48 PM
If the data set does not contain a '\t' character, then '\t'.join and saveAsTextFile should work for you. Otherwise, you just need to wrap the strings in double quotes, as with normal CSVs.
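For the quoting case, Python's built-in csv module can handle the escaping; a minimal sketch along those lines (the helper name is illustrative, not from this thread):

import csv
import io

def to_tsv_line(row):
    # csv quotes any field that contains the delimiter or quote characters
    buf = io.StringIO()
    csv.writer(buf, delimiter='\t', quoting=csv.QUOTE_MINIMAL).writerow(row)
    return buf.getvalue().rstrip('\r\n')

rdd.map(to_tsv_line).saveAsTextFile('path-to-store')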
Created 07-18-2016 03:42 PM
Try this; note that it requires Spark version 1.5 and up:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
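To sanity-check the output, the same package can read the tab-delimited, gzip-compressed files back; a sketch assuming a Spark 1.5-style sqlContext and the same placeholder bucket/path as above:

# Compressed part files are decompressed transparently on read
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .option('delimiter', '\t') \
    .load('s3a://myBucket/myPath')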
