How do you write an RDD as a tab-delimited file in PySpark?
Labels: Apache Spark
Created on 07-13-2016 05:37 AM, last edited on 09-19-2022 11:45 AM by cjervis
I have an RDD I'd like to write as a tab-delimited file. I also want to write a DataFrame as tab-delimited. How do I do this?
Created 07-13-2016 08:58 PM
Is your RDD an RDD of strings?
On the second part of the question: if you are using the spark-csv package, it supports saving simple (non-nested) DataFrames. There is an option to specify the delimiter, which is ',' (comma) by default but can be changed.
e.g. .save('filename.csv', 'com.databricks.spark.csv', delimiter="DELIM")
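For PySpark specifically, a minimal sketch using the DataFrameWriter API (Spark 1.4+), assuming the spark-csv package is on the classpath; df and the output path here are placeholders, not from this thread:

# Launch with the package available, e.g.:
#   pyspark --packages com.databricks:spark-csv_2.10:1.4.0
df.write \
    .format('com.databricks.spark.csv') \
    .option('delimiter', '\t') \
    .save('/tmp/output-tsv')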
Created 07-14-2016 03:24 AM
@Binu Mathew do you have any thoughts?
Created 07-14-2016 02:23 PM
Could you provide more details on the RDD that you would like to save as tab-delimited? On the question about storing DataFrames as a tab-delimited file, below is what I have in Scala using the spark-csv package:
df.write.format("com.databricks.spark.csv").option("delimiter", "\t").save("output path")
EDIT: With an RDD of tuples, as you mentioned, you could either join the tuple fields with "\t" or use mkString if you prefer not to use an additional library. On your RDD of tuples you could do something like:
.map { x => x.productIterator.mkString("\t") }.saveAsTextFile("path-to-store")
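A rough PySpark equivalent of the Scala snippet above, assuming an RDD of tuples (rdd and the path are placeholders):

# str() guards against non-string fields; drop it if every field is already a string
rdd.map(lambda row: '\t'.join(str(f) for f in row)).saveAsTextFile('path-to-store')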
Created 07-14-2016 04:04 PM
Essentially I have a Python tuple ('a','b','c','x','y','z') where all elements are strings. I could just map each tuple into a single concatenation ('a\tb\tc\tx\ty\tz') and then saveAsTextFile(path). But I was wondering if there was a better way than using an external package that may just be encapsulating that .map(lambda x: "\t".join(x)).
Created 07-14-2016 06:48 PM
If the data set does not contain a '\t' character, then '\t'.join and saveAsTextFile should work for you. Otherwise, you just need to wrap the strings in double quotes, as with normal CSVs.
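For the quoting case, Python's built-in csv module can handle the escaping; a minimal sketch along those lines (the helper name is illustrative, not from this thread):

import csv
import io

def to_tsv_line(row):
    # csv quotes any field that contains the delimiter or quote characters
    buf = io.StringIO()
    csv.writer(buf, delimiter='\t', quoting=csv.QUOTE_MINIMAL).writerow(row)
    return buf.getvalue().rstrip('\r\n')

rdd.map(to_tsv_line).saveAsTextFile('path-to-store')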
Created 07-18-2016 03:42 PM
Try this; note that it requires Spark version 1.5 and up:
data.write.format('com.databricks.spark.csv').options(delimiter="\t", codec="org.apache.hadoop.io.compress.GzipCodec").save('s3a://myBucket/myPath')
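To sanity-check the output, the same package can read the tab-delimited, gzip-compressed files back; a sketch assuming a Spark 1.5-style sqlContext and the same placeholder bucket/path as above:

# Compressed part files are decompressed transparently on read
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .option('delimiter', '\t') \
    .load('s3a://myBucket/myPath')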
