Member since: 05-17-2016
Posts: 190
Kudos Received: 46
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1387 | 09-07-2017 06:24 PM
 | 1790 | 02-24-2017 06:33 AM
 | 2575 | 02-10-2017 09:18 PM
 | 7066 | 01-11-2017 08:55 PM
 | 4707 | 12-15-2016 06:16 PM
07-15-2016
06:10 PM
Could you please post a little more information on the job, the submit command, etc.? What is your data source?
07-14-2016
06:48 PM
I guess if the data set does not contain a '\t' character, then '\t'.join and saveAsTextFile should work for you. Otherwise, you just need to wrap the strings in double quotes, as with normal CSVs.
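A minimal PySpark sketch of both cases (the sample data and output paths here are placeholders, not from the original thread):

```python
# Placeholder data; assumes an existing SparkContext `sc`.
rdd = sc.parallelize([("a", "b"), ("c", "d")])

# Case 1: no tabs in the data, so join the fields directly.
rdd.map(lambda fields: "\t".join(fields)).saveAsTextFile("/tmp/out-tsv")

# Case 2: fields may contain tabs, so quote them CSV-style first.
def quote(field):
    return '"' + field.replace('"', '""') + '"'

rdd.map(lambda fields: "\t".join(quote(f) for f in fields)) \
   .saveAsTextFile("/tmp/out-tsv-quoted")
```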
07-14-2016
02:23 PM
Could you provide more details on the RDD that you would like to save tab-delimited? On the question about storing DataFrames as a tab-delimited file, below is what I have in Scala using the spark-csv package: df.write.format("com.databricks.spark.csv").option("delimiter", "\t").save("output path")
EDIT: With the RDD of tuples, as you mentioned, you could either join the tuple fields with "\t" or use mkString if you prefer not to use an additional library. On your RDD of tuples you could do something like .map { x => x.productIterator.mkString("\t") }.saveAsTextFile("path-to-store")
@Don Jernigan
07-13-2016
08:58 PM
Is your RDD an RDD of strings? On the second part of the question, if you are using spark-csv, the package supports saving simple (non-nested) DataFrames. There is an option to specify the delimiter, which is ',' by default but can be changed, e.g. .save('filename.csv', 'com.databricks.spark.csv', delimiter="DELIM")
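For reference, a hedged PySpark sketch of the same save using the Spark 1.4+ writer API (the DataFrame contents, delimiter, and output path are placeholders; spark-csv must be on the classpath, e.g. via --packages com.databricks:spark-csv_2.10:1.5.0):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-save-demo")
sqlContext = SQLContext(sc)

# Placeholder DataFrame; replace with your own.
df = sqlContext.createDataFrame([("a", 1), ("b", 2)], ["name", "count"])

# Save with a custom delimiter via spark-csv (default is ',').
df.write.format("com.databricks.spark.csv") \
    .option("delimiter", "\t") \
    .save("/tmp/out-tsv")
```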
06-08-2016
06:56 PM
1 Kudo
The difference is noticeable only when we run in cluster mode without actually knowing where the driver is. In the other case, if we know where the driver is set to launch, both methods behave similarly.
--files is a submit-time parameter; main() can run anywhere and only needs to know the file name. In code, I can refer to the file by a file:// URI.
In the case of addFile(), since this is a code-level setting, main() needs to know the file location in order to perform the add. As per the API doc, the path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS, or FTP URI.
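A hedged PySpark sketch of the two approaches (the file names and paths are made up for illustration):

```python
# Submit-time approach: ship the file with the job, e.g.
#   spark-submit --files /local/path/app.conf my_job.py
# main() only needs the base name of the file.

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="files-demo")

# Code-level approach: main() must know where the file lives.
# A local path, an HDFS path, or an HTTP/HTTPS/FTP URI all work.
sc.addFile("hdfs:///config/app.conf")

# In both cases the shipped copy can be resolved by its base name:
with open(SparkFiles.get("app.conf")) as f:
    print(f.read())
```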
06-08-2016
03:05 PM
@Benjamin Leonhardi Thanks for pointing this out. I overlooked this flag.
06-08-2016
02:08 PM
@Rajkumar Singh, doesn't the application.properties file need to be in a key-value format?
06-08-2016
01:58 PM
Thanks @Jitendra Yadav. I will take a look at the addFile API. I would like to try getting control of the driver, as clukasik pointed out.
06-08-2016
01:55 PM
@clukasik, thank you. I have had a look at broadcast variables, but I guess with the current requirement I just need the RDD.