About vadivel_samband

vadivel_samband · 07-20-2017

@Bala Vignesh N V The issue is that first() returns a string, not an RDD, and subtract() only works between two RDDs. So you should convert tagsheader to an RDD using parallelize:

    tags = sc.textFile("hdfs:///data/spark/genome-tags.csv")
    tagsheader = tags.first()
    header = sc.parallelize([tagsheader])
    tagsdata = tags.subtract(header)
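A possible alternative, as a minimal sketch only: since first() already gives you the header as a string, you can drop it with filter() instead of building a second RDD. The names drop_header and run_example are made up for illustration, and run_example assumes a working PySpark install and the same HDFS path as in the post.

```python
def drop_header(lines):
    """Return the RDD `lines` without any line equal to its first line."""
    header = lines.first()  # first() returns a plain str, not an RDD
    # filter() is a narrow transformation, so it does not shuffle the data
    return lines.filter(lambda line: line != header)


def run_example():
    # Assumption: PySpark is installed and the HDFS path from the post exists.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    tags = sc.textFile("hdfs:///data/spark/genome-tags.csv")
    return drop_header(tags)
```

Like the subtract() version, this also removes any data line that happens to be identical to the header. Unlike subtract(), it keeps the original line order.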

vadivel_samband · 04-26-2016

When loading a file from HDFS into an RDD, how is the data split across partitions? Is there anything like a Hadoop input split?
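For context, a rough sketch rather than a definitive answer: sc.textFile reads through Hadoop's TextInputFormat, so each partition corresponds to a Hadoop input split. The function below approximates how the old-API FileInputFormat sizes splits for a single uncompressed file. The name split_sizes and its parameters are hypothetical, and the real implementation works over the total size of all input files and takes block locations into account.

```python
def split_sizes(total_size, num_splits, block_size, min_size=1):
    """Approximate split sizes (in bytes) for one splittable file."""
    # Target size if the file were cut into the requested number of splits
    goal_size = total_size // max(num_splits, 1)
    # A split never exceeds a block and never goes below the minimum size
    split_size = max(min_size, min(goal_size, block_size))

    sizes = []
    remaining = total_size
    # Hadoop lets the last split run up to 10% over the split size
    while remaining / split_size > 1.1:
        sizes.append(split_size)
        remaining -= split_size
    if remaining > 0:
        sizes.append(remaining)
    return sizes


MB = 1024 * 1024
# A 300 MB file, 128 MB blocks, 2 requested splits
print([s // MB for s in split_sizes(300 * MB, 2, 128 * MB)])
```

Here num_splits plays the role of the minPartitions argument to sc.textFile. On a real RDD you can check the outcome with getNumPartitions().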

Last Visited	01-09-2019 06:45 AM

Member Since	03-31-2016 04:52 AM
Posts	5
Kudos received	3

Cloudera Community

Re: Removing header from CSV file through pyspark

How are splits calculated in Spark?