Member since: 03-21-2017
Posts: 18
Kudos Received: 2
Solutions: 0
06-25-2020 02:48 AM
I will check our Spark 2.4.5 application code for compatibility with Spark 2.3.2. Are Ambari and HDP going to be discontinued in the near future as part of the Cloudera and Hortonworks merger? We need to plan our choice of software accordingly.
08-13-2017 02:39 PM
Use mapPartitions if you want to add the header to every output file (or when there is only a single partition):

topPriceResultsDF
  .rdd  // drop to the RDD API so saveAsTextFile is available (needed on Spark 2.x, where map on a DataFrame returns a Dataset)
  .map(x => x.mkString(","))
  .mapPartitions(iter => Iterator(header) ++ iter)
  .saveAsTextFile("/user/sparkuser/myspark/data/output/yahoo_above40resultsWithHeader.csv")

Use mapPartitionsWithIndex if you want to add the header only to the first file:

topPriceResultsDF
  .rdd
  .map(x => x.mkString(","))
  .repartition(2)
  .mapPartitionsWithIndex {
    case (0, iter) => Iterator(header) ++ iter  // partition 0 (part-00000) gets the header prepended
    case (_, iter) => iter                      // all other partitions are written unchanged
  }
  .saveAsTextFile("/user/sparkuser/myspark/data/output/yahoo_above40resultsWithHeader.csv")
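
Note that `header` is not defined in the snippets above. A minimal sketch of one way to build it, assuming topPriceResultsDF is a DataFrame and its column names should form the header row (the definition itself is an assumption, not part of the original post):

// Assumption: derive the header line from the DataFrame's column names
val header = topPriceResultsDF.columns.mkString(",")

Also keep in mind that saveAsTextFile writes a directory of part files, so the ".csv" path above names a directory rather than a single file.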
03-30-2017 12:28 PM
3 Kudos
Unfortunately the dataset is not in a simple field-delimited format, i.e. one where each line is a record consisting of fields separated by a delimiter such as a comma, pipe, or tab. If it were, you could define the delimiter on LOAD with USING PigStorage('delim'), where delim would be an actual delimiter like , or | or \t.

The Million Song Dataset is instead stored in HDF5, a complex hierarchical format holding both metadata and field data. See https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/FileSchema.pdf

You need to use a wrapper API to work with it:
https://labrosa.ee.columbia.edu/millionsong/pages/hdf-what
https://support.hdfgroup.org/downloads/

In your case, you would need to use the wrapper API to iterate the data and write it out in a delimited format; then you could load it into Pig as described above. Beyond the links above, the dataset's FAQ is generally useful: https://labrosa.ee.columbia.edu/millionsong/faq
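
As a minimal sketch of the LOAD described above once the data has been flattened to CSV (the file path, field names, and types here are assumptions for illustration; the PigStorage delimiter clause is the part that matters):

-- Hypothetical comma-delimited export of the song data
songs = LOAD '/user/me/songs.csv' USING PigStorage(',')
        AS (track_id:chararray, title:chararray, duration:double);
DUMP songs;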
10-11-2017 12:50 AM
@Jay SenSharma, @Shalini Goel - does this change have any impact on the functioning of Atlas in the cluster? As per the HDP documentation, we need to have the following set: hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.ATSHook,org.apache.atlas.hive.hook.HiveHook
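
For reference, the same setting expressed as a hive-site.xml entry (a sketch of the exact property quoted above; in an Ambari-managed cluster you would normally set this under the Hive service configs rather than editing the file by hand):

<property>
  <!-- Hooks run after each query: ATSHook for the Timeline Server, HiveHook for Atlas lineage -->
  <name>hive.exec.post.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.ATSHook,org.apache.atlas.hive.hook.HiveHook</value>
</property>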