Member since: 03-21-2017
Posts: 18
Kudos Received: 2
Solutions: 0
06-25-2020 02:48 AM
I will check our Spark 2.4.5 application code for compatibility with Spark 2.3.2. Are Ambari and HDP going to be discontinued in the near future as part of the Cloudera and Hortonworks merger? We need to plan our choice of software accordingly.
08-13-2017 02:39 PM
Use mapPartitions if you want to add the header to every output file (or when there is only a single partition):

topPriceResultsDF
  .rdd  // drop to the RDD API so saveAsTextFile is available (needed on Spark 2.x, where map on a DataFrame returns a Dataset)
  .map(x => x.mkString(","))
  .mapPartitions(iter => Iterator(header) ++ iter)
  .saveAsTextFile("/user/sparkuser/myspark/data/output/yahoo_above40resultsWithHeader.csv")

Use mapPartitionsWithIndex if you want to add the header only to the first file:

topPriceResultsDF
  .rdd
  .map(x => x.mkString(","))
  .repartition(2)
  .mapPartitionsWithIndex {
    case (0, iter) => Iterator(header) ++ iter  // partition 0 (part-00000) gets the header prepended
    case (_, iter) => iter                      // all other partitions are written unchanged
  }
  .saveAsTextFile("/user/sparkuser/myspark/data/output/yahoo_above40resultsWithHeader.csv")
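
Note that `header` is not defined in the snippets above. A minimal sketch of one way to build it, assuming topPriceResultsDF is a DataFrame and its column names should form the header row (the definition itself is an assumption, not part of the original post):

// Assumption: derive the header line from the DataFrame's column names
val header = topPriceResultsDF.columns.mkString(",")

Also keep in mind that saveAsTextFile writes a directory of part files, so the ".csv" path above names a directory rather than a single file.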
03-30-2017 12:28 PM
3 Kudos
Unfortunately the dataset is not in a simple field-delimited format, i.e. one where each line is a record consisting of fields separated by a delimiter such as a comma, pipe, or tab. If it were, you could define the delimiter on LOAD with USING PigStorage('delim'), where delim would be an actual delimiter like , or | or \t.

The Million Song Dataset is instead stored in HDF5, a complex hierarchical format holding both metadata and field data. See https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/FileSchema.pdf

You need to use a wrapper API to work with it:
https://labrosa.ee.columbia.edu/millionsong/pages/hdf-what
https://support.hdfgroup.org/downloads/

In your case, you would need to use the wrapper API to iterate the data and write it out in a delimited format; then you could load it into Pig as described above. Beyond the links above, the dataset's FAQ is generally useful: https://labrosa.ee.columbia.edu/millionsong/faq
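
As a minimal sketch of the LOAD described above once the data has been flattened to CSV (the file path, field names, and types here are assumptions for illustration; the PigStorage delimiter clause is the part that matters):

-- Hypothetical comma-delimited export of the song data
songs = LOAD '/user/me/songs.csv' USING PigStorage(',')
        AS (track_id:chararray, title:chararray, duration:double);
DUMP songs;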
10-11-2017 12:50 AM
@Jay SenSharma, @Shalini Goel - does this change have any impact on the functioning of Atlas in the cluster? As per the HDP documentation, we need to have the following set: hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.ATSHook,org.apache.atlas.hive.hook.HiveHook
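
For reference, the same setting expressed as a hive-site.xml entry (a sketch of the exact property quoted above; in an Ambari-managed cluster you would normally set this under the Hive service configs rather than editing the file by hand):

<property>
  <!-- Hooks run after each query: ATSHook for the Timeline Server, HiveHook for Atlas lineage -->
  <name>hive.exec.post.hooks</name>
  <value>org.apache.hadoop.hive.ql.hooks.ATSHook,org.apache.atlas.hive.hook.HiveHook</value>
</property>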