Member since
09-25-2015
112
Posts
37
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 1104 | 12-21-2016 09:31 AM |
07-23-2019
01:23 AM
Do we have analytical window functions in SAM by any chance? I tried to use analytical window functions in Structured Streaming but had no luck, so I'm wondering if I can solve this using SAM. I looked into it and we can do a stream-stream join, which is what I am looking for. However, I'm not sure if SAM has windowed RANK/LAG functionality? Thank you
... View more
05-14-2019
12:29 PM
Can you please try removing .master("local[*]") from the Spark code and passing the master as a parameter to spark-submit instead: --master yarn --deploy-mode cluster. It should work.
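For reference, a minimal sketch of what I mean (the app name, class, and jar path below are placeholders, not from your code):

```scala
import org.apache.spark.sql.SparkSession

// Build the session without hard-coding a master; let spark-submit supply it, e.g.
//   spark-submit --master yarn --deploy-mode cluster --class com.example.MyJob my-job.jar
val spark = SparkSession.builder()
  .appName("MyJob") // placeholder app name
  .getOrCreate()
```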
... View more
04-06-2018
04:20 AM
@bkosaraju thanks for your reply. So you are suggesting either Hive with LLAP or a Postgres DB (any RDBMS) in the cluster? If you suggest Hive with LLAP: we are using HDP 2.5, so how can I configure Hive with LLAP? Yes, in the batch processing I will be required to insert and update records. We are using Hive 1.2 and hope it supports ACID and updates? Do you think it would help to create indexes on these Hive tables? https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing
... View more
04-05-2018
12:02 PM
Hi All, Currently we are doing batch processing using Spark (1.6), Hive (1.2), and HDP 2.5. While processing batches we need to store information about the batches, e.g. the batch id, start time of the batch, end time of the batch, etc., i.e. a control table. Will it work if I store this data in a Hive table for Spark to read before every batch run, or should I use HBase for it, since it is quick to look up records? Please suggest the best practice.
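For context, a rough sketch of the Hive-table option I have in mind, using the Spark 1.6 HiveContext API (the control table name and columns are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("BatchControl"))
val hiveContext = new HiveContext(sc)

// Hypothetical control table: batch_control(batch_id, start_time, end_time, status)
val lastBatch = hiveContext
  .sql("SELECT batch_id, start_time, end_time FROM batch_control ORDER BY batch_id DESC LIMIT 1")
  .collect()
  .headOption

// lastBatch would then drive where the next batch picks up from.
```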
... View more
Labels:
- Labels:
-
Apache HBase
-
Apache Hive
-
Apache Spark
02-21-2018
11:00 AM
@Josh Elser @Harald Berghoff If the JIRA patch has been pushed, then I believe we can use this Cloudera connector and do not need to rewrite the HBase salting and Spark partitioner from the post below, is that right? We can just use the connector API instead, can't we? http://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/ Also, we have another connector from Hortonworks; I'm not sure if it supports bulk loading. The previous one is from Cloudera and is RDD-based, whereas the one below supports the DataFrame API: https://github.com/hortonworks-spark/shc It would be good if this connector supported DataFrame bulk loading. It supports the standard write, but I'm not sure about bulk load: https://issues.apache.org/jira/browse/HBASE-15336
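For context, the standard DataFrame write with shc looks roughly like this as far as I understand from the project README (the catalog, table name, and columns below are made up, and this is the normal write path, not an HFile bulk load):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

val spark = SparkSession.builder().appName("ShcWriteSketch").getOrCreate()
import spark.implicits._

// Made-up mapping of a two-column DataFrame onto HBase table "mytable", column family "cf".
val catalog =
  s"""{
     |  "table":{"namespace":"default", "name":"mytable"},
     |  "rowkey":"key",
     |  "columns":{
     |    "key":{"cf":"rowkey", "col":"key", "type":"string"},
     |    "value":{"cf":"cf", "col":"value", "type":"string"}
     |  }
     |}""".stripMargin

val df = Seq(("row1", "v1"), ("row2", "v2")).toDF("key", "value")

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```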
... View more
02-21-2018
09:56 AM
@Pavan Kumar Konda It depends on a lot of constraints, such as compression, serialization, and whether the storage format is splittable. I think ORC is just for Hive.

Avro
- Avro is a language-neutral data serialization format. Writables have the drawback that they do not provide language portability; Avro-formatted data can be described through a language-independent schema, so it can be shared across applications written in different languages.
- Avro stores the schema in the file header, so the data is self-describing.
- Avro files are splittable and compressible, which makes Avro a good candidate for data storage in the Hadoop ecosystem.
- Schema evolution: the schema used to read an Avro file need not be the same as the schema that was used to write it, which makes it possible to add new fields.

Parquet
- Parquet is a columnar format.
- Columnar formats work well when only a few columns are needed in a query/analysis: only the required columns are fetched/read, which reduces disk I/O.
- Parquet is well suited for data-warehouse-style workloads where aggregations are required on certain columns over a huge set of data.
- Parquet provides very good compression, up to 75%, when used with compression formats like Snappy.
- Parquet can be read and written using the Avro API and an Avro schema.
- It also provides predicate pushdown, further reducing disk I/O.
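For illustration, a minimal Spark sketch of the Parquet points above (paths and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetSketch").getOrCreate()

// Example input; the path is a placeholder.
val df = spark.read.json("/data/events")

// Write Parquet with Snappy compression (Snappy is also Spark's default Parquet codec).
df.write.option("compression", "snappy").parquet("/data/events_parquet")

// Read back only the columns we need: Parquet fetches just those column chunks,
// and the simple filter can be pushed down to skip row groups.
spark.read.parquet("/data/events_parquet")
  .select("eventId", "eventTime")
  .filter("eventTime > '2018-01-01'")
  .show()
```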
... View more
02-21-2018
09:44 AM
@Harald Berghoff & @Josh Elser thanks for your reply. OK, I will try to use HFiles according to the link, which I already knew about. However, that link says the Spark-HBase connector will support bulk loading in the future. Now that the JIRA says it is resolved, does that mean we can use the Spark-HBase connector? https://issues.apache.org/jira/browse/HBASE-14150
... View more
02-19-2018
02:22 PM
I am looking to read really large, petabyte-scale Hive tables into Spark. Which is the better option in terms of performance: reading directly from the Hive table using a SELECT statement, or giving Spark the HDFS path to the ORC files backing the Hive table? Looking for best practice and better performance as well. I am also looking to save petabyte-scale data to HBase tables. Hortonworks has a Spark-HBase connector, but I'm not sure how it handles petabyte volumes of data and what the performance is like. Shall I consider using HFiles or the Spark-HBase connector? Any suggestions will be really appreciated... thank you
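To make the two read options concrete, a rough sketch of both (the database, table, and warehouse path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadHiveOrOrc").enableHiveSupport().getOrCreate()

// Option 1: go through the Hive metastore with a SELECT.
val viaHive = spark.sql("SELECT * FROM mydb.big_table WHERE ds = '2018-02-19'")

// Option 2: point Spark directly at the ORC files backing the table's partition.
val viaOrc = spark.read.orc("/apps/hive/warehouse/mydb.db/big_table/ds=2018-02-19")
```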
... View more
Labels:
- Labels:
-
Apache HBase
-
Apache Hive
-
Apache Spark
02-15-2018
05:01 PM
At a high level I know the difference between Kafka Streams and Structured Streaming. However, which is currently better to use in production?
... View more
Labels:
- Labels:
-
Apache Kafka
02-23-2017
05:14 AM
@mqureshi yes, I tried making it transient as well, no luck. Inside the .map, map(path => sc.textFile(path)) is not reading the contents of the file; I think it's just returning the string, which is strange, because if I call sc.textFile outside the map function it returns the data. Above is the whole code, nothing else to it. Looks simple. Probably, rather than calling sc.textFile inside map, I need to fetch the data using the S3 API as shown in this link: http://michaelryanbell.com/processing-whole-files-spark-s3.html but I'm not sure; I might still get the task serialization error. Any ideas to better implement this requirement?
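For reference, the simpler workaround I'm considering: since SparkContext is not serializable and nested RDDs aren't allowed, sc.textFile can't be called inside map, but it does accept a comma-separated list of paths and decompresses .gz files automatically (the paths below are placeholders):

```scala
// sc is the existing SparkContext
val splitFilesPathList = List("s3n://bucket/file1.gz", "s3n://bucket/file2.gz") // placeholders
val lineRDD = sc.textFile(splitFilesPathList.mkString(","))
lineRDD.take(10).foreach(println)
```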
... View more
02-22-2017
06:24 PM
I am trying to read the gzip files in a dir in parallel. I followed the steps advised by Matei in the following link http://apache-spark-user-list.1001560.n3.nabble.com/Strategies-for-reading-large-numbers-of-files-td15644.html but I get the exception below. It looks like other people got the same exception as well. I just wanted to know if it's possible to achieve this in Spark 2.1.0. I am running on a local VM at the moment using a simple two-line piece of code: val splitFilesPathList = List("s3n://pathtos3file1", "file2", "file3") etc... val lineRDD = sc.parallelize(splitFilesPathList, 4).map(path => sc.textFile(path)).take(10).toList.foreach(println) The above two lines of code don't work. Any help is really appreciated, please. I have checked for closures and moved the code into a new Scala class which extends Serializable, but I still get the "task not serializable" exception. I tried almost all possible ways. I checked this as well: https://forums.databricks.com/questions/369/how-do-i-handle-a-task-not-serializable-exception.html
... View more
Labels:
- Labels:
-
Apache Spark
02-09-2017
09:03 PM
Hi Tibor, thanks for your reply. I have looked at the above link. I have a dataset (a .gz file which I am reading in Spark) with records like:
abcdefghij; abc=1234 xyz=987 abn=567 ubg=345
After the pivot I want:
abcdefghij abn ubg abc xyz
abcdefghij 567 987 1234 987
and so on. All the above columns are string columns and the abn values are duplicated. Since they are strings and I am just looking to split the data, I don't need aggregation, I just need the pivot. Any ideas how to achieve this in Spark Scala? Thank you
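For what it's worth, the workaround I've seen suggested is to pivot with first() as a pass-through aggregate, since each (row, key) pair here has a single value anyway; a small sketch using the sample data above (column names are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().appName("PivotSketch").getOrCreate()
import spark.implicits._

// Long format: one row per (id, key, value) triple, as parsed from the .gz file.
val longDf = Seq(
  ("abcdefghij", "abc", "1234"),
  ("abcdefghij", "xyz", "987"),
  ("abcdefghij", "abn", "567"),
  ("abcdefghij", "ubg", "345")
).toDF("id", "key", "value")

// pivot still requires an aggregate; first() just passes the single value through.
val wideDf = longDf.groupBy("id").pivot("key").agg(first("value"))
wideDf.show()
```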
... View more
02-09-2017
05:06 PM
I am looking to perform a Spark pivot without aggregation. Is it really possible to use the Spark 2.1.0 API and generate a pivot without aggregation? I tried, but since pivot without an aggregation just returns a grouped dataset, the API is not working for me. Any ideas how to convert it to a DF or Dataset without performing an aggregation and show the DF, please? Thank you
... View more
Labels:
- Labels:
-
Apache Spark
02-02-2017
05:59 PM
1 Kudo
Just wondering if Spark supports reading *.gz files from an S3 bucket or dir as a DataFrame or Dataset. I think we can read them as an RDD, but it's still not working for me. Any help would be appreciated. Thank you. I am using s3n://... but Spark throws an invalid input path exception. val df = spark.sparkContext.textFile("s3n://..../*.gz") doesn't work for me 😞 I'd prefer to read the S3 dir of .gz files as a DF or Dataset if possible, else at least as an RDD please. Thank you
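Roughly what I'm trying (the bucket path is a placeholder): in Spark 2.x, spark.read.text should return a DataFrame and handle the .gz decompression, assuming the s3n filesystem jars and credentials are configured:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ReadGz").getOrCreate()

// Each line of every matched .gz file becomes a row in a single "value" column.
val df = spark.read.text("s3n://my-bucket/some/prefix/*.gz")
df.show(5, truncate = false)
```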
... View more
Labels:
- Labels:
-
Apache Spark
12-21-2016
09:31 AM
@Mridul M Thanks for the reply. Actually, I was using an old version of spark-testing-base and didn't know there was a 2.0.2 version. Holden Karau pointed me to Maven Central, where I found the latest version of spark-testing-base. The GitHub README of spark-testing-base mentioned 1.6, so I assumed the latest version was 1.6. But now it's sorted, and Spark 2.0.2 DataFrame testing works for me. In case people need this: it also requires the Hive dependency. This is still new and I couldn't find forums on Spark 2.0 testing, so I am posting this; it might save time for other developers 🙂
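In case it helps others, the sbt dependencies that worked for me looked roughly like this (please check Maven Central for the current spark-testing-base version; the one below is only an example):

```scala
// build.sbt (excerpt)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % "2.0.2"       % "provided",
  "org.apache.spark" %% "spark-hive"         % "2.0.2"       % "test",  // needed for DataFrame tests
  "com.holdenkarau"  %% "spark-testing-base" % "2.0.2_0.4.7" % "test"   // example version
)
```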
... View more
12-20-2016
11:43 AM
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/spark/Logging
java.lang.NoClassDefFoundError: org/apache/spark/Logging https://community.hortonworks.com/questions/58286/noclassdeffounderror-orgapachesparklogging-using-s.html I looked at the above question; I think it applies to Spark versions prior to 2.0, but I am on 2.0.2 🙂 I tried adding dependencies in sbt for Spark Streaming, Twitter, and many more, but I still get this error for Spark 2.0.2. Any ideas how to get this resolved? Thank you
... View more
Labels:
- Labels:
-
Apache Spark
12-08-2016
10:34 AM
@Bernhard Walter thanks for the reply.
... View more
12-06-2016
05:33 PM
1 Kudo
Just wondering: if we write an inner join query in plain SQL, or if we use the DataFrame API to perform the join, do we get the same performance? I can see from the diagram that both the query and the DataFrame are pushed to the Catalyst optimizer, but I just wanted to confirm: if I write plain SQL queries, can I use them for big data production use cases, or shall I use DataFrames for better performance? Thank you
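One way to convince yourself is to compare the physical plans, since both paths go through Catalyst (the tables and columns below are made up and assumed to already be registered):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PlanCompare").enableHiveSupport().getOrCreate()

// Both should produce essentially the same optimized physical plan.
val viaSql = spark.sql("SELECT a.id, b.name FROM a JOIN b ON a.id = b.id")
val viaDf  = spark.table("a").join(spark.table("b"), "id").select("id", "name")

viaSql.explain(true)
viaDf.explain(true)
```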
... View more
Labels:
- Labels:
-
Apache Spark
08-31-2016
09:37 AM
1 Kudo
Is there any way to monitor the throughput of NiFi, i.e. how many messages it has processed per second? Apart from the NiFi UI and Ambari, is there a way to monitor NiFi's throughput, for example via logs? I can't find any details regarding throughput in the NiFi logs or in any of the NiFi repositories (content, flowfile, db) either.
... View more
Labels:
- Labels:
-
Apache Ambari
-
Apache NiFi
08-23-2016
11:04 AM
I am new to alerting & monitoring. If we want to set up alerts for NiFi using Splunk, can we use the PutSplunk NiFi processor, or should we send the log files directly to Splunk? Currently our applications that use Splunk send their log files directly to Splunk for alerting. Which is the more effective way to achieve monitoring and alerting for NiFi using Splunk? Thank you
... View more
Labels:
- Labels:
-
Apache NiFi
08-16-2016
10:22 AM
1 Kudo
What is the best practice to productionize NiFi and schedule it to run every hour through a script? How do we upgrade NiFi to a new version? How do we deploy new workflows and modified workflows into production in EC2? How do we deploy changes to NiFi workflows? Thank you
... View more
Labels:
- Labels:
-
Apache NiFi
08-16-2016
10:13 AM
I think Concurrent Tasks set to 12 gives 12 threads for this GetKafka processor, i.e. 12 consumers in one consumer group? Please correct me if I am wrong. So if I have 12 partitions in my Kafka topic, then I believe 12 consumers are consuming from the 12 partitions? I have a 3-node NiFi cluster and set concurrent tasks to 4 so that I can split the load between all the NiFi nodes in the cluster. I think each node will be consuming from 4 partitions. What happens to the 4 partitions of a node if it dies or crashes? Normally there would be a rebalance and Kafka would reassign the partitions to consumers, so that the 2 remaining nodes consume from 6 partitions each? Is this how it works in NiFi? What will happen to the messages or offsets already in the queue of a node that has died? Is there any way to store offsets in NiFi to achieve fault tolerance?
... View more
Labels:
- Labels:
-
Apache Kafka
08-11-2016
09:18 AM
1 Kudo
I mean, in which logs are the NiFi warnings and errors stored?
... View more
Labels:
- Labels:
-
Apache NiFi
08-10-2016
08:03 AM
@Bryan Bende Thanks for the answer, it did work for me. There's just a small config I am looking for. Currently, when I merge my JSON events and export them to S3, I get the concatenated JSON events delimited by a space on a single line. How can I get the JSON events delimited by a newline (\n) instead? Thank you.
... View more
08-09-2016
04:03 PM
The current workflow is exporting each event individually. We are looking to merge all JSON events based on service/event name, concatenate them over time, and export them to S3. Our requirement is to merge them using Expression Language at runtime.
... View more
07-26-2016
12:54 PM
@Simon Elliston Ball thanks for the answer 🙂 If we install multiple ZK instances, does that mean we need to embed them with NiFi on the same nodes? If so, how do we sync ZK state across the different slave nodes where the multiple ZK instances are installed? Thank you.
... View more
07-26-2016
11:15 AM
@Simon Elliston Ball Does NiFi store offsets in memory as well? Just wondering how it will make sure not to read the same offsets or duplicate messages from Kafka? Thanks
... View more
07-26-2016
11:13 AM
Currently we are looking to use NiFi to get JSON events from Kafka and store them in S3. We are looking to set up a NiFi cluster, and I have the following questions, please:
1. What happens to NiFi when I do a rolling restart of Kafka? I think NiFi listens to ZooKeeper (:2181) or the Kafka nodes; how do we configure this in the GetKafka processor?
2. What does NiFi do if the peered connection becomes unavailable?
3. How does it handle a Kafka node going offline?
4. Does it always connect to the same Kafka node, or what determines which node to connect to?
5. Does it connect by IP or DNS (Kafka IPs can change)? Can we make NiFi connect to the Kafka DNS rather than the IP?
6. What will it do if a leadership election is invoked while it is consuming?
7. What monitoring and alerting will be in place in production?
8. Are there any issues if ZooKeeper is embedded in the slave nodes, and how does ZooKeeper maintain state integrity if we have more than one ZooKeeper instance in the NiFi cluster? Or is it better to have a separate single ZK instance in the cluster?
Thanks
... View more
Labels:
- Labels:
-
Apache Kafka
-
Apache NiFi
07-25-2016
03:08 PM
@Shishir Saxena Thanks for the answer. How is error handling built into NiFi flows? What are the content & flowfile repositories, i.e. which data does NiFi store in these repositories? Thank you
... View more