Member since: 09-25-2015
112 Posts
37 Kudos Received
1 Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 1985 | 12-21-2016 09:31 AM |
05-14-2019
12:29 PM
Can you please try removing the master("local[*]") from the Spark code and pass it as a parameter to spark-submit instead: --master yarn --deploy-mode cluster. It should work.
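A minimal sketch of the idea (the app name, class, and jar below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// No .master("local[*]") hard-coded here; the master is supplied by spark-submit.
val spark = SparkSession.builder()
  .appName("my-app")
  .getOrCreate()

// Submitted with something like:
//   spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
```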
02-21-2018
09:56 AM
@Pavan Kumar Konda It depends on a lot of constraints like compression, serialization, and whether the storage format is splittable. I think ORC is mainly meant for Hive.

Avro

Avro is a language-neutral data serialization format. Writables have the drawback that they do not provide language portability. Avro-formatted data can be described through a language-independent schema, so it can be shared across applications written in different languages. Avro stores the schema in the file header, so the data is self-describing. Avro-formatted files are splittable and compressible, which makes Avro a good candidate for data storage in the Hadoop ecosystem. Schema evolution: the schema used to read an Avro file need not be the same as the schema that was used to write it, which makes it possible to add new fields.

Parquet

Parquet is a columnar format. Columnar formats work well when only a few columns are required in a query or analysis, because only the required columns are fetched/read, which reduces disk I/O. Parquet is well suited for data-warehouse-style solutions where aggregations are required on certain columns over a huge set of data. Parquet gives very good compression, up to around 75%, when used with compression codecs like Snappy. Parquet can be read and written using the Avro API and an Avro schema. It also provides predicate pushdown, which reduces disk I/O further.
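A minimal Spark Scala sketch of the Parquet points above (the path and column names are just placeholders): only the selected column is read from disk, and the filter can be pushed down to the Parquet reader.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-example").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Columnar storage plus Snappy usually compresses well.
df.write.option("compression", "snappy").parquet("/tmp/people_parquet")

// Column pruning: only "name" is read; predicate pushdown: the age filter
// can be evaluated by the Parquet reader, reducing disk I/O.
spark.read.parquet("/tmp/people_parquet")
  .filter($"age" > 26)
  .select("name")
  .show()
```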
02-09-2017
09:03 PM
Hi Tibor, thanks for your reply. I have looked at the above link. I have a dataset of the following structure, i.e. a .gz file which I am reading in Spark: abcdefghij; abc=1234 xyz=987 abn=567 ubg=345. After the pivot it should look like a header row (abcdefghij, abn, ubg, abc, xyz) followed by a data row (abcdefghij, 567, 987, 1234, 987), and so on. All the above columns are string columns and the abn values are duplicated. Since they are strings and I am just looking to split the data, I don't need aggregation; I just need the pivot. Any ideas on how to achieve this in Spark Scala? Thank you.
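For what it's worth, a minimal sketch (Spark 2.x Scala, hypothetical column names) of one way to pivot string key/value pairs when no real aggregation is needed: since each (id, key) pair has a single value, first() acts as a pass-through aggregate, which is what pivot requires after groupBy.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("pivot-example").getOrCreate()
import spark.implicits._

// Long format: one row per (id, key, value).
val longDf = Seq(
  ("abcdefghij", "abc", "1234"),
  ("abcdefghij", "xyz", "987"),
  ("abcdefghij", "abn", "567"),
  ("abcdefghij", "ubg", "345")
).toDF("id", "key", "value")

// pivot needs groupBy plus an aggregate; first() just picks the single value per cell.
val wide = longDf.groupBy("id").pivot("key").agg(first("value"))
wide.show()
```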
02-09-2017
05:06 PM
I am looking to perform a Spark pivot without aggregation. Is it really possible to use the Spark 2.1.0 API and generate a pivot without aggregation? I tried, but since pivot returns a grouped dataset, the API is not working for me without an aggregation. Any ideas on how to convert it to a DataFrame or Dataset without performing aggregation and show the DF? Thank you.
Labels:
- Apache Spark
02-02-2017
05:59 PM
1 Kudo
Just wondering if Spark supports reading *.gz files from an S3 bucket or directory as a DataFrame or Dataset. I think we can read them as an RDD, but it's still not working for me. Any help would be appreciated. Thank you. I am using s3n://... but Spark throws an invalid input path exception. val df = spark.sparkContext.textFile("s3n://..../*.gz") doesn't work for me 😞 I would prefer to read the S3 directory of .gz files as a DataFrame or Dataset if possible, or at least as an RDD. Thank you.
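For reference, a minimal sketch of reading gzipped text from S3 as a Dataset (the bucket path and credential settings are placeholders, and it assumes the hadoop-aws/s3a connector is on the classpath; s3a is generally preferred over s3n on recent Hadoop versions). Spark decompresses .gz files transparently based on the extension, though each .gz file becomes a single non-splittable partition.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-gz-example").getOrCreate()

// Placeholder credentials; in practice prefer IAM roles or core-site.xml.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

// Dataset[String] with one row per line across all matching .gz files.
val lines = spark.read.textFile("s3a://my-bucket/some/dir/*.gz")
lines.show(5, truncate = false)
```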
Labels:
- Apache Spark
12-21-2016
09:31 AM
@Mridul M Thanks for the reply. Actually, I was using an old version of spark-testing-base and didn't know there was a 2.0.2 version. Holden Karau pointed me to Maven Central, where I found the latest version of spark-testing-base. The GitHub README of spark-testing-base mentioned 1.6, so I assumed the latest version was 1.6. But now it's sorted, and Spark 2.0.2 DataFrame testing works for me. In case people need this: it also requires the Hive dependency. Spark 2.0 testing is still new and I couldn't find forums on it, so I am posting this here; it might save time for other developers 🙂
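In case it saves someone time, a minimal build.sbt sketch of the setup described above (the exact spark-testing-base version string is an assumption; check Maven Central for the release matching your Spark version):

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % "2.0.2"       % "provided",
  // Hive classes are needed on the test classpath for DataFrame testing.
  "org.apache.spark" %% "spark-hive"         % "2.0.2"       % "test",
  "com.holdenkarau"  %% "spark-testing-base" % "2.0.2_0.4.7" % "test"
)

// spark-testing-base recommends disabling parallel test execution.
parallelExecution in Test := false
```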
12-20-2016
11:43 AM
A needed class was not found. This could be due to an error in your runpath. Missing class: org/apache/spark/Logging
java.lang.NoClassDefFoundError: org/apache/spark/Logging https://community.hortonworks.com/questions/58286/noclassdeffounderror-orgapachesparklogging-using-s.html I looked at the above question; I think it is valid for Spark prior to 2.0, but I am looking at 2.0.2 🙂 I tried adding dependencies in sbt for Spark Streaming, Twitter, and many more. I still get this error for Spark 2.0.2. Any ideas on how to get this resolved? Thank you.
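For context, org.apache.spark.Logging was removed in Spark 2.0, so this error usually means some dependency (often an older streaming connector built against Spark 1.x) is still on the classpath. A minimal build.sbt sketch of aligning everything on Spark 2.x (the Bahir version below is an assumption; check Maven Central), since the Twitter connector moved to Apache Bahir for Spark 2.x:

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-streaming"         % "2.0.2" % "provided",
  // The Twitter connector for Spark 2.x lives in Apache Bahir; the old
  // org.apache.spark:spark-streaming-twitter artifact (1.x only) pulls in
  // Spark 1.x classes such as org.apache.spark.Logging.
  "org.apache.bahir"  %% "spark-streaming-twitter" % "2.0.2"
)
```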
Labels:
- Apache Spark
12-08-2016
10:34 AM
@Bernhard Walter thanks for the reply.
12-06-2016
05:33 PM
1 Kudo
Just wondering: if we write an inner join query in plain SQL, or if we use the DataFrame API to perform the join, do we get the same performance? I can see from the diagram that both the query and the DataFrame are pushed to the Catalyst optimizer, but I just wanted to confirm: if I write plain SQL queries, can I use them for big data production use cases, or should I use DataFrames for better performance? Thank you.
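A minimal sketch (hypothetical table and column names) of checking this directly: both forms go through the Catalyst optimizer, and comparing the physical plans with explain() shows they end up effectively the same.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-vs-df-join").getOrCreate()
import spark.implicits._

val orders    = Seq((1, "o1"), (2, "o2")).toDF("cust_id", "order_id")
val customers = Seq((1, "alice"), (2, "bob")).toDF("cust_id", "name")
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

// Plain SQL join.
val sqlJoin = spark.sql(
  "SELECT c.name, o.order_id FROM customers c JOIN orders o ON c.cust_id = o.cust_id")

// Equivalent DataFrame API join.
val dfJoin = customers.join(orders, "cust_id").select("name", "order_id")

sqlJoin.explain()   // both physical plans should be effectively identical
dfJoin.explain()
```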
Labels:
- Apache Spark
08-31-2016
09:37 AM
1 Kudo
Is there any way to monitor the throughput of NiFi, i.e. how many messages it has processed per second? Apart from the NiFi UI and Ambari, is there a way to monitor NiFi's throughput, for example via logs? I can't find any details regarding throughput in the NiFi logs or in any of the NiFi repositories (content, flowfile, db) either.
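One option outside the UI and Ambari is polling NiFi's REST API and deriving throughput from the rolling 5-minute stats. A minimal Scala sketch (the host/port are placeholders, the endpoint path is based on the NiFi 1.x REST API, and it assumes anonymous access; verify both against your NiFi version):

```scala
import scala.io.Source

// Aggregate status (flowFilesIn, bytesIn, flowFilesOut, ... over a rolling
// 5-minute window) for the root process group, returned as JSON.
val url  = "http://nifi-host:8080/nifi-api/flow/process-groups/root/status"
val json = Source.fromURL(url).mkString
println(json)
```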
Labels:
- Apache Ambari
- Apache NiFi