Member since: 03-26-2017
Posts: 61
Kudos Received: 1
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3014 | 08-27-2018 03:19 PM
 | 24976 | 08-27-2018 03:18 PM
 | 9573 | 04-02-2018 01:54 PM
11-27-2018 08:10 AM
Hi @Felix Albani, thanks for your response. I referred to this site, and I'm thinking my issue is related to that story. Cheers, MJ
11-26-2018 10:20 AM
Hi All, I'm trying to write 430,000 records into an Avro file. I'm using the following Avro writer dependency:
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.9.0</version>
</dependency>
The file write completes successfully, but when I try to read the data back from Spark using any Avro-supporting library (Databricks Avro, Spark Avro, or Apache Avro), I get the error below. (One important point: up to 200,000 records I can still read my data without the error.)
18/11/26 15:48:23 ERROR TaskSetManager: Task 3 in stage 0.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
at org.apache.spark.sql.avro.AvroFileFormat$anonfun$buildReader$1$anon$1.hasNext(AvroFileFormat.scala:202)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$11$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:297)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
... 18 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1887)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2831)
at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2830)
at org.apache.spark.sql.Dataset$anonfun$53.apply(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2830)
... 49 elided
Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
at org.apache.spark.sql.avro.AvroFileFormat$anonfun$buildReader$1$anon$1.hasNext(AvroFileFormat.scala:202)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$11$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:297)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
... 18 more
Even from Hive I can run a SELECT * FROM statement, but I cannot perform a COUNT(*) operation. Help me out on this issue. Cheers, MJ (+91) - 9688 514 443
Labels: Apache Spark
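For context, a minimal sketch of the read side (the file path, object name, and session settings are hypothetical) using the spark-avro data source from the stack trace above. A full count() must scan every block of the Avro container file, which plausibly explains why a SELECT * that only touches the first blocks succeeds while COUNT(*) hits the corrupted sync marker:

import org.apache.spark.sql.SparkSession

object AvroReadCheck extends App {
  val spark = SparkSession.builder()
    .appName("AvroReadCheck")
    .master("local[*]")
    .getOrCreate()

  // Assumes the spark-avro module (org.apache.spark.sql.avro, Spark 2.4+)
  // is on the classpath; it registers the "avro" format.
  val df = spark.read.format("avro").load("/tmp/records.avro")

  // A count forces a scan of every block, so it surfaces a bad sync marker
  // that a LIMIT-ed SELECT over the first blocks never reaches.
  println(s"row count = ${df.count()}")
}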
09-17-2018 05:49 AM
Hi All,
We know there are formulas available to determine a Spark job's executor memory, number of executors, and executor cores based on your cluster's available resources. Is there any formula available to calculate the same based on data size alone?
Case 1: what is the configuration if data size < 5 GB?
Case 2: what is the configuration if 5 GB < data size < 10 GB?
Case 3: what is the configuration if 10 GB < data size < 15 GB?
Case 4: what is the configuration if 15 GB < data size < 25 GB?
Case 5: what is the configuration if data size > 25 GB?
Cheers, MJ
Labels: Apache Spark
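There is no official data-size formula that I know of; the sketch below is only a common rule-of-thumb heuristic, and every constant in it (128 MB partitions, ~5 cores per executor, ~3 task waves per core) is an assumption, not a guarantee:

object ExecutorSizing extends App {
  // Rule-of-thumb sketch: size the executor count from the input data,
  // targeting roughly one task per 128 MB block.
  val dataSizeGb = 10.0        // hypothetical input size
  val partitionMb = 128        // typical HDFS block size (assumption)
  val coresPerExecutor = 5     // common HDFS-throughput rule of thumb
  val wavesPerCore = 3         // aim for a few task waves per core

  val partitions = math.ceil(dataSizeGb * 1024 / partitionMb).toInt
  val executors = math.max(1, partitions / (coresPerExecutor * wavesPerCore))

  println(s"--num-executors $executors --executor-cores $coresPerExecutor")
}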
09-17-2018 05:46 AM
Hi @vgarg, thanks for your reply. I don't think a CASE statement would be helpful for me, because I need to make this change globally across all my systems. Cheers, MJ
09-04-2018 06:25 AM
Hi All, I have a Hive table in ORC format, and some records are stored as NULL in the table. Is there any configuration in Hive to return NULL as an empty string while querying the data? Cheers, MJ
Labels: Apache Hive
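As a per-query workaround (not a global Hive setting), NVL or COALESCE can substitute an empty string at read time. A hedged sketch via Spark SQL, where the table and column names are made up and Hive support is assumed to be available:

import org.apache.spark.sql.SparkSession

object NullAsEmpty extends App {
  // enableHiveSupport assumes Spark was built with Hive support and can
  // reach the Hive metastore.
  val spark = SparkSession.builder()
    .appName("NullAsEmpty")
    .enableHiveSupport()
    .getOrCreate()

  // NVL(col, '') returns '' wherever col is NULL; COALESCE(col, '') is the
  // standard-SQL equivalent. Table and column names here are hypothetical.
  spark.sql("SELECT NVL(customer_name, '') AS customer_name FROM my_orc_table").show()
}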
08-27-2018 03:19 PM
Hi, the issue got resolved. I was trying to perform a GROUP BY operation inside a column literal; groupBy itself produces the new column, so instead of writing a query like the one I asked about above, we have to change the query as follows:
inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
  .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
    when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))
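A self-contained sketch of that pattern for anyone landing here; the column names follow the post, but the input rows are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

object GroupByFix extends App {
  val spark = SparkSession.builder().appName("GroupByFix").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical input mirroring the columns in the post; ISO date strings
  // compare correctly even as plain strings.
  var inputDf = Seq(
    ("t1", "2018-08-08", "2018-08-01", "2018-08-20", "2018-08-10", 100L),
    ("t1", "2018-08-08", "2018-08-01", "2018-08-20", "2018-08-05", 50L)
  ).toDF("tableName", "runDate", "MinBusinessDate", "MaxBusinessDate", "business_date", "rowCount")

  // Aggregate with groupBy/agg instead of embedding a grouped Dataset in lit(...):
  inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
    .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
      when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))

  inputDf.show()
}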
08-27-2018 03:18 PM
Hi, the issue got resolved. I was trying to perform a GROUP BY operation inside a column literal; groupBy itself produces the new column, so instead of writing a query like the one I asked about above, we have to change the query as follows:
inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
  .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
    when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))
08-26-2018 02:55 PM
Hi All, I'm trying to add a column to a dataframe based on multiple check conditions. One of the operations we are doing is taking a sum of rows, but I'm getting the error below:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [StorageDayCountBeore: double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at MYDev.ReconTest$.main(ReconTest.scala:35)
at MYDev.ReconTest.main(ReconTest.scala)
and the query I'm using is:
var df = inputDf
df = df.persist()
inputDf = inputDf.withColumn("newColumn",
  when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
    lit(df.groupBy(df("tableName"), df("runDate"))
      .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
        .alias("finalSRCcount"))
      .drop("tableName", "runDate"))))
Cheers, MJ
Labels: Apache Spark
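One way to read the error: lit() accepts only scalar literals, and the query above passes it a whole grouped Dataset. The accepted answer above (plain groupBy/agg) is the real fix; as a separate illustration, a hedged, self-contained sketch (made-up data, column names from the post) showing that collapsing an aggregate to a plain value first yields a valid literal:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sum, when}

object LiteralFix extends App {
  val spark = SparkSession.builder().appName("LiteralFix").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical data with the post's column names.
  val df = Seq(("2018-08-10", 100L), ("2018-08-05", 50L))
    .toDF("business_date", "rowCount")

  // lit(df.groupBy(...).agg(...)) throws "Unsupported literal type class
  // org.apache.spark.sql.Dataset" because the argument is a Dataset, not a
  // literal. Collecting the aggregate to a plain Long makes it one:
  val finalSrcCount: Long = df
    .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount"))))
    .first()
    .getLong(0)

  df.withColumn("newColumn", lit(finalSrcCount)).show()
}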
08-26-2018 02:53 PM
Hi All, I'm trying to add a column to a dataframe based on multiple check conditions. One of the operations we are doing is taking a sum of rows, but I'm getting the error below:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [Column_12: double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at MYDev.ReconTest$.main(ReconTest.scala:35)
at MYDev.ReconTest.main(ReconTest.scala)
and the query I'm using is:
var df = inputDf
df = df.persist()
inputDf = inputDf.withColumn("newColumn",
  when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
    lit(df.groupBy(df("tableName"), df("runDate"))
      .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
        .alias("finalSRCcount"))
      .drop("tableName", "runDate"))))
Cheers, MJ
Labels: Apache Spark
08-25-2018 06:40 AM
Hi @Felix Albani, yes, your suggestion works fine. I think I had to extend my object with App to make it work. Thanks, bro.
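For reference, a minimal sketch of what extending the object with Scala's App trait looks like (the object name is borrowed from the stack trace above; the body is a placeholder):

// Extending scala.App makes the object body the program's entry point,
// so no explicit def main(args: Array[String]) is needed.
object ReconTest extends App {
  println("job logic runs here")
}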