Member since: 03-26-2017
Posts: 61
Kudos Received: 1
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3014 | 08-27-2018 03:19 PM
 | 24976 | 08-27-2018 03:18 PM
 | 9573 | 04-02-2018 01:54 PM
11-27-2018 08:10 AM
Hi @Felix Albani, thanks for your response. I referred to this site, and I'm thinking my issue is related to that story. Cheers, MJ
11-26-2018 10:20 AM
Hi All, I'm trying to write 430,000 records into an Avro file. I'm using the following Avro writer dependency:
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.9.0</version>
</dependency>
The file write completes successfully, but when I try to read the data back from Spark using any Avro-supporting library (Databricks Avro, Spark Avro, or Apache Avro), I get the error below. (One important point: up to 200,000 records I can still read my data without the error.)
18/11/26 15:48:23 ERROR TaskSetManager: Task 3 in stage 0.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
at org.apache.spark.sql.avro.AvroFileFormat$anonfun$buildReader$1$anon$1.hasNext(AvroFileFormat.scala:202)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$11$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:297)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
... 18 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1887)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2831)
at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2830)
at org.apache.spark.sql.Dataset$anonfun$53.apply(Dataset.scala:3365)
at org.apache.spark.sql.execution.SQLExecution$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2830)
... 49 elided
Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
at org.apache.spark.sql.avro.AvroFileFormat$anonfun$buildReader$1$anon$1.hasNext(AvroFileFormat.scala:202)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$11$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid sync!
at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:297)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
... 18 more
Even from Hive I can run a SELECT * FROM statement, but I cannot perform a COUNT(*) operation. Help me out on this issue. Cheers, MJ (+91) - 9688 514 443
Labels: Apache Spark
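For context, a minimal sketch of the read side (the file path, object name, and session settings are hypothetical) using the spark-avro data source from the stack trace above. A full count() must scan every block of the Avro container file, which plausibly explains why a SELECT * that only touches the first blocks succeeds while COUNT(*) hits the corrupted sync marker:

import org.apache.spark.sql.SparkSession

object AvroReadCheck extends App {
  val spark = SparkSession.builder()
    .appName("AvroReadCheck")
    .master("local[*]")
    .getOrCreate()

  // Assumes the spark-avro module (org.apache.spark.sql.avro, Spark 2.4+)
  // is on the classpath; it registers the "avro" format.
  val df = spark.read.format("avro").load("/tmp/records.avro")

  // A count forces a scan of every block, so it surfaces a bad sync marker
  // that a LIMIT-ed SELECT over the first blocks never reaches.
  println(s"row count = ${df.count()}")
}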
09-17-2018 05:49 AM
Hi All,
We know there are formulas available to determine a Spark job's executor memory, number of executors, and executor cores based on your cluster's available resources. Is there any formula available to calculate the same based on data size alone?
Case 1: what is the configuration if data size < 5 GB?
Case 2: what is the configuration if 5 GB < data size < 10 GB?
Case 3: what is the configuration if 10 GB < data size < 15 GB?
Case 4: what is the configuration if 15 GB < data size < 25 GB?
Case 5: what is the configuration if data size > 25 GB?
Cheers, MJ
Labels: Apache Spark
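There is no official data-size formula that I know of; the sketch below is only a common rule-of-thumb heuristic, and every constant in it (128 MB partitions, ~5 cores per executor, ~3 task waves per core) is an assumption, not a guarantee:

object ExecutorSizing extends App {
  // Rule-of-thumb sketch: size the executor count from the input data,
  // targeting roughly one task per 128 MB block.
  val dataSizeGb = 10.0        // hypothetical input size
  val partitionMb = 128        // typical HDFS block size (assumption)
  val coresPerExecutor = 5     // common HDFS-throughput rule of thumb
  val wavesPerCore = 3         // aim for a few task waves per core

  val partitions = math.ceil(dataSizeGb * 1024 / partitionMb).toInt
  val executors = math.max(1, partitions / (coresPerExecutor * wavesPerCore))

  println(s"--num-executors $executors --executor-cores $coresPerExecutor")
}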
09-17-2018 05:46 AM
Hi @vgarg, thanks for your reply. I don't think a CASE statement would be helpful for me, because I need to make this change globally across all my systems. Cheers, MJ
09-04-2018 06:25 AM
Hi All, I have a Hive table in ORC format, and some records are stored as NULL in the table. Is there any configuration in Hive to return NULL as an empty string while querying the data? Cheers, MJ
Labels: Apache Hive
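As a per-query workaround (not a global Hive setting), NVL or COALESCE can substitute an empty string at read time. A hedged sketch via Spark SQL, where the table and column names are made up and Hive support is assumed to be available:

import org.apache.spark.sql.SparkSession

object NullAsEmpty extends App {
  // enableHiveSupport assumes Spark was built with Hive support and can
  // reach the Hive metastore.
  val spark = SparkSession.builder()
    .appName("NullAsEmpty")
    .enableHiveSupport()
    .getOrCreate()

  // NVL(col, '') returns '' wherever col is NULL; COALESCE(col, '') is the
  // standard-SQL equivalent. Table and column names here are hypothetical.
  spark.sql("SELECT NVL(customer_name, '') AS customer_name FROM my_orc_table").show()
}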
08-27-2018 03:19 PM
Hi, the issue got resolved. I was trying to perform a GROUP BY operation inside a column literal; groupBy itself produces the new column, so instead of writing a query like the one I asked about above, we have to change the query as follows:
inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
  .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
    when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))
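A self-contained sketch of that pattern for anyone landing here; the column names follow the post, but the input rows are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

object GroupByFix extends App {
  val spark = SparkSession.builder().appName("GroupByFix").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical input mirroring the columns in the post; ISO date strings
  // compare correctly even as plain strings.
  var inputDf = Seq(
    ("t1", "2018-08-08", "2018-08-01", "2018-08-20", "2018-08-10", 100L),
    ("t1", "2018-08-08", "2018-08-01", "2018-08-20", "2018-08-05", 50L)
  ).toDF("tableName", "runDate", "MinBusinessDate", "MaxBusinessDate", "business_date", "rowCount")

  // Aggregate with groupBy/agg instead of embedding a grouped Dataset in lit(...):
  inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
    .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
      when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))

  inputDf.show()
}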
08-27-2018 03:18 PM
Hi, the issue got resolved. I was trying to perform a GROUP BY operation inside a column literal; groupBy itself produces the new column, so instead of writing a query like the one I asked about above, we have to change the query as follows:
inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
  .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
    when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))
08-26-2018 02:55 PM
Hi All, I'm trying to add a column to a dataframe based on multiple check conditions. One of the operations we are doing is taking a sum of rows, but I'm getting the error below:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [StorageDayCountBeore: double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at MYDev.ReconTest$.main(ReconTest.scala:35)
at MYDev.ReconTest.main(ReconTest.scala)
and the query I'm using is:
var df = inputDf
df = df.persist()
inputDf = inputDf.withColumn("newColumn",
  when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
    lit(df.groupBy(df("tableName"), df("runDate"))
      .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
        .alias("finalSRCcount"))
      .drop("tableName", "runDate"))))
Cheers, MJ
Labels: Apache Spark
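One way to read the error: lit() accepts only scalar literals, and the query above passes it a whole grouped Dataset. The accepted answer above (plain groupBy/agg) is the real fix; as a separate illustration, a hedged, self-contained sketch (made-up data, column names from the post) showing that collapsing an aggregate to a plain value first yields a valid literal:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sum, when}

object LiteralFix extends App {
  val spark = SparkSession.builder().appName("LiteralFix").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical data with the post's column names.
  val df = Seq(("2018-08-10", 100L), ("2018-08-05", 50L))
    .toDF("business_date", "rowCount")

  // lit(df.groupBy(...).agg(...)) throws "Unsupported literal type class
  // org.apache.spark.sql.Dataset" because the argument is a Dataset, not a
  // literal. Collecting the aggregate to a plain Long makes it one:
  val finalSrcCount: Long = df
    .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount"))))
    .first()
    .getLong(0)

  df.withColumn("newColumn", lit(finalSrcCount)).show()
}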
08-26-2018 02:53 PM
Hi All, I'm trying to add a column to a dataframe based on multiple check conditions. One of the operations we are doing is taking a sum of rows, but I'm getting the error below:
Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [Column_12: double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at MYDev.ReconTest$.main(ReconTest.scala:35)
at MYDev.ReconTest.main(ReconTest.scala)
and the query I'm using is:
var df = inputDf
df = df.persist()
inputDf = inputDf.withColumn("newColumn",
  when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
    lit(df.groupBy(df("tableName"), df("runDate"))
      .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
        .alias("finalSRCcount"))
      .drop("tableName", "runDate"))))
Cheers, MJ
Labels: Apache Spark
08-25-2018 06:40 AM
Hi @Felix Albani, yes, your suggestion works fine. I think I had to extend my object with App to make it work. Thanks, bro.
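For reference, a minimal sketch of what extending the object with Scala's App trait looks like (the object name is borrowed from the stack trace above; the body is a placeholder):

// Extending scala.App makes the object body the program's entry point,
// so no explicit def main(args: Array[String]) is needed.
object ReconTest extends App {
  println("job logic runs here")
}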