Member since: 03-26-2017

| Posts | Kudos Received | Solutions |
|---|---|---|
| 61 | 1 | 3 |

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 3841 | 08-27-2018 03:19 PM |
|  | 27784 | 08-27-2018 03:18 PM |
|  | 11423 | 04-02-2018 01:54 PM |
11-27-2018 08:10 AM
Hi @Felix Albani, thanks for your response. I referred to this site, and I think my issue is related to that story.

Cheers,
MJ
    
	
		
		
11-26-2018 10:20 AM
Hi All,

I'm trying to write 430,000 records into an Avro file. I'm using the following Avro writer dependency:

    <dependency>
      <groupId>org.apache.parquet</groupId>
      <artifactId>parquet-avro</artifactId>
      <version>1.9.0</version>
    </dependency>

The file writes successfully, but when I try to read the data back from Spark with any Avro-supporting library (Databricks Avro, spark-avro, or Apache Avro) I get the error below. (One important note: up to about 200,000 records I can still read my data without the error.)

18/11/26 15:48:23 ERROR TaskSetManager: Task 3 in stage 0.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost, executor driver): org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
        at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
        at org.apache.spark.sql.avro.AvroFileFormat$anonfun$buildReader$1$anon$1.hasNext(AvroFileFormat.scala:202)
        at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:101)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$11$anon$1.hasNext(WholeStageCodegenExec.scala:619)
        at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid sync!
        at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:297)
        at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
        ... 18 more
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1887)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
  at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$anonfun$collect$1.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
  at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
  at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2831)
  at org.apache.spark.sql.Dataset$anonfun$count$1.apply(Dataset.scala:2830)
  at org.apache.spark.sql.Dataset$anonfun$53.apply(Dataset.scala:3365)
  at org.apache.spark.sql.execution.SQLExecution$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
  at org.apache.spark.sql.Dataset.count(Dataset.scala:2830)
  ... 49 elided
Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
  at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
  at org.apache.spark.sql.avro.AvroFileFormat$anonfun$buildReader$1$anon$1.hasNext(AvroFileFormat.scala:202)
  at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:101)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$anonfun$11$anon$1.hasNext(WholeStageCodegenExec.scala:619)
  at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$anonfun$10.apply(Executor.scala:402)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid sync!
  at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:297)
  at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
  ... 18 more

Even from Hive I can run a SELECT * statement, but I cannot perform a count(*) operation.

Please help me out with this issue.

Cheers,
MJ
(+91) - 9688 514 443
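For reference: "Invalid sync!" typically means the reader hit a block whose sync marker doesn't match, i.e. the file is truncated, was never properly closed, or is not actually an Avro container file (the parquet-avro artifact above is Parquet's Avro binding and produces Parquet files rather than Avro container files). Below is a minimal sketch, with a made-up schema and path, of writing an Avro container file with Avro's own DataFileWriter and closing it before it is read:

```scala
import java.io.File

import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

object AvroWriteSketch {
  def main(args: Array[String]): Unit = {
    // Made-up two-field schema; substitute the real record schema.
    val schema: Schema = SchemaBuilder.record("Example").fields()
      .requiredLong("id")
      .requiredString("name")
      .endRecord()

    // DataFileWriter produces a standard Avro container file, writing a sync
    // marker at each block boundary.
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, new File("/tmp/example.avro"))

    (1 to 430000).foreach { i =>
      val rec = new GenericData.Record(schema)
      rec.put("id", i.toLong)
      rec.put("name", s"row-$i")
      writer.append(rec)
    }

    // Closing flushes the final block; a file read before close/flush (or a file
    // that is not really an Avro container) is what typically triggers "Invalid sync!".
    writer.close()
  }
}
```

A file written this way should be readable with Spark's Avro source, e.g. spark.read.format("avro").load("/tmp/example.avro") with the spark-avro module on the classpath.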
						
					
Labels: Apache Spark
			
    
	
		
		
09-17-2018 05:49 AM
Hi All,

We know there are formulas to determine a Spark job's executor memory, number of executors, and executor cores based on the cluster's available resources. Is there any formula to calculate the same values based on the input data size as well? For example:

- Case 1: what is the configuration if the data size is under 5 GB?
- Case 2: what is the configuration if the data size is between 5 GB and 10 GB?
- Case 3: what is the configuration if the data size is between 10 GB and 15 GB?
- Case 4: what is the configuration if the data size is between 15 GB and 25 GB?
- Case 5: what is the configuration if the data size is over 25 GB?

(See the rough sketch below.)

Cheers,
MJ
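There is no single agreed formula that folds data size into executor sizing, but the usual rules of thumb can at least be turned into a rough calculation. The sketch below is only illustrative: the constants (about 128 MB per input partition, at most 5 cores per executor, roughly two waves of tasks per core) are assumptions, and whatever it suggests still has to be capped by the memory and cores the cluster actually has.

```scala
// A rough sizing sketch, not an official formula: it only turns common rules of
// thumb into numbers. All constants here are assumptions.
object ExecutorSizingSketch {
  case class Suggestion(partitions: Int, executors: Int, coresPerExecutor: Int)

  def suggest(dataSizeGb: Double,
              coresPerExecutor: Int = 5,    // assumed upper bound per executor
              targetPartitionMb: Int = 128, // assumed split size (HDFS default block size)
              tasksPerCoreWave: Int = 2     // assumed waves of tasks per core
             ): Suggestion = {
    val partitions = math.max(1, math.ceil(dataSizeGb * 1024 / targetPartitionMb).toInt)
    val totalCores = math.max(coresPerExecutor, partitions / tasksPerCoreWave)
    val executors  = math.max(1, math.ceil(totalCores.toDouble / coresPerExecutor).toInt)
    Suggestion(partitions, executors, coresPerExecutor)
  }

  def main(args: Array[String]): Unit = {
    Seq(5.0, 10.0, 15.0, 25.0).foreach { gb =>
      println(s"$gb GB -> ${suggest(gb)}")
    }
  }
}
```

Under these assumptions, 5 GB works out to about 40 partitions and 4 executors of 5 cores each, and 25 GB to about 200 partitions and 20 executors; executor memory would then be divided out of the per-node memory that remains after overhead.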
						
					
Labels: Apache Spark
			
    
	
		
		
09-17-2018 05:46 AM
Hi @vgarg,

Thanks for your reply. I don't think a CASE statement would help me here, because I need to make this change globally across all my systems.

Cheers,
MJ
						
					
    
	
		
		
09-04-2018 06:25 AM
Hi All,

I have a Hive table in ORC format, and some records are stored as NULL in the table. Is there any configuration in Hive to return NULL as an empty string while querying the data?

Cheers,
MJ
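As far as I know there is no global Hive setting that returns NULL as an empty string for ORC tables at read time; the usual workaround is query-level, wrapping the nullable columns in COALESCE (or NVL). A minimal sketch through Spark SQL, with a made-up database, table, and column name:

```scala
import org.apache.spark.sql.SparkSession

object NullAsEmptyStringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("null-as-empty-string")
      .enableHiveSupport()
      .getOrCreate()

    // COALESCE(col, '') returns '' whenever col is NULL; the same expression
    // works in HiveQL directly.
    val df = spark.sql(
      """SELECT id, COALESCE(customer_name, '') AS customer_name
        |FROM my_db.my_orc_table""".stripMargin)
    df.show()
  }
}
```

The same COALESCE(customer_name, '') expression works directly in the Hive CLI or Beeline, and a view defined with it can make the behaviour effectively global for anyone reading through that view.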
						
					
Labels: Apache Hive
			
    
	
		
		
08-27-2018 03:19 PM
Hi, the issue got resolved.

I was trying to perform a groupBy operation inside a column literal; groupBy itself already produces the new column. Instead of writing the query the way I asked above, the query has to be changed as follows:

    inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
      .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
        when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))
						
					
			
    
	
		
		
08-27-2018 03:18 PM
Hi, the issue got resolved.

I was trying to perform a groupBy operation inside a column literal; groupBy itself already produces the new column. Instead of writing the query the way I asked above, the query has to be changed as follows:

    inputDf = inputDf.groupBy(col("tableName"), col("runDate"))
      .agg(sum(when(col("MinBusinessDate") < col("runDate") && col("MaxBusinessDate") > col("runDate"),
        when(col("business_date") > col("runDate"), col("rowCount")))).alias("NewColumnName"))
						
					
			
    
	
		
		
08-26-2018 02:55 PM
Hi All,

I'm trying to add a column to a DataFrame based on multiple check conditions. One of the operations we are doing requires taking a sum of rows, but I'm getting the error below:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [StorageDayCountBeore: double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at MYDev.ReconTest$.main(ReconTest.scala:35)
at MYDev.ReconTest.main(ReconTest.scala)  
And the query I'm using is:

    var df = inputDf
    df = df.persist()
    inputDf = inputDf.withColumn("newColumn",
      when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
        lit(df.groupBy(df("tableName"), df("runDate"))
          .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
            .alias("finalSRCcount"))
          .drop("tableName", "runDate"))))

Cheers,
MJ
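For reference: the exception comes from passing a whole Dataset into lit(), which only accepts literal values. One alternative, sketched below, is to compute the aggregate as its own DataFrame and join it back on the grouping keys; it reuses inputDf from the query above, and the date threshold is only illustrative.

```scala
import org.apache.spark.sql.functions.{col, sum, when}

// Aggregate once per (tableName, runDate) instead of embedding the groupBy in lit().
val counts = inputDf
  .groupBy(col("tableName"), col("runDate"))
  .agg(sum(when(col("business_date") > "2018-08-08", col("rowCount"))).alias("finalSRCcount"))

// Join the counts back on the grouping keys, then gate the new column on the date range.
val withNewColumn = inputDf
  .join(counts, Seq("tableName", "runDate"), "left")
  .withColumn("newColumn",
    when(col("MinBusinessDate") < "2018-08-08" && col("MaxBusinessDate") > "2018-08-08",
      col("finalSRCcount")))
```

The accepted posts from 08-27-2018 above reach the same result with a single conditional aggregation instead of a join.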
						
					
Labels: Apache Spark
			
    
	
		
		
08-26-2018 02:53 PM
Hi All,

I'm trying to add a column to a DataFrame based on multiple check conditions. One of the operations we are doing requires taking a sum of rows, but I'm getting the error below:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [Column_12: double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:162)
at org.apache.spark.sql.functions$.typedLit(functions.scala:112)
at org.apache.spark.sql.functions$.lit(functions.scala:95)
at MYDev.ReconTest$.main(ReconTest.scala:35)
at MYDev.ReconTest.main(ReconTest.scala)

And the query I'm using is:

    var df = inputDf
    df = df.persist()
    inputDf = inputDf.withColumn("newColumn",
      when(df("MinBusinessDate") < "2018-08-8" && df("MaxBusinessDate") > "2018-08-08",
        lit(df.groupBy(df("tableName"), df("runDate"))
          .agg(sum(when(df("business_date") > "2018-08-08", df("rowCount")))
            .alias("finalSRCcount"))
          .drop("tableName", "runDate"))))

Cheers,
MJ
						
					
Labels: Apache Spark
			
    
	
		
		
08-25-2018 06:40 AM
Hi @Felix Albani,

Yes, your suggestion works fine. I think I have to extend my object with App to make it work. Thanks, bro.
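For context, a minimal sketch of the two equivalent Scala entry points; the object names and body are made up for illustration:

```scala
// Option 1: extending scala.App turns the object body into the program's main method.
object JobWithApp extends App {
  println("job logic goes here")
}

// Option 2: an explicit main method; either form gives spark-submit / the JVM an entry point.
object JobWithMain {
  def main(args: Array[String]): Unit = {
    println("job logic goes here")
  }
}
```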
						
					