Member since: 09-13-2017
Posts: 4
Kudos Received: 0
Solutions: 0
10-05-2017
03:25 PM
Hey @bkosaraju, thanks for sharing your thoughts. I had this as one of the alternatives, but I am looking for another way in Spark SQL, such as transaction-level control (like commit or rollback) when a DataFrame is not created or any other exception occurs. Thank you.
10-04-2017
02:34 PM
I am running Spark SQL on Spark 1.6 in Scala, invoked from a shell script. When any step fails while creating a DataFrame or inserting data into a Hive table, the subsequent steps still execute. These are the errors I see:
- org.apache.spark.sql.AnalysisException: Partition column batchdate not found in existing columns
- org.apache.spark.sql.AnalysisException: cannot resolve 'batchdate' given input columns
- error: not found: value DF1
- org.apache.spark.sql.AnalysisException: Table not found: locationtable;
How can I make my Spark SQL job fail as soon as it hits an error, so that it does not execute the subsequent queries and control returns to the calling shell script? Thanks!!
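One way to get this behavior, shown here only as a hedged sketch (the query, table, and step names below are placeholders, not from this thread), is to wrap each step in a Try and stop the job with a non-zero exit code on the first failure, so the calling shell script can check the spark-submit exit status before running anything else:

import scala.util.{Failure, Success, Try}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: query and table names are placeholders, not from the post.
object FailFastSparkJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fail-fast-job"))
    val sqlContext = new HiveContext(sc)

    // Run a step; on any exception, log it, stop Spark, and exit non-zero
    // so the calling shell script sees the failure and skips later steps.
    def runStep[T](name: String)(step: => T): T = Try(step) match {
      case Success(result) => result
      case Failure(e) =>
        System.err.println(s"Step '$name' failed: ${e.getMessage}")
        sc.stop()
        sys.exit(1)
    }

    val df1 = runStep("create DF1") {
      sqlContext.sql("SELECT * FROM db.locationtable")        // placeholder query
    }
    runStep("insert into Hive") {
      df1.write.mode("append").insertInto("db.target_table")  // placeholder table
    }

    sc.stop()
  }
}

From the shell script you would then test the exit status of spark-submit (for example, if [ $? -ne 0 ]; then exit 1; fi) before issuing the next query.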
Labels:
- Apache Spark
09-16-2017
06:31 AM
Hi @Vijayalakshmi Ekambaram, Spark's default behavior is to create 200 partitions when performing aggregations, controlled by the configuration property "spark.sql.shuffle.partitions" (default value 200). This is why you find a lot of small files under the Hive table location after each insert from Spark. You can use a "coalesce" statement like the one below to avoid writing many files into Hive (coalesce does not perform a full shuffle, whereas repartition shuffles all data across the network):
new_df.coalesce(2).write.mode("append").partitionBy("week").insertInto("db.tablename")
Note: the high default is there to increase the amount of parallelism in Spark, which is useful for large workloads.
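For reference, a small sketch (database, table, and column names are placeholders, not from this thread) of both knobs: lowering spark.sql.shuffle.partitions for the session, or coalescing just before the write:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: db/table/column names are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("small-files-example"))
val sqlContext = new HiveContext(sc)

// Option 1: lower the shuffle partition count for this session so
// aggregations produce fewer partitions (and hence fewer output files).
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

// Option 2: coalesce right before the write. coalesce only merges existing
// partitions, avoiding the full network shuffle that repartition performs.
val newDf = sqlContext.sql(
  "SELECT week, SUM(amount) AS total FROM db.sales GROUP BY week")
newDf.coalesce(2)
  .write
  .mode("append")
  .insertInto("db.weekly_totals")

Option 1 affects every shuffle in the session, while Option 2 only changes the number of output files for that one write, so pick whichever matches how broadly you want the change applied.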