Member since: 09-13-2017
Posts: 4
Kudos Received: 0
Solutions: 0
10-05-2017
03:25 PM
Hey @bkosaraju, thanks for sharing your thoughts. I had this as one of the alternatives, but I am looking for another way in Spark SQL, such as transaction-level control (like commit or rollback) when a DataFrame is not created or any other exception occurs. Thank you.
10-04-2017
02:34 PM
I am running Spark SQL on Spark 1.6 in Scala, invoked from a shell script. When any step fails while creating a DataFrame or inserting data into a Hive table, the subsequent steps still execute. These are the errors I see:
- org.apache.spark.sql.AnalysisException: Partition column batchdate not found in existing columns
- org.apache.spark.sql.AnalysisException: cannot resolve 'batchdate' given input columns
- error: not found: value DF1
- org.apache.spark.sql.AnalysisException: Table not found: locationtable;
How can I make my Spark SQL job fail as soon as it hits an error, so that it does not execute the subsequent queries and control returns to the calling shell script? Thanks!!
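One way to get this behavior, shown here only as a hedged sketch (the query, table, and step names below are placeholders, not from this thread), is to wrap each step in a Try and stop the job with a non-zero exit code on the first failure, so the calling shell script can check the spark-submit exit status before running anything else:

import scala.util.{Failure, Success, Try}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: query and table names are placeholders, not from the post.
object FailFastSparkJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fail-fast-job"))
    val sqlContext = new HiveContext(sc)

    // Run a step; on any exception, log it, stop Spark, and exit non-zero
    // so the calling shell script sees the failure and skips later steps.
    def runStep[T](name: String)(step: => T): T = Try(step) match {
      case Success(result) => result
      case Failure(e) =>
        System.err.println(s"Step '$name' failed: ${e.getMessage}")
        sc.stop()
        sys.exit(1)
    }

    val df1 = runStep("create DF1") {
      sqlContext.sql("SELECT * FROM db.locationtable")        // placeholder query
    }
    runStep("insert into Hive") {
      df1.write.mode("append").insertInto("db.target_table")  // placeholder table
    }

    sc.stop()
  }
}

From the shell script you would then test the exit status of spark-submit (for example, if [ $? -ne 0 ]; then exit 1; fi) before issuing the next query.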
Labels:
- Apache Spark
09-16-2017
06:31 AM
Hi @Vijayalakshmi Ekambaram, Spark's default behavior is to create 200 partitions when performing aggregations, controlled by the configuration property "spark.sql.shuffle.partitions" (default value 200). This is why you find a lot of small files under the Hive table location after each insert from Spark. You can use a "coalesce" statement like the one below to avoid writing many files into Hive (coalesce does not perform a full shuffle, whereas repartition shuffles all data across the network):
new_df.coalesce(2).write.mode("append").partitionBy("week").insertInto("db.tablename")
Note: the high default is there to increase the amount of parallelism in Spark, which is useful for large workloads.
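For reference, a small sketch (database, table, and column names are placeholders, not from this thread) of both knobs: lowering spark.sql.shuffle.partitions for the session, or coalescing just before the write:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: db/table/column names are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("small-files-example"))
val sqlContext = new HiveContext(sc)

// Option 1: lower the shuffle partition count for this session so
// aggregations produce fewer partitions (and hence fewer output files).
sqlContext.setConf("spark.sql.shuffle.partitions", "10")

// Option 2: coalesce right before the write. coalesce only merges existing
// partitions, avoiding the full network shuffle that repartition performs.
val newDf = sqlContext.sql(
  "SELECT week, SUM(amount) AS total FROM db.sales GROUP BY week")
newDf.coalesce(2)
  .write
  .mode("append")
  .insertInto("db.weekly_totals")

Option 1 affects every shuffle in the session, while Option 2 only changes the number of output files for that one write, so pick whichever matches how broadly you want the change applied.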