Created 08-16-2018 06:43 AM
Hi, we have been using an HDI instance with Spark 2.2. In this instance we load data from Spark into a bucketed Hive table. We recently looked at moving to HDP 2.6 on Cloudbreak but can't get the same code working due to the error "is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive". Is there a way to enable this functionality? And if not, is there a reason it works on HDI Spark 2.2?
Created 08-29-2018 07:28 AM
Creating Hive bucketed tables is supported from Spark 2.3 (SPARK-17729). By default, Spark disallows users from writing output to Hive bucketed tables.
Setting `hive.enforce.bucketing=false` and `hive.enforce.sorting=false` will allow you to save to Hive bucketed tables, as sketched below.
If you want, you can set those two properties in the Custom spark2-hive-site-override on Ambari; then all Spark2 applications will pick up the configuration.
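A minimal sketch of the session-level approach, assuming a `SparkSession` named `spark` built with Hive support; `db.source_table` and `db.bucketed_table` are hypothetical names. Note that this only lifts Spark's restriction; the files it writes are not bucketed in a Hive-compatible way:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hive support is required so insertInto can resolve tables in the Hive metastore.
val spark = SparkSession.builder()
  .appName("bucketed-write-workaround")
  .enableHiveSupport()
  .getOrCreate()

// Relax Hive's bucketing/sorting enforcement for this session only.
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")

// Append into the (hypothetical) bucketed table; Spark no longer refuses,
// but it does not produce Hive-compatible bucket files either.
spark.table("db.source_table")
  .write
  .mode(SaveMode.Append)
  .insertInto("db.bucketed_table")
```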
For more details, refer to Slideshare.
Created 10-22-2020 04:59 AM
I did as @ssubhas said, setting the attributes to false:
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")
spark.sql("SET spark.hadoop.hive.exec.dynamic.partition = true")
spark.sql("SET spark.hadoop.hive.exec.dynamic.partition.mode = nonstrict")
newPartitionsDF.write.mode(SaveMode.Append).format("hive").insertInto(this.destinationDBdotTableName)
For bucketing to actually be enforced when inserting, `hive.enforce.bucketing` should be set to `true`, not `false`. But with it set to `true`, you will face the following error in the Spark logs:

```
org.apache.spark.sql.AnalysisException: Output Hive table `hive_test_db`.`test_bucketing` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;
```

This means that Spark does not support inserting into Hive bucketed tables in a Hive-compatible way.
The first answer in this Stack Overflow question explains that what @ssubhas suggested is a workaround that does not guarantee bucketing.
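For completeness, Spark does offer its own bucketing through `DataFrameWriter.bucketBy`, but it writes Spark's bucket layout, which Hive does not recognize, and it only works with `saveAsTable`, not `insertInto`. A minimal sketch reusing `newPartitionsDF` from above; the bucket column and target table name are hypothetical:

```scala
// Spark-native bucketing: readable by Spark, not by Hive.
newPartitionsDF.write
  .mode(SaveMode.Append)
  .bucketBy(8, "user_id")   // hypothetical bucket count and column
  .sortBy("user_id")
  .format("parquet")
  .saveAsTable("hive_test_db.spark_bucketed_table")  // hypothetical table
```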