Hive bucketed table from Spark 2.3


Hi, we have been using an HDI instance with Spark 2.2. In this instance we load data from Spark into a bucketed Hive table. We recently looked at moving to HDP 2.6 on Cloudbreak, but we can't get the same code working because of the error "is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive". Is there a way to enable this functionality? And if not, is there a reason it works on HDI with Spark 2.2?

2 Replies

@James

Creating Hive bucketed tables is supported from Spark 2.3 (SPARK-17729). By default, Spark disallows writing output to Hive bucketed tables.

Setting `hive.enforce.bucketing=false` and `hive.enforce.sorting=false` will allow you to save to Hive bucketed tables.

If you want, you can set those two properties in the Custom spark2-hive-site-override section in Ambari; all Spark 2 applications will then pick up the configuration.
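If you prefer not to change the cluster-wide defaults, the same two keys can also be tried per application when building the session. This is only a sketch under the assumption that conf entries set here are forwarded to the Hive client; the spark.sql("SET ...") form shown in the reply below is an alternative. The application name is made up.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("load-into-bucketed-table")   // hypothetical app name
  .enableHiveSupport()
  // relax Hive's bucketing/sorting enforcement for this application only
  .config("hive.enforce.bucketing", "false")
  .config("hive.enforce.sorting", "false")
  .getOrCreate()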

For more details, refer to the SlideShare presentation.

Contributor

I did as @ssubhas said, setting those properties to false.

 

spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")
spark.sql("SET spark.hadoop.hive.exec.dynamic.partition = true")
spark.sql("SET spark.hadoop.hive.exec.dynamic.partition.mode = nonstrict")

newPartitionsDF.write.mode(SaveMode.Append).format("hive").insertInto(this.destinationDBdotTableName)
  • Spark can create the bucketed table in Hive with no issues.
  • Spark inserted the data into the table, but it completely ignored the fact that the table is bucketed, so when I open a partition I see only 1 file (see the quick check below the screenshot).

 

[Screenshot attachment: Matrix_0-1603367064001.png]
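To double-check this without a screenshot, here is a minimal sketch (the warehouse path and the partition value are made up, and `spark` is the active session) that counts the data files inside one partition directory; a table bucketed into N buckets should show N files per partition, while here only one shows up:

import org.apache.hadoop.fs.{FileSystem, Path}

// hypothetical partition directory of the target table
val partitionPath = new Path("/warehouse/tablespace/managed/hive/hive_test_db.db/test_bucketing/part_col=2020-10-22")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// list only data files (ignore subdirectories)
val dataFiles = fs.listStatus(partitionPath).filter(_.isFile)
dataFiles.foreach(f => println(f.getPath.getName))
println(s"data files in this partition: ${dataFiles.length}")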

When inserting, we should really set hive.enforce.bucketing = true, not false; and with it set to true you face the following error in the Spark logs:

org.apache.spark.sql.AnalysisException: Output Hive table `hive_test_db`.`test_bucketing` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;

This means that Spark does not support inserting into bucketed Hive tables in a way that produces Hive-compatible bucketed output.

The first answer to this Stack Overflow question explains that what @ssubhas suggested is a workaround that does not guarantee bucketing.
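If the bucketing guarantee matters, one option (not from this thread, just a sketch) is to let Hive itself perform the final insert: land the new rows in a plain staging table from Spark, then run the bucketed INSERT through HiveServer2 so Hive enforces the clustering. The host, credentials, staging table name and partition column below are all hypothetical, and the Hive JDBC driver has to be on the classpath.

import java.sql.DriverManager
import org.apache.spark.sql.SaveMode

// 1) From Spark: write the new data into an unbucketed staging table
newPartitionsDF.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("hive_test_db.test_bucketing_staging")

// 2) Through HiveServer2: let Hive do the bucketed (clustered) write
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/hive_test_db", "user", "password")
try {
  val stmt = conn.createStatement()
  stmt.execute("SET hive.enforce.bucketing=true")
  stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict")
  stmt.execute(
    "INSERT INTO TABLE hive_test_db.test_bucketing PARTITION (part_col) " +
      "SELECT * FROM hive_test_db.test_bucketing_staging")
} finally {
  conn.close()
}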