Hive bucketed table from Spark 2.3


Hi, we have been using an HDI instance with Spark 2.2. In this instance we load data from Spark into a bucketed Hive table. We recently looked at moving to HDP 2.6 on Cloudbreak, but we can't get the same code working because of the error "is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive". Is there a way to enable this functionality? And if not, is there a reason it works on HDI with Spark 2.2?

2 Replies

@James

Creating Hive bucketed tables is supported from Spark 2.3 (SPARK-17729). By default, Spark disallows writing output to Hive bucketed tables.

Setting `hive.enforce.bucketing=false` and `hive.enforce.sorting=false` will allow you to save to Hive bucketed tables.

If you want, you can set those two properties in the Custom spark2-hive-site-override section in Ambari; all Spark 2 applications will then pick up the configuration.
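If you prefer not to change the cluster-wide defaults, the same two keys can also be tried per application when building the session. This is only a sketch under the assumption that conf entries set here are forwarded to the Hive client; the spark.sql("SET ...") form shown in the reply below is an alternative. The application name is made up.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("load-into-bucketed-table")   // hypothetical app name
  .enableHiveSupport()
  // relax Hive's bucketing/sorting enforcement for this application only
  .config("hive.enforce.bucketing", "false")
  .config("hive.enforce.sorting", "false")
  .getOrCreate()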

For more details, refer to the SlideShare presentation.

Contributor

I did as @ssubhas said, setting those properties to false.

 

spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")
spark.sql("SET spark.hadoop.hive.exec.dynamic.partition = true")
spark.sql("SET spark.hadoop.hive.exec.dynamic.partition.mode = nonstrict")

newPartitionsDF.write.mode(SaveMode.Append).format("hive").insertInto(this.destinationDBdotTableName)
  • Spark can create the bucketed table in Hive with no issues.
  • Spark inserted the data into the table, but it completely ignored the fact that the table is bucketed, so when I open a partition I see only 1 file (see the quick check below the screenshot).

 

[Screenshot attachment: Matrix_0-1603367064001.png]
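To double-check this without a screenshot, here is a minimal sketch (the warehouse path and the partition value are made up, and `spark` is the active session) that counts the data files inside one partition directory; a table bucketed into N buckets should show N files per partition, while here only one shows up:

import org.apache.hadoop.fs.{FileSystem, Path}

// hypothetical partition directory of the target table
val partitionPath = new Path("/warehouse/tablespace/managed/hive/hive_test_db.db/test_bucketing/part_col=2020-10-22")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// list only data files (ignore subdirectories)
val dataFiles = fs.listStatus(partitionPath).filter(_.isFile)
dataFiles.foreach(f => println(f.getPath.getName))
println(s"data files in this partition: ${dataFiles.length}")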

When inserting, we should really set hive.enforce.bucketing = true, not false; and with it set to true you face the following error in the Spark logs:

org.apache.spark.sql.AnalysisException: Output Hive table `hive_test_db`.`test_bucketing` is bucketed but Spark currently does NOT populate bucketed output which is compatible with Hive.;

This means that Spark does not support inserting into bucketed Hive tables in a way that produces Hive-compatible bucketed output.

The first answer to this Stack Overflow question explains that what @ssubhas suggested is a workaround that does not guarantee bucketing.
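If the bucketing guarantee matters, one option (not from this thread, just a sketch) is to let Hive itself perform the final insert: land the new rows in a plain staging table from Spark, then run the bucketed INSERT through HiveServer2 so Hive enforces the clustering. The host, credentials, staging table name and partition column below are all hypothetical, and the Hive JDBC driver has to be on the classpath.

import java.sql.DriverManager
import org.apache.spark.sql.SaveMode

// 1) From Spark: write the new data into an unbucketed staging table
newPartitionsDF.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("hive_test_db.test_bucketing_staging")

// 2) Through HiveServer2: let Hive do the bucketed (clustered) write
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/hive_test_db", "user", "password")
try {
  val stmt = conn.createStatement()
  stmt.execute("SET hive.enforce.bucketing=true")
  stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict")
  stmt.execute(
    "INSERT INTO TABLE hive_test_db.test_bucketing PARTITION (part_col) " +
      "SELECT * FROM hive_test_db.test_bucketing_staging")
} finally {
  conn.close()
}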