We would like to implement bucketing on some of our tables, but we are struggling to make them readable in Hive.
First we had an issue with saveAsTable, as stated here. With that solved, we did our bucketing like this:
spark.table("large_table_1").write.options(Map("path" -> "/path/warehouse/bucketed_large_table_1")).bucketBy(100, "num1").sortBy("num1").saveAsTable("bucketed_large_table_1")
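For readability, the same write can be sketched out step by step (assuming an active SparkSession named spark, as in the one-liner above):

```scala
// Read the source table and write it back bucketed on num1.
spark.table("large_table_1")
  .write
  // External location for the table's data files.
  .options(Map("path" -> "/path/warehouse/bucketed_large_table_1"))
  // 100 buckets, hash-partitioned on num1 ...
  .bucketBy(100, "num1")
  // ... with rows sorted by num1 within each bucket.
  .sortBy("num1")
  .saveAsTable("bucketed_large_table_1")
```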
The bucketing works fine and we avoided some costly shuffle steps in our ETL.
However, Hive is not happy with our table schema when it is bucketed:
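For context, Hive only recognizes bucketing that is declared with a CLUSTERED BY clause in the table DDL, which Spark's bucketBy metadata does not produce. A minimal sketch of what a Hive-native declaration for this table might look like (the column type and storage format here are assumptions, not our actual schema):

```scala
// Hypothetical Hive-style DDL mirroring the layout above.
spark.sql("""
  CREATE TABLE bucketed_large_table_1 (num1 INT)
  CLUSTERED BY (num1) SORTED BY (num1) INTO 100 BUCKETS
  STORED AS PARQUET
  LOCATION '/path/warehouse/bucketed_large_table_1'
""")
```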
Any suggestions?
I believe bucketing support in Hive is being greatly improved at the moment, but it is still a shame that we cannot use it properly.