Support Questions

chrismcg89 · ‎05-11-2016

Hi,

I am currently using Spark streaming to write to an external hive table every 30 mins.

rdd.toDF().write.partitionBy("dt").options(options).format("orc").mode(SaveMode.Append).saveAsTable("table_name")

The issue with this is it creates lots of small files in HDFS, like so

	part-00000
	part-00000_copy_1

My table was created with transactions enabled, and I have enabled ACID transactions on the Hive instance however, I can't see any compactions running nor do any get created when I force compaction with ALTER TABLE command. I would expect compaction to run and merge these files as they are very small 200 KB's in size.

Any idea's or help greatly appreciated

ewalk · ‎05-11-2016

Hi @Chris McGuire,

Can you please provide an "hdfs dfs -ls -R <table-folder>"

Compaction only operates on tables with delta directories. I suspect that the method you're using (SaveMode.Append) is just appending to the existing partition (or adding a new partition) and not actually creating deltas.

Best,

Eric

View solution in original post

chrismcg89 · ‎05-11-2016

Thanks @mbalakrishnan, Im currently running Spark Streaming job locally which is writing to the Hive deployed on my cluster. I have added the hive.merge.sparkfiles property. Will this work on files written with the saveAsTable command ?

quadoss · ‎05-11-2016

@Chris McGuire I'm not sure whether this would work on saveAsTable command since I have very limited to no knowledge on spark. I'm hoping that this property should work for the spark streaming job as well.

ekoifman · ‎05-12-2016

Hive Acid tables are not integrated with Spark. To write to an Acid table in a streaming fashion you could use https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-StreamingAPIs

(hdfs dfs -ls -R output shows the table to not be in expected format for Acid table. You can check metastore log for errors regarding compaction, but I would not expect it to work)

gaurang_n_shah · ‎09-07-2018

Compaction works only on transactional table, and to make any table transactional it should meet following properties.

Should be ORC Table
Should be bucketed
Should be managed table.

Due you see the last point, you can't run compaction on non transactional table, if you do it from hive you will definitely get error, not sure from spark.

Cloudera Community

Support Questions

Hive compactions on External table