05-11-2016 09:59 PM
Thanks @mbalakrishnan, I'm currently running a Spark Streaming job locally that writes to the Hive deployed on my cluster. I have added the hive.merge.sparkfiles property. Will this work on files written with the saveAsTable command?
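For reference, a minimal sketch of setting that property per-session (assuming a `hiveContext` is in scope; note that hive.merge.sparkfiles targets jobs run by Hive's own Spark execution engine, so whether it affects files Spark writes directly via saveAsTable is exactly the open question here):

```scala
// Hypothetical sketch: set the merge property for the current Hive session.
// hive.merge.sparkfiles controls small-file merging for Hive-on-Spark jobs;
// it may not apply to files written directly by Spark's saveAsTable.
hiveContext.sql("SET hive.merge.sparkfiles=true")
```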
05-11-2016 07:25 PM
Thanks, I will make sure the Spark version of the property is set. Thanks for the help. I wonder if, instead of rdd.toDF().saveAsTable, I should be writing INSERT statements; this might force the delta files to be created.
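A minimal sketch of that INSERT-based approach (assuming a `hiveContext` and a DStream named `stream`; `col1` and `col2` are placeholder column names, Spark 1.x API). Whether this actually routes through Hive's transactional writer and produces delta files depends on the Spark version:

```scala
stream.foreachRDD { rdd =>
  val df = rdd.toDF()
  // Register the micro-batch so it can be referenced from Hive SQL.
  df.registerTempTable("batch_tmp")
  // Dynamic-partition INSERT via HiveContext instead of saveAsTable.
  // Note: it is not guaranteed that this produces ACID delta files.
  hiveContext.sql(
    """INSERT INTO TABLE table_name PARTITION (dt)
      |SELECT col1, col2, dt FROM batch_tmp""".stripMargin)
}
```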
05-11-2016 06:56 PM
@mbalakrishnan Thanks, yes, those properties are set. I believe it's something to do with how the data is getting written to Hive via Spark Streaming.
05-11-2016 06:55 PM
@Eric Walk Thanks, yes, you are correct: Spark isn't writing deltas, it's just adding files to the existing partition. Any idea how to get Spark to write the deltas?
-rw-r--r-- 3 cmcguire hdfs 0 2016-05-11 16:36 /test_data/test_test_tbl/_SUCCESS
drwxr-xr-x - cmcguire hdfs 0 2016-05-11 16:40 /test_data/test_tbl/dt=11-05-2016
-rwxr-xr-x 3 cmcguire hdfs 3750 2016-05-11 16:37 /test_data/test_tbl/dt=11-05-2016/part-00000
-rwxr-xr-x 3 cmcguire hdfs 5468 2016-05-11 16:37 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_1
-rwxr-xr-x 3 cmcguire hdfs 8264 2016-05-11 16:38 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_2
-rwxr-xr-x 3 cmcguire hdfs 7068 2016-05-11 16:38 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_3
-rwxr-xr-x 3 cmcguire hdfs 5157 2016-05-11 16:39 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_4
-rwxr-xr-x 3 cmcguire hdfs 10684 2016-05-11 16:39 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_5
-rwxr-xr-x 3 cmcguire hdfs 4796 2016-05-11 16:40 /test_data/test_tbl/dt=11-05-2016/part-00000_copy_6
05-11-2016 11:05 AM (1 Kudo)
Hi, I am currently using Spark Streaming to write to an external Hive table every 30 minutes:
rdd.toDF().write.partitionBy("dt").options(options).format("orc").mode(SaveMode.Append).saveAsTable("table_name")
The issue is that this creates lots of small files in HDFS, like part-00000, part-00000_copy_1, and so on. My table was created with transactions enabled, and I have enabled ACID transactions on the Hive instance; however, I can't see any compactions running, nor does one get created when I force compaction with the ALTER TABLE command. I would expect compaction to run and merge these files, as they are very small, around 200 KB in size. Any ideas or help greatly appreciated.
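In the meantime, one way to at least limit the number of files produced per micro-batch is to coalesce before the write. A sketch of the write above with that change (this trades write parallelism for fewer, larger files; it does not address why compaction isn't running):

```scala
rdd.toDF()
  .coalesce(1) // one output file per micro-batch instead of one per RDD partition
  .write
  .partitionBy("dt")
  .options(options)
  .format("orc")
  .mode(SaveMode.Append)
  .saveAsTable("table_name")
```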
Labels:
- Apache Hadoop
- Apache Hive
- Apache Spark