Support Questions

How to append data to existing part files when inserting into a Hive table via Spark Kafka streaming in Scala


Hi All, I am processing Kafka data with Spark and pushing it into Hive tables. When inserting into the table, I face an issue in the warehouse location: a new part file is created for every insert command. Please share a solution to avoid this problem; because of all the small files, a single select statement takes more than 30 minutes.

import spark.implicits._
// New data arrives from the Kafka consumer on every micro-batch and is assigned to jsonStr.
val jsonStr = """{"b_s_isehp" : "false","event_id" : "4.0","l_bsid" : "88.0"}"""
// Parse the JSON string into a DataFrame and append it to the Hive table.
val df = spark.read.json(Seq(jsonStr).toDS)
df.coalesce(1).write.mode("append").insertInto("tablename")

[screenshot: part-files-in-hadoop.png]
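For context, the build-up of part files can be reproduced with a minimal sketch like the one below (spark-shell style, assuming an existing SparkSession named spark). The second JSON record and the warehouse path /apps/hive/warehouse/tablename are hypothetical examples; the point is that "append" mode always adds new part files and never merges into existing ones.

import spark.implicits._
import org.apache.hadoop.fs.{FileSystem, Path}

// Two simulated micro-batches; each append writes a brand-new part file.
val batches = Seq(
  """{"b_s_isehp" : "false","event_id" : "4.0","l_bsid" : "88.0"}""",
  """{"b_s_isehp" : "true","event_id" : "5.0","l_bsid" : "89.0"}"""
)
batches.foreach { jsonStr =>
  val df = spark.read.json(Seq(jsonStr).toDS)
  df.coalesce(1).write.mode("append").insertInto("tablename")
}

// List the part files that have accumulated under the table's warehouse location (hypothetical path).
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/apps/hive/warehouse/tablename"))
  .foreach(status => println(status.getPath.getName))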
1 REPLY

Super Collaborator

Hi @yogesh turkane,

As far as I am aware, we can achieve this in two ways.

  1. After the data load, or at scheduled intervals, run "ALTER TABLE <table_name> CONCATENATE" on the table through the SQL API; this merges all the small ORC files associated with that table (a hedged sketch of scripting this follows right after this list). Please note that this is specific to ORC.
  2. Use a DataFrame to load the data, repartition it, and write it back with overwrite mode in Spark.
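A minimal sketch of how the first option could be scripted is below. This is an assumption on my part rather than something from the reply: depending on the Spark version, spark.sql may not accept this Hive-specific DDL, so the sketch issues the statement through the Hive JDBC driver instead. The HiveServer2 URL, user and table name are hypothetical placeholders, and the table is assumed to be ORC-backed.

// Minimal sketch: run the Hive-specific CONCATENATE DDL over Hive JDBC.
// Assumes an ORC-backed table and the Hive JDBC driver on the classpath;
// the connection URL, user and table name below are hypothetical.
import java.sql.DriverManager

object CompactOrcTable {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver2-host:10000/default", "hive_user", "")
    try {
      val stmt = conn.createStatement()
      // Merges the table's small ORC part files into fewer, larger files.
      stmt.execute("ALTER TABLE tablename CONCATENATE")
      stmt.close()
    } finally {
      conn.close()
    }
  }
}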

The code snippet for the second option would be:

val tDf = hiveContext.table("table_name")
tDf.repartition(<num_Files>).write.mode("overwrite").saveAsTable("targetDB.targetTable")

The second option will work with any file format.
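For completeness, a self-contained sketch of the second option using the newer SparkSession API might look like the following. The application name, table names and target file count are illustrative placeholders; note that the compacted copy is written to a different table, since Spark will not overwrite a table that is being read in the same job.

// Minimal sketch of option 2 with SparkSession (Spark 2.x) and Hive support enabled.
// Table names and the target file count below are placeholders to tune for your data.
import org.apache.spark.sql.{SaveMode, SparkSession}

object CompactHiveTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-hive-table")
      .enableHiveSupport()
      .getOrCreate()

    // Read the fragmented table, shrink it to a few partitions,
    // and write the compacted copy out as a separate table.
    val tDf = spark.table("tablename")
    tDf.repartition(4)
      .write
      .mode(SaveMode.Overwrite)
      .saveAsTable("targetDB.targetTable")

    spark.stop()
  }
}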

Hope this helps!