
how to append data in part file while insert in hive table by spark kafka streaming in scala.



Hi all, I am processing Kafka data with Spark and pushing it into Hive tables. Every insert command creates a new part file in the warehouse location. Because of all these small files, a single select statement takes more than 30 minutes. Please share a solution to avoid this problem.

import spark.implicits._
// Each time, new data arrives from the Kafka consumer and is assigned to the jsonStr string.
val jsonStr = """{"b_s_isehp" : "false","event_id" : "4.0","l_bsid" : "88.0"}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.coalesce(1).write.mode("append").insertInto("tablename")

Attachment: part-files-in-hadoop.png

Re: how to append data in part file while insert in hive table by spark kafka streaming in scala.


Hi @yogesh turkane,

As far as I know, there are two ways to achieve this.

  1. After loading the data, or at scheduled intervals, run "ALTER TABLE <table_name> CONCATENATE" on the table through the SQL API. This merges the small ORC files associated with that table. Please note that this is specific to ORC.
  2. Load the table into a data frame, repartition it, and write it back with overwrite mode in Spark.

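The code snippet for the second option would be:

val tDf = hiveContext.table("table_name")
// Replace <num_Files> with the number of output files you want.
// Write to a different table than the source; Spark typically refuses to overwrite a table it is reading from.
tDf.repartition(<num_Files>).write.mode("overwrite").saveAsTable("targetDB.targetTable")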

The second option works with any file format.
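
To pick a value for <num_Files>, one rough approach is to divide the table's total size by a target file size. Below is a minimal sketch, assuming a 128 MB target (a common HDFS block size) and that you already know the table's total size in bytes. CompactionSizing and targetFileCount are hypothetical names for illustration, not part of Spark or Hive.

```scala
object CompactionSizing {
  // Assumed target size per output file: 128 MB, a common HDFS block size.
  val DefaultTargetFileBytes: Long = 128L * 1024 * 1024

  /** Rough number of files needed to hold totalBytes at about targetFileBytes each (never less than 1). */
  def targetFileCount(totalBytes: Long, targetFileBytes: Long = DefaultTargetFileBytes): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetFileBytes > 0, "targetFileBytes must be positive")
    // Ceiling division using integer arithmetic only.
    val files = (totalBytes + targetFileBytes - 1) / targetFileBytes
    // Clamp to [1, Int.MaxValue] so the result is always a usable partition count.
    math.min(Int.MaxValue.toLong, math.max(1L, files)).toInt
  }
}
```

The result could then be passed to repartition in place of <num_Files>, for example tDf.repartition(CompactionSizing.targetFileCount(totalBytes)). The size on disk is only an estimate of what the rewritten files will occupy, so treat the count as a starting point.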

Hope this helps!
