How to append data to an existing part file when inserting into a Hive table from Spark Kafka streaming in Scala
Labels:
- Apache Hive
- Apache Kafka
- Apache Spark
Explorer
Created ‎03-02-2018 02:41 PM
Hi All, I am processing Kafka data with Spark and pushing it into Hive tables. While inserting into the table I hit an issue: in the warehouse location a new part file is created for every insert command. Please share a solution to avoid this; a single select statement on the table now takes more than 30 minutes.
import spark.implicits._

// Every time new data arrives from the Kafka consumer it is assigned to the jsonStr string.
val jsonStr = """{"b_s_isehp" : "false","event_id" : "4.0","l_bsid" : "88.0"}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.coalesce(1).write.mode("append").insertInto("tablename")
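For illustration, a minimal sketch of one way to keep the file count down, assuming the Kafka consumer can hand over a whole micro-batch of JSON strings at once (jsonBatch and its contents are hypothetical): each insertInto call writes at least one new part file, so inserting the batch in a single call instead of one record at a time produces far fewer files.

import spark.implicits._

// Hypothetical batch of JSON records collected from the Kafka consumer in one micro-batch.
val jsonBatch: Seq[String] = Seq(
  """{"b_s_isehp" : "false","event_id" : "4.0","l_bsid" : "88.0"}""",
  """{"b_s_isehp" : "true","event_id" : "5.0","l_bsid" : "89.0"}"""
)

// One insert for the whole batch writes a single coalesced part file instead of one file per record.
val batchDf = spark.read.json(jsonBatch.toDS)
batchDf.coalesce(1).write.mode("append").insertInto("tablename")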
1 REPLY
Super Collaborator
Created ‎03-06-2018 01:27 AM
Hi @yogesh turkane,
As far as I am aware, we can achieve this in two ways.
- After the data load, or at scheduled intervals, run "ALTER TABLE <table_name> CONCATENATE" on the table through the SQL API; this merges all the small ORC files associated with that table (a sketch follows this list). Please note that this is specific to ORC.
- Use a data frame to load the data, repartition it, and write it back with overwrite mode in Spark.
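A minimal sketch of the first option, assuming an ORC-backed table and a HiveContext named hiveContext; the table name is illustrative, and depending on the Spark version this Hive-specific DDL may need to be run from the Hive CLI or Beeline instead of through Spark.

// Merge the small ORC part files of the table into larger ones.
// ORC-only; run it after each load or on a schedule.
hiveContext.sql("ALTER TABLE mydb.tablename CONCATENATE")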
The code snippet for the second option would be:
val tDf = hiveContext.table("table_name")
tDf.repartition(<num_Files>).write.mode("overwrite").saveAsTable("targetDB.targetTable")
The second option works with any file format.
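Note that the snippet above writes the compacted data to a separate target table: many Spark versions refuse to overwrite a table that is also being read from in the same job, so a staging table is the usual workaround. A hedged sketch, with purely illustrative database/table names and file count:

// Compact the fragmented table into a fixed number of files via a staging table.
val numFiles = 8
val compacted = hiveContext.table("sourceDB.table_name").repartition(numFiles)
compacted.write.mode("overwrite").saveAsTable("sourceDB.table_name_compacted")
// Afterwards swap the tables in Hive (for example with ALTER TABLE ... RENAME TO ...)
// during a window when no writers are active.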
Hope this helps !!
