What is the Best Practice for Loading Files into ORC Files and Hive/ORC Tables

Master Guru

I would like to use Apache NiFi, but it is not yet available in the current version (coming soon).

I can do it with Sqoop, Pig, Spark, ...

Any other options?

For relational databases in bulk, Sqoop seems like a solid option.

For real-time, Spark Streaming?

For batch, Pig?

I am looking for performance, but also ease of use and a minimal amount of coding.

5 REPLIES

Super Guru
@Timothy Spann

I would prefer Sqoop incremental imports if latency is not a problem; however, I came across this blog, which seems interesting, though I haven't tried it:

http://henning.kropponline.de/2015/05/19/hivesink-for-flume/
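
For reference, a Sqoop incremental import of the kind mentioned above might look roughly like the sketch below; the JDBC URL, credentials, table, check column, and target directory are all hypothetical placeholders.

```bash
# Hypothetical incremental append import into an HDFS working directory.
# --check-column / --last-value drive the incremental load; wrapping this in
# a saved "sqoop job" lets Sqoop track the last value between runs.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 0 \
  --target-dir /data/staging/orders
```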

@Timothy Spann Be aware that Henning's post, while architecturally sound, relies on the Hive Streaming API, which implies reliance on Hive transaction support. Current advice is not to rely on transactions, at least until the Hive LLAP Tech Preview comes out at the end of June 2016.
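
For context, the Hive Streaming API (which the Flume Hive sink uses) can only write into a table that satisfies Hive's transaction requirements, roughly: stored as ORC, bucketed, and created with transactions enabled, plus a transaction manager configured on the server side. The DDL below is a hypothetical illustration of such a table, not something from the thread.

```sql
-- Hypothetical table meeting the Hive Streaming / ACID requirements of that
-- era: ORC storage, bucketing, and transactional=true. The server must also
-- run with hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
-- and the compactor enabled in hive-site.xml.
CREATE TABLE web_events (
  event_id   BIGINT,
  user_id    BIGINT,
  event_type STRING,
  event_ts   TIMESTAMP
)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```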

Master Guru

I didn't want to use Hive Streaming at this point. I was really focusing on Spark and NiFi. Just curious about Pig, Sqoop and other tools in the HDP stack.
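
Since Spark is called out above as the preferred route, here is a minimal batch sketch using the Spark 1.6-era DataFrame API; the input path, schema, and table name are hypothetical, and it assumes the ORC-backed Hive table already exists.

```python
# Minimal sketch: read staged data and append it into an existing ORC-backed
# Hive table. Requires a Spark build with Hive support.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-ingest-sketch")
sqlContext = HiveContext(sc)

# Hypothetical staged input on HDFS (JSON used here just to keep the sketch
# self-contained; Sqoop output would typically be delimited text).
df = sqlContext.read.json("/data/staging/orders_json")

# Append into the pre-existing ORC-backed Hive table; column order must
# match the table definition.
df.write.insertInto("warehouse_db.orders_orc")
```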


For a batch model, the "classic" pattern works for many: Sqoop the incremental data you need to ingest into a working directory on HDFS, then run a Pig script that loads that data with HCatLoader (define a Hive table against the working directory so the schema can be inherited), applies any transformations needed (possibly only a single FOREACH to project the columns in the correct order), and uses HCatStorer to store the result into a pre-existing ORC-backed Hive table. You can stitch it all together with an Oozie workflow. I know of a place that uses a simple Pig script like this to ingest 500 billion records per day into Hive.
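
As a rough illustration of the pattern described above, a minimal Pig sketch might look like this. The database/table names and columns are hypothetical; it assumes HCatalog support is enabled (pig -useHCatalog), a Hive table is defined over the Sqoop working directory, and the ORC-backed target table already exists.

```pig
-- Minimal sketch of the Sqoop -> Pig/HCatalog -> ORC-backed Hive table pattern.
-- Run with: pig -useHCatalog ingest.pig
-- 'staging_db.orders_staging' is a Hive table defined over the HDFS working
-- directory Sqoop wrote to; 'warehouse_db.orders_orc' is the pre-existing
-- ORC-backed target table. Both names are hypothetical.
raw = LOAD 'staging_db.orders_staging'
      USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Project the columns in the order the target table expects
-- (any additional cleansing/transformations would go here).
projected = FOREACH raw GENERATE order_id, customer_id, order_ts, amount;

STORE projected INTO 'warehouse_db.orders_orc'
      USING org.apache.hive.hcatalog.pig.HCatStorer();
```

An Oozie workflow can then chain the Sqoop import and this Pig script into the recurring ingest job described above.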

Master Guru

Any example on GitHub?