What is the Best Practice for Loading Files into ORC Files and Hive/ORC Tables
Labels: Apache Hive, Apache Pig, Apache Spark
Created 06-06-2016 06:46 PM
I would have liked to use Apache NiFi, but it is not yet available in the current version (coming soon). I can do it with Sqoop, Pig, Spark, ...
Any other options?
For relational databases, in bulk, Sqoop seems like a solid option.
For real-time, Spark Streaming?
For batch, Pig?
I am looking for performance, but also ease of use and a minimal amount of coding.
Created 06-06-2016 09:10 PM
For a batch model, the "classic" pattern works for many: Sqoop the incremental data you need to ingest into a working directory on HDFS, then run a Pig script that loads that data with HCatLoader (define a Hive table against the working directory so the schema can be inherited), applies any transformations needed (possibly only a single FOREACH to project the columns in the correct order), and uses HCatStorer to store the result into a pre-existing ORC-backed Hive table. You can stitch it all together with an Oozie workflow. I know of a place that uses a simple-and-novel Pig script like this to ingest 500 billion records per day into Hive.
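As a rough illustration, here is a minimal sketch of that Pig step. The database, table, and column names are hypothetical; it assumes a Hive table `staging.raw_events` is already defined over the Sqoop working directory and that the ORC-backed target table `warehouse.events_orc` already exists.

```pig
-- ingest.pig: run with `pig -useHCatalog ingest.pig`

-- Load the staged data through HCatLoader so the schema is inherited from
-- the Hive metastore instead of being declared by hand.
raw = LOAD 'staging.raw_events' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- A single FOREACH to project the columns in the order the target table
-- expects (any light transformations would also go here).
projected = FOREACH raw GENERATE event_id, event_ts, user_id, payload;

-- Append the result into the pre-existing ORC-backed Hive table.
STORE projected INTO 'warehouse.events_orc'
    USING org.apache.hive.hcatalog.pig.HCatStorer();
```

An Oozie workflow can then chain the Sqoop action and this Pig action into a single scheduled job.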
Created 06-06-2016 06:52 PM
I would prefer Sqoop incremental imports if latency is not a problem. However, I came across this blog, which seems interesting, though I haven't tried it:
http://henning.kropponline.de/2015/05/19/hivesink-for-flume/
Created 06-06-2016 07:15 PM
@Timothy Spann Be aware that Henning's post, while architecturally sound, relies on the Hive Streaming API, which implies reliance on Hive transaction support. Current advice is not to rely on transactions, at least until the Hive LLAP Tech Preview comes out at the end of June 2016.
Created 06-06-2016 07:49 PM
I didn't want to use Hive Streaming at this point. I was really focusing on Spark and NiFi. Just curious about Pig, Sqoop and other tools in the HDP stack.
Created 06-07-2016 02:11 AM
Any example on GitHub?
