Created 06-06-2016 06:46 PM
I would have liked to use Apache NiFi, but that is not yet available in the current version (coming soon).
I can do it from Sqoop, Pig, Spark, ...
Any other options?
For relational databases in bulk, Sqoop seems like a solid option.
For real-time, Spark Streaming?
For batch, Pig?
I am looking for performance, but also ease of use and a minimal amount of coding.
Created 06-06-2016 06:52 PM
I would prefer Sqoop incremental if latency is not a problem; however, I came across this blog, which seems interesting, though I haven't tried it.
http://henning.kropponline.de/2015/05/19/hivesink-for-flume/
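For illustration, a minimal sketch of a Sqoop incremental-append import along those lines might look like this (the JDBC URL, database, table, and check column are placeholders, not from the thread):

    # Pull only rows newer than the last captured order_id into an HDFS working directory
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/staging/orders \
      --incremental append \
      --check-column order_id \
      --last-value 0 \
      --num-mappers 4

On subsequent runs you would pass the last captured value (or let a Sqoop saved job track it) so only new rows are imported.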
Created 06-06-2016 07:15 PM
@Timothy Spann Be aware that Henning's post, while architecturally sound, relies on the Hive Streaming API, which implies reliance on Hive transaction support. Current advice is not to rely on transactions, at least until the Hive LLAP Tech Preview comes out at the end of June 2016.
Created 06-06-2016 07:49 PM
I didn't want to use Hive Streaming at this point. I was really focusing on Spark and NiFi. Just curious about Pig, Sqoop and other tools in the HDP stack.
Created 06-06-2016 09:10 PM
For a batch model, the "classic" pattern works for many: Sqoop the incremental data you need to ingest into a working directory on HDFS, then run a Pig script that loads that data with HCatLoader (define a Hive table against the working directory so the schema is inherited), applies any transformations needed (possibly only a single FOREACH to project the columns in the correct order), and uses HCatStorer to store the result into a pre-existing ORC-backed Hive table. You can stitch it all together with an Oozie workflow. I know of a place that uses a simple-and-novel Pig script like this to ingest 500 billion records per day into Hive.
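For illustration, a minimal sketch of the Pig step in that pattern might look like this (table and column names are placeholders; it assumes a Hive table default.staging_orders defined over the Sqoop working directory and a pre-existing ORC-backed table default.orders_orc):

    -- Run with: pig -useHCatalog ingest.pig
    -- Load the staged data through the Hive schema (HCatLoader)
    staged = LOAD 'default.staging_orders'
             USING org.apache.hive.hcatalog.pig.HCatLoader();

    -- Project the columns into the order expected by the target table
    projected = FOREACH staged GENERATE order_id, customer_id, amount, order_date;

    -- Store into the pre-existing ORC-backed Hive table (HCatStorer)
    STORE projected INTO 'default.orders_orc'
          USING org.apache.hive.hcatalog.pig.HCatStorer();

The Sqoop import into the working directory and this Pig script would then be chained as actions in the Oozie workflow.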
Created 06-07-2016 02:11 AM
Any example on GitHub?