What is the Best Practice for Loading Files into ORC Files and Hive/ORC Tables

Master Guru

I would have liked to use Apache NiFi, but that is not yet available in the current version (coming soon).

I can do it with Sqoop, Pig, Spark, ...

Any other options?

For relational databases in bulk, Sqoop seems like a solid option.

For real-time, Spark Streaming?

For batch, Pig?

I am looking for performance, but also for ease of use and a minimal amount of coding.
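For reference, the bulk relational path with Sqoop can write directly into an ORC-backed Hive table through Sqoop's HCatalog integration. A minimal sketch, assuming Sqoop with HCatalog support is available on the cluster; the JDBC URL, credentials, and database/table names below are placeholders:

```bash
# Placeholder connection, credentials, and table names.
sqoop import \
  --connect "jdbc:mysql://dbhost:3306/salesdb" \
  --username etl \
  --password-file /user/etl/.dbpass \
  --table orders \
  --hcatalog-database warehouse_db \
  --hcatalog-table orders_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orc" \
  -m 4
```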

1 ACCEPTED SOLUTION


For a batch model, the "classic" pattern works for many: Sqoop the incremental data you need to ingest into a working directory on HDFS, then run a Pig script that loads that data with HCatLoader (define a Hive table against the working directory so the schema is inherited from the metastore), applies any transformations needed (possibly only a single FOREACH to project the columns in the correct order), and uses HCatStorer to write the result into a pre-existing ORC-backed Hive table. You can stitch it all together with an Oozie workflow. I know of a place that uses a simple yet novel Pig script like this to ingest 500 billion records per day into Hive.
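A minimal sketch of the Pig step described above, assuming a staging table staging_db.orders_staging defined over the Sqoop working directory and a target ORC table warehouse_db.orders_orc; the table and column names are placeholders, and the script would be run with pig -useHCatalog (for example from an Oozie Pig action):

```pig
-- Load the staged rows through the Hive table defined over the Sqoop
-- working directory, so the schema comes from the metastore via HCatLoader.
staged = LOAD 'staging_db.orders_staging'
         USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Any transformations go here; often a single FOREACH is enough to
-- project the columns in the order the target table expects.
projected = FOREACH staged GENERATE order_id, customer_id, amount, order_ts;

-- Append into the pre-existing ORC-backed Hive table.
STORE projected INTO 'warehouse_db.orders_orc'
      USING org.apache.hive.hcatalog.pig.HCatStorer();
```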


5 REPLIES

Super Guru
@Timothy Spann

I would prefer Sqoop incremental imports if latency is not a problem; however, I came across this blog post, which seems interesting, though I haven't tried it:

http://henning.kropponline.de/2015/05/19/hivesink-for-flume/
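For reference, the Flume HiveSink that post describes is configured roughly like this. A minimal sketch, assuming Flume 1.6+ and a target Hive table that is bucketed, stored as ORC, and has transactional=true; the agent/channel names, metastore URI, database/table, and field names are all placeholders (note the caveat about Hive transactions in the next reply):

```properties
# Hypothetical Flume agent "a1" with an existing source and channel "c1".
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://metastore-host:9083
a1.sinks.k1.hive.database = warehouse_db
a1.sinks.k1.hive.table = events_orc
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = ","
a1.sinks.k1.serializer.fieldnames = id,name,ts
```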


@Timothy Spann Be aware that Henning's post, while architecturally sound, relies on the Hive Streaming API, which implies reliance on Hive transaction support. Current advice is not to rely on transactions, at least until the Hive LLAP Tech Preview comes out at the end of June 2016.

Master Guru

I don't want to use Hive Streaming at this point; I am really focusing on Spark and NiFi, and am just curious about Pig, Sqoop, and other tools in the HDP stack.


Master Guru

Any example on GitHub?