Support Questions
Find answers, ask questions, and share your expertise


Best practice: streaming data pipeline on Hadoop

Explorer

Hi,

We would like to implement a fairly basic streaming data pipeline into our Hadoop cluster.

App events are already sent to a Kafka topic. The ideal solution would stream the data (JSON) directly into a Hive table, so that the BI team can run its analyses on that data in near real time. I researched a bit but did not find a "best practice" solution for this case. We use Hortonworks HDP with the standard tech stack such as Flume, Spark...

Here are my questions:

- What is the best practice for streaming events to BI?

- Is there an example that fits this case?

Thanks in advance, KF2

1 ACCEPTED SOLUTION


Re: Best practice: streaming data pipeline on Hadoop

So the data is already in a Kafka topic?

Then you have a whole bouquet of options for streaming the data into HDFS/Hive. One question is whether you want your Hive tables to be ORC or whether delimited files are acceptable.

a) Directly streaming into Hive tables using Hive ACID

http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/

I don't like this approach too much, since Hive ACID is still quite new. However, it has been out for a while and may be worth a shot. It creates ORC files directly.

b) Stream data into HDFS using Storm (HDFSBolt), then use a rotator to move the data into a Hive table partition

http://hortonworks.com/hadoop-tutorial/processing-streaming-data-near-real-time-apache-storm/

You can also schedule an Oozie job every 15 minutes or every hour to create ORC files. Normally that cadence is good enough for batch queries, and you can run any real-time queries directly in Storm.
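As a rough sketch of that rotation step, the scheduled job could generate an INSERT OVERWRITE that moves one hour of landing data into an ORC partition. The table names and the `dt` partition column here are made up for illustration, not from the original post:

```python
from datetime import datetime

def compaction_hql(landing_table, orc_table, ts):
    """Build the HQL that moves one hour of data from the delimited
    landing table into a partition of the ORC-backed table.
    Table names and the 'dt' partition column are placeholders."""
    part = ts.strftime("%Y-%m-%d-%H")
    return (
        "INSERT OVERWRITE TABLE {orc} PARTITION (dt='{p}') "
        "SELECT * FROM {landing} WHERE dt='{p}'".format(
            orc=orc_table, landing=landing_table, p=part)
    )

# The statement the hourly job would hand to Hive for the 10:00 batch
print(compaction_hql("events_landing", "events_orc", datetime(2016, 3, 1, 10)))
```

An Oozie Hive action (or a simple cron job calling beeline) could then execute the generated statement once per hour.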

c) Spark Streaming

Similar to Storm, you can run real-time queries directly in Spark Streaming (you can even use Spark SQL if you like SQL) and then write into a Hive table. You just need to make sure you don't create too many small files, so if you write to the Hive table at very short intervals you will run into issues. But as said, this is normally not needed, since you can run your real-time queries directly in Spark.
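To illustrate the small-files concern, here is a minimal, framework-free sketch of the kind of buffering a streaming writer would do: accumulate records and flush only once a size or age threshold is crossed. The class and the thresholds are illustrative, not part of any Spark or Storm API:

```python
import time

class RecordBuffer:
    """Buffers streamed records and flushes only when enough data has
    accumulated, to avoid writing many tiny HDFS files."""

    def __init__(self, flush_bytes=64 * 1024 * 1024, flush_secs=300, sink=None):
        self.flush_bytes = flush_bytes      # flush once this much data is buffered
        self.flush_secs = flush_secs        # ... or once the buffer is this old
        self.sink = sink or (lambda batch: None)  # e.g. write one HDFS file
        self.buf = []
        self.size = 0
        self.last_flush = time.time()

    def add(self, record):
        self.buf.append(record)
        self.size += len(record)
        if self.size >= self.flush_bytes or time.time() - self.last_flush >= self.flush_secs:
            self.flush()

    def flush(self):
        if self.buf:
            self.sink(self.buf)             # one reasonably sized file per flush
        self.buf, self.size = [], 0
        self.last_flush = time.time()
```

With a 64 MB / 5 minute policy each flush produces one decent-sized file instead of one file per micro-batch.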

If you need a SQL interface that lets you insert and query data within seconds and is fairly stable, you could look at Phoenix:

https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html

d) One of the many frameworks that move Kafka data into HDFS directly, such as Camus or Gobblin

https://github.com/linkedin/camus

If you ask me? I would most likely play it safe and use Storm or Spark Streaming to write to HDFS folders, then have a background task (Oozie/Falcon) create the ORC partitions. This way your main (ORC-backed) Hive table stays nice, fast, and optimized; you can create an external table on the delimited intermediate results (and union it with the ORC table); and you can run any truly real-time queries in Storm/Spark Streaming. If you want to query data with SQL in real time and don't need to aggregate more than a couple of million rows at a time, Phoenix would be better than Hive.
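The union of the compacted ORC table with the delimited external landing table can be exposed to the BI team as one Hive view. This sketch just assembles the DDL; all names are hypothetical, and the UNION ALL is wrapped in a subquery since older Hive versions require that:

```python
def union_view_ddl(view_name, orc_table, external_table):
    """DDL for a view combining the compacted ORC data with the
    not-yet-compacted delimited landing data. Names are placeholders."""
    return (
        "CREATE VIEW {v} AS SELECT * FROM ("
        "SELECT * FROM {orc} UNION ALL SELECT * FROM {ext}) u".format(
            v=view_name, orc=orc_table, ext=external_table)
    )

print(union_view_ddl("events_all", "events_orc", "events_landing"))
```

Queries against the view see both the optimized history and the freshest rows, at the cost of a slower scan over the delimited tail.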

Once Hive ACID is more stable, though, that will be the way to go.




Re: Best practice: streaming data pipeline on Hadoop

Explorer

Thank you for your answers, that really helps. I'm a bit further along now.

Right now, a cron'd Python script on the NameNode writes the Kafka stream to HDFS every 5 minutes (external table, JSON). Every hour, another script runs an INSERT OVERWRITE that moves the data from the external table into an ORC table, partitioned and clustered. That table serves as the BI table for real-time analysis. My next step would be to change the first script to insert directly into the Hive table, so that I can eliminate the second script. Thanks for any suggestions.
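One incremental step for the five-minute script (before going anywhere near Hive ACID) is to write each batch straight into the directory layout of a partitioned external table, so new data only needs an ADD PARTITION rather than a copy. A rough local sketch, with the `dt` partition column, paths, and record fields all made up for illustration:

```python
import json
import os
from datetime import datetime

def partition_path(base_dir, ts):
    """Directory layout Hive expects for an hourly-partitioned external
    table (a 'dt' partition column is assumed here)."""
    return os.path.join(base_dir, "dt=" + ts.strftime("%Y-%m-%d-%H"))

def write_batch(base_dir, ts, records):
    """Write one 5-minute batch of JSON records into the partition
    directory for its hour, one JSON document per line."""
    path = partition_path(base_dir, ts)
    os.makedirs(path, exist_ok=True)
    fname = os.path.join(path, ts.strftime("%Y%m%d%H%M%S") + ".json")
    with open(fname, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return fname
```

In the real pipeline the script would write to HDFS rather than the local filesystem, and then register the hour's partition with the external table so the hourly compaction (or a query) can pick it up.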
