
Integrate an Oracle DW with Hadoop, or move all operational data to Hadoop?

Rising Star

Hello, I have an Oracle DW which is connected to my core system to record the data. I'm planning to introduce the Hadoop ecosystem in my next release. What's the best option/architecture? (If anyone knows an article that covers this, it would be great if you could share it.)

  • Connect Hadoop directly to the core and record all the operational data there. Basically two steps (Operational Data -> Hadoop)
  • Connect Hadoop to my Data Warehouse and have three steps (Operational Data -> DW -> Hadoop). Again, it would be great if anyone could share some articles related to this 🙂 Thanks! 🙂
1 ACCEPTED SOLUTION

Master Guru

There are lots of resources out there, for example:

http://info.hortonworks.com/rs/h2source/images/Hadoop-Data-Lake-white-paper.pdf

The main question is what you want to do on the consumption side.

Hive has progressed a lot with Tez, ORC, predicate pushdown, a cost-based optimizer, etc., and should easily be able to outdo an Oracle DW for heavy analytical queries, unless you have a really expensive Exadata system (even then it depends on the queries). It would be great, for example, for running daily/weekly aggregations. However, it is not yet able (LLAP is coming) to compete with Oracle for smaller interactive queries, so it will be hard to connect a reporting tool directly to it for interactive queries, for example.

So the question is what you plan to do with the warehouse. Do you connect to it directly with reporting software that runs in interactive mode? (Hive is not good for that yet, although Phoenix might be.) Or do you mostly use it for aggregation queries, cube generation, etc. (for example, creating a daily Tableau report that is then provided to users)? In the latter case you might be able to get rid of your warehouse entirely. (Some tools like Platfora provide a neat combination.)

Of course an Oracle warehouse provides some other advantages, like UPDATE/DELETE capabilities (Hive has ACID now, but it's pretty new), stored procedures, foreign key relationships (though often disabled for big loads), integration with most common backup tools, etc. So there may be other reasons to keep a warehouse around as the main store.

In that case you would go with one of the following scenarios:

- Data warehouse offloading

Use Hadoop for workloads, like heavy aggregations, that take up too much of the warehouse's capacity and are better done in Hadoop.
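An offloaded workload of this kind is typically just a heavy GROUP BY moved onto the cluster. A minimal sketch of such a daily aggregation, using Python's built-in sqlite3 purely as a stand-in for Hive so the example is runnable (the table and column names are made up):

```python
import sqlite3

# In-memory database standing in for a Hive table of raw sales events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2016-01-01", "EMEA", 100.0),
        ("2016-01-01", "EMEA", 50.0),
        ("2016-01-01", "APAC", 75.0),
        ("2016-01-02", "EMEA", 20.0),
    ],
)

# The kind of daily aggregation that is cheap on Hadoop but expensive
# on a busy warehouse: total revenue per day and region.
rows = conn.execute(
    """
    SELECT sale_date, region, SUM(amount) AS total
    FROM sales
    GROUP BY sale_date, region
    ORDER BY sale_date, region
    """
).fetchall()

for row in rows:
    print(row)
```

On a real cluster the same SQL would run as HiveQL over an ORC table; only the engine changes, not the shape of the query.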

- Landing zone

Use Hadoop as an initial data landing and transformation zone before the data goes into the warehouse. Utilize Hadoop's ability to store data in any format and to keep large amounts of data around.
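In practice a landing zone often just means writing raw source extracts into date-partitioned directories before any transformation happens. A hypothetical sketch of that layout (local paths stand in for HDFS here; on a real cluster you would use `hdfs dfs -put` or WebHDFS instead):

```python
import tempfile
from pathlib import Path

# Local directory standing in for an HDFS landing zone.
landing_root = Path(tempfile.mkdtemp()) / "landing" / "crm"

def land_raw_extract(day: str, payload: bytes) -> Path:
    """Write a raw, untransformed extract into a date partition.

    Keeping the source bytes verbatim is the point of a landing zone:
    transformations happen later, and the raw data stays around.
    """
    partition = landing_root / f"dt={day}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "extract.csv"
    target.write_bytes(payload)
    return target

path = land_raw_extract("2016-01-01", b"id,name\n1,alice\n")
print(path)
```

The `dt=YYYY-MM-DD` directory naming mirrors the Hive partition convention, so the landed files can later be exposed as an external table without moving them.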

- Hot archiving

Archive old data into Hadoop so it is still accessible.

- Advanced analytics

Run things like Spark MLlib or R on Hadoop for workloads that may not be possible in the warehouse, either technically or because of cost.

- Unstructured analytics

Augment your warehouse with unstructured data (emails, social media data, ...) in Hadoop.

- Realtime ingestion using Flume, Kafka, Spark Streaming, or Storm
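The realtime path is conceptually a stream of events bucketed into small time windows before landing in Hadoop. The toy, dependency-free illustration below shows only that micro-batch idea; it is not Spark Streaming or Storm code, and real ingestion would use those frameworks' own APIs:

```python
from collections import Counter

# Simulated event stream: (epoch_seconds, event_type) pairs.
events = [
    (0, "click"), (1, "click"), (2, "view"),
    (5, "click"), (6, "view"), (6, "view"),
]

WINDOW = 5  # seconds per micro-batch

# Bucket events into fixed windows and count per type: the same shape
# of computation a windowed count in Spark Streaming performs.
batches = {}
for ts, kind in events:
    batches.setdefault(ts // WINDOW, Counter())[kind] += 1

for window, counts in sorted(batches.items()):
    print(window, dict(counts))
```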

So, summarizing, all approaches are valid:

Source -> Hadoop (if you can completely replace the warehouse)

Source -> Hadoop -> Warehouse (with Hadoop as an ETL landing zone, which gives you advantages like keeping source data around if desired and reducing the need for massive ETL installations; you might still have them, but many can push heavy computations into Hadoop and so could be smaller)

Source -> Warehouse -> Hadoop (warehouse offloading, archival), which has the advantage that you can keep the existing environment while reducing pressure on your warehouse.
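For the Source -> Warehouse -> Hadoop flow, the usual transport is Sqoop pulling tables out of Oracle over JDBC. A hedged sketch of assembling such an import follows; the hostname, SID, username, and table names are placeholders, and the command is only printed here, not executed:

```python
# Sqoop import pulling an Oracle warehouse table into Hive.
# All connection details below are hypothetical, for illustration only.
sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@dw-host:1521:ORCL",  # placeholder host/SID
    "--username", "etl_user",                            # placeholder user
    "--table", "SALES_HISTORY",                          # placeholder DW table
    "--hive-import",                                     # create/load a Hive table
    "--hive-table", "archive.sales_history",
    "--num-mappers", "4",                                # parallel JDBC extract tasks
]

print(" ".join(sqoop_cmd))
```

Run on a schedule (e.g. from Oozie or cron), this is the whole of the "hot archiving" scenario: old partitions leave the warehouse but remain queryable from Hive.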

Hope that helps.
