Hello Experts, I need some ideas on how to manage relational data in hadoop.
requirements : 1. Replicate relational tables from one of our transactional system into hadoop. Some of the business users want to query near realtime data using hive
2. Multiple tables have to be joined to get the final result and each table will have updates, inserts, deletes captured every day.
I want to consolidate all the data in one data hub ( hadoop) and avoid any redundancy and provide near realtime reporting capability to end users directly on the base data set
How can I design and model relational data in hadoop ?
It is possible, but dependent on your actual approach not the best way to achieve the result.
The first options coming into my mind are:
But is it recommended to model this relational in hadoop/hive? It really depends on your requirements, but it is not the typical use case for a hadoop platform. I rather would go perhaps for another strategy, perhaps 'denormalizing' all your tables into text files, and then use hive, or maybe even go fo a Hbase table.
The issue with joins in a cluster is how the data is distributed across the nodes. If tableA is distributed by its PK and has a FK to tableB, which is also distributed by its PK, it might end up with Row1 of tableA to be joined to Row20 and Row45 of tableB, and Row1 of table A being on one node, rows 20 and 45 of tableB on two other nodes. Now the data needs to travel across the nodes in the cluster, which is increasing network traffic with the size of your cluster, not actually scaling well. And with hdfs it might be difficult to ensure the data is distributed in a fitting manner for your joins. This kind of issue is better handled with a SQL db in a cluster (i.e. Postgre).
If you denormalize the data and store it, the queries can scale with the number of nodes in your cluster much easier, because each nodes holds complete entities (rows) that can be processed locally. And the size of the data shouldn't be an issue in hadoop.
Harald, Thank for your quick response and suggestions
One of the reasons to consider hadoop was to store all the data (OLTP, Unstructured data from other systems(future request)) in one place and provide ability to query across different datasets.
I will definetly consider denormalization and leverage spark for local processing.
Given near realtime data access needs, goal was to reduce the amount of processing and provide faster querying capabilities
Never just replicate SQL tables to Hive or HBase. Hadoop is best for wide tables, denormalized. Joins are never a great idea in Hadoop. You can do deletes and updates in Hive ACID tables, but it's best to keep those as individual records. it's nice to capture those changes, marked with a symbol like D. This way you can have a record of what happened.
How many billion rows will you have per table? How many 100s of terabytes?
Timothy, The requirement is to replicate appx 10 to 15 transaction tables appx > 3 billion rows each and few other tables with < billion rows into hadoop. All the tables have inserts, updates and deletes and estimated growth is 2x to 4x
Do you suggest keeping updates and deletes in one hive table along with inserts and then querying the data based on most recent record ( based on id, modified_date ?)
or maintain two separate current and history tables for each transaction table replicated ?
Denormalizing is another processing step that has to be executed for better performance, before users can query the transaction data