Created 08-07-2016 02:22 PM
Hi experts, I have four .CSV files (three dimension tables and one fact table) in my HDFS. I already did some data cleansing in Apache Pig and I want to load them into Hive. My question is: is it a good idea to create the star schema in Hive, or is it better to create one big table? I didn't find any good article that explains which is the better way to apply data modeling in Big Data. Many thanks!
Created 08-08-2016 04:14 AM
How big are your dimension tables? For best speed, some denormalization will help. However, with the various improvements to Hive, and if your dimension tables are small enough for a map join, you may not see much difference between the two.
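For example, here is a minimal sketch of the map-join idea (the table and column names are made up for illustration): with auto-conversion enabled, Hive broadcasts the small dimension table to the mappers instead of doing a shuffle join, so the star-schema query stays cheap.

```sql
-- Let Hive convert joins against small tables into map joins automatically
SET hive.auto.convert.join=true;
-- Size threshold (bytes) below which a table is considered "small" enough to broadcast
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical fact table joined to a small dimension table
SELECT p.category,
       SUM(f.amount) AS total_amount
FROM   fact_sales f
JOIN   dim_product p ON f.product_id = p.product_id
GROUP BY p.category;
```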
Created 08-08-2016 08:03 AM
Thanks Ravi 🙂 Do you recommend any article that explains some methodologies for applying data modeling in Big Data? My dimensions are big, with a lot of columns...
Created 08-08-2016 04:03 PM
Hi Johnny
I would also suggest you consider complex types in Hive. They let you store all the data for a row together and avoid duplicating it, and because you are not creating fully normalized tables, you also avoid potentially expensive joins.
So think about nested data types like struct, map, and array. This is a good middle ground between normalization and denormalization. It doesn't take as much space as a fully denormalized table, and at the same time queries are not as expensive as in a normalized model because you avoid expensive joins.
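To make that concrete, here is a minimal sketch (the table layout and names are hypothetical): the dimension attributes are kept inside a struct and repeated values in an array, so the data stays together in one table and can be queried with dot notation and LATERAL VIEW instead of joins.

```sql
-- One table holding fact rows with their dimension attributes nested inline
CREATE TABLE sales_nested (
  order_id   BIGINT,
  customer   STRUCT<id:BIGINT, name:STRING, city:STRING>,
  items      ARRAY<STRUCT<product:STRING, qty:INT, price:DOUBLE>>,
  attributes MAP<STRING, STRING>
)
STORED AS ORC;

-- Access nested fields with dot notation; explode the array to get one row per item
SELECT order_id,
       customer.city,
       item.product,
       item.qty * item.price AS line_total
FROM   sales_nested
LATERAL VIEW explode(items) t AS item;
```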
Created 08-08-2016 04:09 PM
Hi mqureshi, many thanks for your help 🙂 I will look for good articles/tutorials that show me how to use complex types in Hive. Thanks!
Created 08-08-2016 09:29 PM
João Souza, if you find a good article, can you share it here? Many thanks!