Created 08-07-2016 02:22 PM
Hi experts, I have four .CSV files (three dimension tables and one fact table) in my HDFS. I already did some data cleansing in Apache Pig and I want to load them into Hive. My question is: is it a good idea to create the star schema in Hive, or is it better to create one big table? I didn't find any good article that explains which is the better way to apply data modeling in Big Data. Many thanks!
Created 08-08-2016 04:14 AM
How big are your dimension tables? For best speed, some denormalization will help. However, with the various improvements to Hive, and if your dimension tables are small enough for a map join, you may not see much difference between the two.
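For example, here is a minimal sketch of the map-join idea (the table and column names are made up for illustration): with auto-conversion enabled, Hive broadcasts the small dimension table to the mappers instead of doing a shuffle join, so the star-schema query stays cheap.

```sql
-- Let Hive convert joins against small tables into map joins automatically
SET hive.auto.convert.join=true;
-- Size threshold (bytes) below which a table is considered "small" enough to broadcast
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical fact table joined to a small dimension table
SELECT p.category,
       SUM(f.amount) AS total_amount
FROM   fact_sales f
JOIN   dim_product p ON f.product_id = p.product_id
GROUP BY p.category;
```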
Created 08-08-2016 08:03 AM
Thanks Ravi 🙂 Do you recommend any article that explains some methodologies for applying data modeling in Big Data? My dimensions are big, with a lot of columns...
Created 08-08-2016 04:03 PM
Hi Johnny
I would also suggest you consider complex types in Hive. They let you store all the data for a row together and avoid duplicating it, and because you are not creating fully normalized tables, you also avoid potentially expensive joins.
So think about nested data types like struct, map, and array. This is a good middle ground between normalization and denormalization. It doesn't take as much space as a fully denormalized table, and at the same time queries are not as expensive as in a normalized model because you avoid expensive joins.
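To make that concrete, here is a minimal sketch (the table layout and names are hypothetical): the dimension attributes are kept inside a struct and repeated values in an array, so the data stays together in one table and can be queried with dot notation and LATERAL VIEW instead of joins.

```sql
-- One table holding fact rows with their dimension attributes nested inline
CREATE TABLE sales_nested (
  order_id   BIGINT,
  customer   STRUCT<id:BIGINT, name:STRING, city:STRING>,
  items      ARRAY<STRUCT<product:STRING, qty:INT, price:DOUBLE>>,
  attributes MAP<STRING, STRING>
)
STORED AS ORC;

-- Access nested fields with dot notation; explode the array to get one row per item
SELECT order_id,
       customer.city,
       item.product,
       item.qty * item.price AS line_total
FROM   sales_nested
LATERAL VIEW explode(items) t AS item;
```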
Created 08-08-2016 04:09 PM
Hi mqureshi, many thanks for your help 🙂 I will look for good articles/tutorials that show me how to use complex types in Hive. Thanks!
Created 08-08-2016 09:29 PM
João Souza, if you find a good article, can you share it here? Many thanks!