09-21-2016 11:35 PM
Is there a rule of thumb recommendation for modelling Hive tables on HDFS, whether to store them as "wide" with lots of columns OR break them apart into smaller tables which have to be joined during query execution?
Parameters to consider
- Number of attributes, e.g. what is the max reasonable number of columns?
- Use of complex types (arrays, maps) vs parent-child relationships between tables
- Data stored as Avro vs data stored as Parquet
- Usage in Imapala and Hive
09-26-2016 12:13 PM
Giving this a shot to be helpful, I personally have more experience with system performance rather than data design.
Storing the data in HDFS: for performance, you will want to shoot for files at least 1 block size in size (128MB by default) on the surface, smaller tables would seem to translate to a file per table, and depending on your data this might give you files significantly under the block size, which will perform badly. wide rows would seem to have the challenge of growing too large, this would have memory implications for any process trying to read the entire row.
If random access is desired, I would consider using HBase, or at the very least look at the design of hfile for guidance.