
Why does "distribute by" on some columns increase storage size dramatically?


Hi everyone,

There is something odd going on with my job. I am responsible for processing traffic logs and storing the processed data in a table. The log is so large that we need a good way to reduce the storage cost. The processed table has 30 columns: 14 of them describe the device and the app environment, and 8 of them are attributes of the point or page. Because our execution engine is Hive, we use ORC as the file format and zlib as the compression method.

To make the table easy for BI/BA to use, we also distribute the data by point_name before writing it to files. Here is the problem. If we load the data without "distribute by xx", it takes 14.5497 GB of storage. When I distribute by point_name (the name of the point in the web page or app), the storage nearly doubles to 28.6083 GB. When I distribute by cuid (the unique ID of the device), the storage size is 11.7286 GB. When I distribute by point_name + cuid, the storage is 29.6391 GB.
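In case it helps anyone reproduce the comparison on a small scale, here is a rough Python sketch of the kind of experiment described above. It is only a sketch under loud assumptions. The data is synthetic. Apart from cuid and point_name, the column names are hypothetical. Rows are written as plain text rather than ORC. `distribute` stands in for "distribute by" by hashing the key to a fixed number of buckets, and the no-key case uses a simple round-robin split. The absolute numbers mean nothing; it only shows how the same rows can be grouped in different ways and then measured after zlib compression.

```python
import random
import zlib
from collections import defaultdict


def make_rows(n_rows=20000, n_devices=500, n_points=50, seed=0):
    """Build synthetic log rows; device columns depend on cuid, page on point_name."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        cuid = rng.randrange(n_devices)
        point = rng.randrange(n_points)
        rows.append({
            "cuid": f"device-{cuid:05d}",
            "point_name": f"point-{point:03d}",
            "device_model": f"model-{cuid % 40}",  # hypothetical column
            "os_version": f"os-{cuid % 7}",        # hypothetical column
            "page": f"page-{point % 10}",          # hypothetical column
        })
    return rows


def distribute(rows, key_cols, n_reducers=8):
    """Send rows with the same key to the same bucket (stand-in for "distribute by")."""
    buckets = defaultdict(list)
    for i, row in enumerate(rows):
        if key_cols:
            key = "|".join(row[c] for c in key_cols)
            # crc32 is used so the bucket assignment is stable across runs.
            bucket = zlib.crc32(key.encode()) % n_reducers
        else:
            # Assumption: with no key, rows are spread round-robin in input order.
            bucket = i % n_reducers
        buckets[bucket].append(row)
    return buckets


def compressed_size(buckets):
    """Total bytes after zlib-compressing each bucket as one text file."""
    total = 0
    for bucket_rows in buckets.values():
        payload = "\n".join(",".join(r.values()) for r in bucket_rows).encode()
        total += len(zlib.compress(payload))
    return total


def compare(rows):
    """Compressed size for each of the four layouts tried in the post."""
    variants = {
        "none": [],
        "point_name": ["point_name"],
        "cuid": ["cuid"],
        "point_name+cuid": ["point_name", "cuid"],
    }
    return {name: compressed_size(distribute(rows, cols))
            for name, cols in variants.items()}


if __name__ == "__main__":
    for name, size in compare(make_rows()).items():
        print(f"{name:>16}: {size} bytes")
```

Changing `n_devices`, `n_points`, or `n_reducers` changes how many distinct values end up next to each other in each bucket, which is the only thing this toy version varies.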

I am really confused: how can "distribute by" affect the storage size so much? Can someone please explain this to me?

Thanks a lot.
