Here are the Requirements:
Total Data Size - Uncompressed: 13.5 TB; Compressed: 2 TB
A large virtual fact table: a view containing a UNION ALL of 3 large tables, 11 billion records in total.
Another view takes that large virtual fact table and applies consecutive Left Outer Joins to 8 dimension tables, so that the result is always the full 11 billion records (both views are sketched just after this list).
The fact data carries a timestamp column that can be used to filter rows.
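To make the workload concrete, here is a minimal HiveQL sketch of what those two views and a timestamp filter might look like. All table, view, and column names (fact_2014, dim_customer, event_ts, and so on) are hypothetical placeholders rather than anything from the actual requirements, and only 2 of the 8 dimension joins are shown:

-- A view that UNION ALLs the three large fact tables (~11 billion rows combined).
CREATE VIEW fact_all AS
SELECT * FROM fact_2014
UNION ALL
SELECT * FROM fact_2015
UNION ALL
SELECT * FROM fact_2016;

-- A second view that LEFT OUTER JOINs the virtual fact table to the dimension
-- tables; every fact row is preserved, so the result is always ~11 billion records.
CREATE VIEW fact_denormalized AS
SELECT f.*,
       c.customer_name,
       p.product_name
FROM fact_all f
LEFT OUTER JOIN dim_customer c ON f.customer_id = c.customer_id
LEFT OUTER JOIN dim_product  p ON f.product_id  = p.product_id;

-- The timestamp column lets a query restrict how many rows it actually touches.
SELECT COUNT(*)
FROM fact_denormalized
WHERE event_ts >= '2016-01-01' AND event_ts < '2016-02-01';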
Suppose you were given the following setup. How would you begin configuring Hortonworks for Hive? Would you focus on storage? How would we configure for compute?
Let's assume:
Platform: AWS
Data Node Instance: r3.4xlarge
Cores: 16
RAM: 122 GB
EBS Storage: 2 x 1TB Disks
So where do we begin?
First, Some Quick Calculations:
Memory per Core: 122 GB / 16 = 7.625 GB; approximately 8 GB per CPU core.
This means our largest practical container size per core is 8 GB.
However, we should not reserve all 16 cores for Hadoop; some cores are needed for the OS and other processes.
Let's assume 14 cores are reserved for YARN.
Memory Allocated for All YARN Containers on a Node = No. of Cores Reserved for YARN x Memory per Core
114,688 MB = 14 x 8,192 MB (8 x 1,024 MB)
Note also:
At 8 GB per container, we can run 14 tasks (mappers or reducers) in parallel, one per core, without wasting RAM. We can certainly run container sizes smaller than 8 GB if we wish.
Since our optimal container size per core is 8 GB, our YARN minimum container size should be a factor of 8 GB to prevent wasted memory, that is: 1, 2, 4, or 8 GB.
However, the Tez container size for Hive must be a multiple of the YARN minimum container size, as in the sketch below.
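Here is a sketch of how those numbers could map onto actual settings, assuming a 4 GB YARN minimum container size. The cluster-level YARN values are normally set in yarn-site.xml through Ambari (shown only as comments below), while the Hive/Tez values can be set per session:

-- Cluster-level YARN settings (yarn-site.xml via Ambari), shown as comments because
-- they cannot be changed from a Hive session:
--   yarn.nodemanager.resource.memory-mb  = 114688   (14 cores x 8192 MB)
--   yarn.nodemanager.resource.cpu-vcores = 14
--   yarn.scheduler.minimum-allocation-mb = 4096     (assumed 4 GB minimum, a factor of 8 GB)

-- Hive/Tez session settings: the container size is a multiple of the 4096 MB minimum
-- and matches the roughly 8 GB of memory available per reserved core.
SET hive.tez.container.size=8192;
SET hive.tez.java.opts=-Xmx6554m;   -- about 80% of the container reserved for the JVM heap

With these values, 14 such 8 GB containers fit exactly into the 114,688 MB that each node offers to YARN.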