Here are the Requirements:

  1. Total data size: 13.5 TB uncompressed; 2 TB compressed
  2. A large virtual fact table: a view containing a UNION ALL of 3 large tables, roughly 11 billion records in total
  3. Another view built on top of the large virtual fact table, with consecutive LEFT OUTER JOINs on 8 dimension tables, so that the result is always 11 billion records no matter how the joins match (a hedged sketch of both views follows this list)
  4. There is timestamp data that you can use to filter rows by
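
To make requirements 2-4 concrete, here is a minimal HiveQL sketch; all table and column names (fact_table_1, dim_1, event_ts, and so on) are hypothetical, since the original requirements do not name them.

```sql
-- Hypothetical sketch of requirement 2: a virtual fact table built as a view
-- that UNION ALLs three large base tables (~11 billion rows in total).
CREATE VIEW fact_union AS
SELECT * FROM (
    SELECT * FROM fact_table_1
    UNION ALL
    SELECT * FROM fact_table_2
    UNION ALL
    SELECT * FROM fact_table_3
) u;

-- Hypothetical sketch of requirement 3: consecutive LEFT OUTER JOINs on
-- 8 dimension tables, so the row count stays at ~11 billion no matter how
-- the joins match. Only two joins are shown; dim_3 through dim_8 would
-- follow the same pattern.
CREATE VIEW fact_denormalized AS
SELECT f.*, d1.attr_1, d2.attr_2
FROM fact_union f
LEFT OUTER JOIN dim_1 d1 ON f.dim_1_key = d1.dim_1_key
LEFT OUTER JOIN dim_2 d2 ON f.dim_2_key = d2.dim_2_key;

-- Requirement 4: a timestamp column can be used to prune rows, for example:
-- SELECT ... FROM fact_denormalized WHERE event_ts >= '2016-01-01';
```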

Suppose you were given the requirements above. How would you begin configuring Hortonworks for Hive? Would you focus on storage? How would you configure for compute?

Let's assume:

  1. Platform: AWS
  2. Data Node Instance: r3.4xlarge
  3. Cores: 16
  4. RAM: 122 GB
  5. EBS Storage: 2 x 1 TB disks

So where do we begin?

First, some quick calculations:

Memory per core: 122 GB / 16 = 7.625 GB; approximately 8 GB per CPU core.

This means our largest container size per core on each node is 8 GB.

However, we should not allocate all 16 cores to YARN; some cores are needed for the OS and other processes.

Let's assume 14 cores are reserved for YARN.

Memory allocated for all YARN containers on a node = number of virtual cores x memory per core

114,688 MB = 14 x 8,192 MB (8 x 1,024 MB)
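
As a hedged illustration of where this number lands, the two NodeManager properties below (set in yarn-site.xml, or via Ambari on HDP) would reflect this calculation; the values are derived from the arithmetic above, not taken from the screenshots further down.

```
# Assumed values, derived from the calculation above
yarn.nodemanager.resource.memory-mb = 114688   # 14 vcores x 8192 MB per core
yarn.nodemanager.resource.cpu-vcores = 14      # 16 physical cores minus 2 reserved for the OS and other processes
```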

Note also:

  1. At 8 GB per container, we can run 14 tasks (mappers or reducers) in parallel, one per core, without wasting RAM. We can certainly run container sizes smaller than 8 GB if we wish.
  2. Since our optimal container size per core is 8 GB, the YARN minimum container size must be a factor of 8 GB to prevent wasting memory, that is: 1, 2, 4, or 8 GB.
  3. However, the Tez container size for Hive must be a multiple of the YARN minimum container size (a hedged example follows this list).
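
A minimal sketch of how these constraints could line up, assuming we stick with an 8 GB container; these are illustrative values consistent with the calculation above, not the exact values from the screenshots below.

```
# Assumed example values illustrating the factor / multiple relationship
yarn.scheduler.minimum-allocation-mb = 8192    # a factor of the 8 GB per-core budget (1, 2, 4 or 8 GB all work)
yarn.scheduler.maximum-allocation-mb = 114688  # no single container can exceed the node's total YARN memory
hive.tez.container.size = 8192                 # a multiple (here 1x) of yarn.scheduler.minimum-allocation-mb
```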

Memory Settings

YARN

[Screenshot: YARN memory settings in Ambari]

Hive

[Screenshot: Hive memory settings in Ambari]

TEZ

[Screenshot: Tez memory settings in Ambari]
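
Since the screenshot values are not reproduced here, the block below is only an assumed example of Tez memory settings that would be consistent with an 8 GB container; tune them to your own workload.

```
# Assumed example values, consistent with the 8 GB container size derived above
tez.am.resource.memory.mb = 8192     # memory for the Tez ApplicationMaster
tez.task.resource.memory.mb = 8192   # memory per Tez task container
tez.runtime.io.sort.mb = 2048        # sort buffer, a fraction of the task container memory
```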

Running Application

[Screenshot: running application]

Error
