Here are the Requirements:
Total Data Size - Uncompressed: 13.5 TB; Compressed: 2 TB
A large virtual fact table: a view containing a UNION ALL of 3 large tables, 11 billion records in total.
Another view takes that large virtual fact table and applies consecutive Left Outer Joins to 8 dimension tables, so that the result is always the full 11 billion records (both views are sketched just after this list).
The fact data carries a timestamp column that can be used to filter rows.
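To make the workload concrete, here is a minimal HiveQL sketch of what those two views and a timestamp filter might look like. All table, view, and column names (fact_2014, dim_customer, event_ts, and so on) are hypothetical placeholders rather than anything from the actual requirements, and only 2 of the 8 dimension joins are shown:

-- A view that UNION ALLs the three large fact tables (~11 billion rows combined).
CREATE VIEW fact_all AS
SELECT * FROM fact_2014
UNION ALL
SELECT * FROM fact_2015
UNION ALL
SELECT * FROM fact_2016;

-- A second view that LEFT OUTER JOINs the virtual fact table to the dimension
-- tables; every fact row is preserved, so the result is always ~11 billion records.
CREATE VIEW fact_denormalized AS
SELECT f.*,
       c.customer_name,
       p.product_name
FROM fact_all f
LEFT OUTER JOIN dim_customer c ON f.customer_id = c.customer_id
LEFT OUTER JOIN dim_product  p ON f.product_id  = p.product_id;

-- The timestamp column lets a query restrict how many rows it actually touches.
SELECT COUNT(*)
FROM fact_denormalized
WHERE event_ts >= '2016-01-01' AND event_ts < '2016-02-01';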
Suppose you were given the following setup. How would you begin configuring Hortonworks for Hive? Would you focus on storage? How would we configure for compute?
Let's assume:
Platform: AWS
Data Node Instance: r3.4xlarge
Cores: 16
RAM: 122 GB
EBS Storage: 2 x 1TB Disks
So where do we begin?
First, Some Quick Calculations:
Memory per Core: 122 GB / 16 = 7.625 GB; approximately 8 GB per CPU core.
This means our largest practical container size per core is 8 GB.
However, we should not reserve all 16 cores for Hadoop; some cores are needed for the OS and other processes.
Let's assume 14 cores are reserved for YARN.
Memory Allocated for All YARN Containers on a Node = No. of Cores Reserved for YARN x Memory per Core
114,688 MB = 14 x 8,192 MB (8 x 1,024 MB)
Note also:
At 8 GB per container, we can run 14 tasks (mappers or reducers) in parallel, one per core, without wasting RAM. We can certainly run container sizes smaller than 8 GB if we wish.
Since our optimal container size per core is 8 GB, our YARN minimum container size should be a factor of 8 GB to prevent wasted memory, that is: 1, 2, 4, or 8 GB.
However, the Tez container size for Hive must be a multiple of the YARN minimum container size, as in the sketch below.
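Here is a sketch of how those numbers could map onto actual settings, assuming a 4 GB YARN minimum container size. The cluster-level YARN values are normally set in yarn-site.xml through Ambari (shown only as comments below), while the Hive/Tez values can be set per session:

-- Cluster-level YARN settings (yarn-site.xml via Ambari), shown as comments because
-- they cannot be changed from a Hive session:
--   yarn.nodemanager.resource.memory-mb  = 114688   (14 cores x 8192 MB)
--   yarn.nodemanager.resource.cpu-vcores = 14
--   yarn.scheduler.minimum-allocation-mb = 4096     (assumed 4 GB minimum, a factor of 8 GB)

-- Hive/Tez session settings: the container size is a multiple of the 4096 MB minimum
-- and matches the roughly 8 GB of memory available per reserved core.
SET hive.tez.container.size=8192;
SET hive.tez.java.opts=-Xmx6554m;   -- about 80% of the container reserved for the JVM heap

With these values, 14 such 8 GB containers fit exactly into the 114,688 MB that each node offers to YARN.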