How to create tables in Parquet files? Why the bad performance on Hive?


Greetings,

I have configured a Hortonworks 2.6 cluster with Ambari, with 4 nodes: nodes 1 and 2 are masters with 4 and 8 cores and 8 and 16 GB of RAM respectively, and the 2 slaves have the same characteristics. Now I want to run a TPC-H benchmark on 10 GB of data on Hive with Tez, using hive-testbench (http://blog.moserit.com/benchmarking-hive). By default this creates the Hive tables in ORC format, but I wanted to generate them in Parquet (with the command FORMAT=parquet ./tpch-setup.sh 10). The default ORC run created a database named tpch_flat_orc_10, but the Parquet run creates a Hive database with that same name. Shouldn't the database be named something like tpch_flat_parquet_10?
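To make the question concrete, here is how I've been checking what the script actually produced, and the kind of manual workaround I'm considering if it really does reuse the ORC database name. These are plain Hive CLI commands; lineitem is just one of the TPC-H tables, and tpch_flat_parquet_10 is a name I made up, not something the script creates:

# Check whether the tables in the regenerated database are actually
# Parquet-backed despite the "orc" in the database name:
hive -e "DESCRIBE FORMATTED tpch_flat_orc_10.lineitem;" | grep -i -E 'InputFormat|SerDe'

# Possible workaround: copy a table into an explicitly named Parquet
# database with a CTAS, so the format is visible in the name:
hive -e "
CREATE DATABASE IF NOT EXISTS tpch_flat_parquet_10;
CREATE TABLE tpch_flat_parquet_10.lineitem
  STORED AS PARQUET
  AS SELECT * FROM tpch_flat_orc_10.lineitem;
"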

Another question: when configuring the Ambari masters and slaves, I stuck with the recommended defaults, with the NameNode on node 1 and the SNameNode on node 2; nodes 3 and 4 were considered slaves, each one running a DataNode. But my MapReduce jobs are too slow (time freezes for some seconds and then resumes). For better performance, should I add DataNode instances to master nodes 1 and 2? Or is it a configuration/memory allocation problem?
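For what it's worth, this is how I've been looking at what YARN thinks each node can offer while a job runs; these are standard YARN CLI commands, and the node address in the second one is only an example taken from the output of the first:

# List NodeManagers and how many containers each one is running:
yarn node -list -all

# Show memory/vcore usage for one node (the Node-Id comes from the
# listing above; this address is only an example):
yarn node -status node3.example.com:45454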

In terms of configuration parameters in Ambari, I have the following (there is an example of overriding some of these per session after the lists):

For HDFS:

- NameNode Java heap size: 1 GB
- DataNode maximum Java heap size: 1 GB
- DataNode max data transfer threads: 4096
- NameNode Server threads

For YARN:

- ResourceManager and NodeManager Java heap size: 1024 MB
- Memory allocated for all YARN containers on a node: 12288 MB
- Maximum Container Size (Memory): 4096 MB

For MapReduce 2:

- Map Memory: 2 GB
- Reduce Memory: 2 GB
- AppMaster Memory: 2 GB
- Sort Allocation Memory: 2 GB

For Hive:

- Tez Container Size: 2048 MB
- For Map Join, per Map memory threshold: 546.1 MB
- Data per Reducer: 64 MB
- Default ORC Stripe Size: 64 MB
- Client Heap Size: 1024 MB
- Metastore Heap Size: 1024 MB
- HiveServer2 Heap Size: 1024 MB
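
One thing I worked out from the numbers above: with 12288 MB allocated for containers per node and a 4096 MB maximum container size, at most 3 of the biggest containers fit on a node at once, and my 2048 MB Tez containers give 6 per node. Here is the kind of per-session override I've been experimenting with; hive.tez.container.size and hive.tez.java.opts are standard Hive properties, but the values are just my guesses, not tuned recommendations:

# Run one query with a larger Tez container; the Java heap is kept at
# roughly 80% of the container size (3276m is about 80% of 4096 MB):
hive --hiveconf hive.tez.container.size=4096 \
     --hiveconf hive.tez.java.opts=-Xmx3276m \
     -e "USE tpch_flat_orc_10; SELECT COUNT(*) FROM lineitem;"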

I'm sorry for the long post. Can you give me some hints to improve the performance of my MapReduce jobs?

Thanks
