I have set up a Hortonworks 2.6 cluster with Ambari on 4 nodes. Nodes 1 and 2 are the masters, with 4 and 8 cores and 8 and 16 GB of RAM respectively, and the 2 slaves have the same characteristics.
I want to run a TPC-H benchmark with 10 GB of data on Hive on Tez, and for this I am using hive-testbench (http://blog.moserit.com/benchmarking-hive). By default it creates the Hive tables in ORC format, but I wanted to generate them in Parquet (with the command FORMAT=parquet ./tpcds-setup.sh 10). When I ran it with the default ORC format, it created a database named tpch_flat_orc_10. When I ran it with Parquet, the Hive database it created had the same name as the one from the ORC run. Shouldn't the database be named something like tpch_flat_parquet_10?
Another question: when assigning masters and slaves in Ambari, I stuck with the recommended defaults, with the NameNode on node 1 and the SNameNode on node 2. Nodes 3 and 4 are the slaves, each running one DataNode. But my MapReduce jobs are too slow (a job freezes for a few seconds and then resumes). For better performance, should I add DataNode instances to master nodes 1 and 2? Or is it a configuration/memory allocation problem?
These are my current configuration parameters in Ambari:
For HDFS:
-NameNode Java heap size: 1 GB
-DataNode maximum Java heap size: 1 GB
-DataNode max data transfer threads: 4096
-NameNode Server threads
For YARN:
-ResourceManager and NodeManager Java heap size: 1024 MB
-Memory allocated for all YARN containers on a node: 12288 MB
-Maximum Container Size (Memory): 4096 MB
For MapReduce 2:
-Map Memory: 2 GB
-Sort Allocation Memory: 2 GB
For Hive:
-Tez Container Size: 2048 MB
-For Map Join, per Map memory threshold: 546.1 MB
-Data per Reducer: 64 MB
-Default ORC Stripe Size: 64 MB
-Client Heap Size: 1024 MB
-Metastore Heap Size: 1024 MB
-HiveServer2 Heap Size: 1024 MB
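To make the memory figures above easier to compare, here is a minimal Python sketch of the arithmetic. It only does the sums and does not say what the right values are. It assumes that the unitless Ambari values (12288 and 4096) are megabytes and that the two slaves have 8 GB and 16 GB of RAM like the masters. The function and variable names are made up for illustration.

```python
# Rough arithmetic on the memory settings listed above (all values in MB).
# Assumption: the unitless Ambari values (12288, 4096) are megabytes.
YARN_MEMORY_PER_NODE_MB = 12288  # memory allocated for all YARN containers on a node
MAX_CONTAINER_MB = 4096          # maximum container size
TEZ_CONTAINER_MB = 2048          # Tez container size
MAP_MEMORY_MB = 2048             # MapReduce map memory

# Assumption: the two slaves have 8 GB and 16 GB of RAM, like the masters.
SLAVE_RAM_MB = {"slave with 8 GB": 8 * 1024, "slave with 16 GB": 16 * 1024}


def containers_per_node(yarn_memory_mb, container_mb):
    """How many containers of the given size fit in one node's YARN memory."""
    return yarn_memory_mb // container_mb


def yarn_exceeds_ram(yarn_memory_mb, physical_ram_mb):
    """True if YARN may hand out more memory than the node physically has."""
    return yarn_memory_mb > physical_ram_mb


if __name__ == "__main__":
    print("Tez containers per node:",
          containers_per_node(YARN_MEMORY_PER_NODE_MB, TEZ_CONTAINER_MB))
    print("Map containers per node:",
          containers_per_node(YARN_MEMORY_PER_NODE_MB, MAP_MEMORY_MB))
    print("Max-size containers per node:",
          containers_per_node(YARN_MEMORY_PER_NODE_MB, MAX_CONTAINER_MB))
    for name, ram_mb in SLAVE_RAM_MB.items():
        print(name, "-> YARN memory exceeds physical RAM:",
              yarn_exceeds_ram(YARN_MEMORY_PER_NODE_MB, ram_mb))
```

With the values listed, 12288 MB of YARN memory holds six 2048 MB containers per node, and 12288 MB is more than the 8192 MB of physical RAM on an 8 GB machine.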
I'm sorry for the long post. Can you give me some hints to improve the performance of my MapReduce jobs?