Created 12-20-2016 11:28 PM
Very basic insert statement in Hive
insert into car(model) values ('Honda'), ('Toyota');
is taking 2-3 minutes to complete. Is this normal speed for Hive, or is it too slow?
I tried both Tez and MapReduce as execution engines, results were almost the same.
Should I expect much better performance (like 2-3 seconds instead of minutes) if I increase the machine's memory? The example above was executed on a single-node cluster where the computer had 6 GB.
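For context, each INSERT ... VALUES statement in Hive launches a full Tez/MapReduce job, so per-statement job overhead dominates for tiny inserts. The kind of bulk-load alternative I am comparing against would be staging the rows in a file and loading them in one step (the local path here is just a placeholder):

LOAD DATA LOCAL INPATH '/tmp/cars.csv' INTO TABLE car;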
Created 12-21-2016 05:23 AM
@Dmitry Otblesk could you please try running the query again in Tez mode after setting this parameter in the Hive shell:
set hive.tez.exec.print.summary=true;
and share the summary of execution.
Created 12-21-2016 04:24 PM
I did what you suggested. Below is the output
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: insert into car(model) values (...('Toyota')(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1482331426058_0001)
INFO : Map 1: -/-
INFO : Map 1: 0/1
INFO : Map 1: 0/1
INFO : Map 1: 0/1
INFO : Map 1: 0/1
INFO : Map 1: 0/1
INFO : Map 1: 0/1
INFO : Map 1: 0/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 1/1
INFO : Status: DAG finished successfully in 34.31 seconds
INFO :
INFO : METHOD                 DURATION(ms)
INFO : parse                  45
INFO : semanticAnalyze        3,573
INFO : TezBuildDag            1,482
INFO : TezSubmitToRunningDag  1,433
INFO : TotalPrepTime          51,046
INFO : VERTICES  TOTAL_TASKS  FAILED_ATTEMPTS  KILLED_TASKS  DURATION_SECONDS  CPU_TIME_MILLIS  GC_TIME_MILLIS  INPUT_RECORDS  OUTPUT_RECORDS
INFO : Map 1     1            0                0             14.26             9,290            226             2              0
INFO : org.apache.tez.common.counters.DAGCounter:
INFO :    NUM_SUCCEEDED_TASKS: 1
INFO :    TOTAL_LAUNCHED_TASKS: 1
INFO :    DATA_LOCAL_TASKS: 1
INFO :    AM_CPU_MILLISECONDS: 7040
INFO :    AM_GC_TIME_MILLIS: 19
INFO : File System Counters:
INFO :    HDFS_BYTES_READ: 13
INFO :    HDFS_BYTES_WRITTEN: 80
INFO :    HDFS_READ_OPS: 4
INFO :    HDFS_WRITE_OPS: 3
INFO :    HDFS_OP_CREATE: 2
INFO :    HDFS_OP_GET_FILE_STATUS: 3
INFO :    HDFS_OP_OPEN: 1
INFO :    HDFS_OP_RENAME: 1
INFO : org.apache.tez.common.counters.TaskCounter:
INFO :    GC_TIME_MILLIS: 226
INFO :    CPU_MILLISECONDS: 9290
INFO :    PHYSICAL_MEMORY_BYTES: 79691776
INFO :    VIRTUAL_MEMORY_BYTES: 2038177792
INFO :    COMMITTED_HEAP_BYTES: 79691776
INFO :    INPUT_RECORDS_PROCESSED: 2
INFO :    INPUT_SPLIT_LENGTH_BYTES: 13
INFO :    OUTPUT_RECORDS: 0
INFO : HIVE:
INFO :    CREATED_FILES: 1
INFO :    DESERIALIZE_ERRORS: 0
INFO :    RECORDS_IN_Map_1: 2
INFO :    RECORDS_OUT_1_default.car: 2
INFO : TaskCounter_Map_1_INPUT_values__tmp__table__1:
INFO :    INPUT_RECORDS_PROCESSED: 2
INFO :    INPUT_SPLIT_LENGTH_BYTES: 13
INFO : TaskCounter_Map_1_OUTPUT_out_Map_1:
INFO :    OUTPUT_RECORDS: 0
INFO : Loading data to table default.car from hdfs://lenu.dom.hdp:8020/apps/hive/warehouse/car/.hive-staging_hive_2016-12-21_11-19-49_470_4335548459844470170-1/-ext-10000
INFO : Table default.car stats: [numFiles=6, numRows=15, totalSize=103, rawDataSize=88]
No rows affected (89.956 seconds)
Created 12-25-2016 10:56 AM
The counters show the DAG itself finished in 34.31 seconds, while TotalPrepTime alone was about 51 seconds, so most of the time is spent before the DAG even runs. See if there was a resource issue on the RM side.
Created 12-25-2016 08:26 PM
>see if there was resource issue at RM side.
How can I check that? What exactly should I look at?
Created 12-26-2016 05:03 AM
@Dmitry Otblesk please look into the RM UI after submitting the job.
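If a shell is easier than the UI, the same information can be checked with the standard YARN CLI (output depends on your cluster); the first command shows node capacity and usage, the second shows what is currently holding resources:

yarn node -list
yarn application -list -appStates RUNNING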
Created 12-26-2016 10:50 PM
Log in to the Ambari UI first, then click the YARN link in the left nav bar, then Quick Links, and choose the Resource Manager UI link. You can also go directly to the Resource Manager UI if you know the host and port where the Resource Manager service runs.
You should also take advantage of the Hive Tez View to see all the tasks executed and the time needed for each. While you execute the query, watch the execution in the Resource Manager UI to understand the number of containers per task, resource utilization, etc. If you see a low degree of parallelism while there are still enough resources to instantiate more containers, you have an opportunity to adjust the query to allow more parallelism.
Change the path from the link to match your version of HDP.
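On a 6 GB single-node box it may also be worth checking that the Tez container and AM sizes are small enough for the job to get resources quickly. As an illustrative sketch (the 1024 MB values are assumptions to try, not tuned recommendations), in the Hive shell:

set hive.tez.container.size=1024;
set tez.am.resource.memory.mb=1024;

Also, since your log shows "Tez session hasn't been created yet. Opening session", part of the delay is session startup itself; pre-initializing Tez sessions in HiveServer2 (hive.server2.tez.initialize.default.sessions=true in hive-site.xml) can remove that cost from the first query.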