Support Questions

Find answers, ask questions, and share your expertise

How does hive decide on the insert query plan

avatar
New Contributor

We're working with Hive 1.3.1, and running an INSERT statement to upload data into Hive from an external table.

I noticed that the execution plan has changed on the same table from yesterday compared to today.

Yesterday the plan resulted in a M/R job with 341 mappers and 359 reducers, while today the plan resulted in a M/R job with only mappers and no reducers

This is the query:

explain insert OVERWRITE table managed_table PARTITION(col1) select [columns] from external_table

How does hive decide how to execute the query?

How does it translate the insert select into map reduce?

What would cause a plan to change?

This is the first plan (omitting columns lists because the table has over 300 columns)

<code>STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 depends on stages: Stage-1

  Stage-2 depends on stages: Stage-0

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan
            alias: externalevents
            Statistics: Num rows: 479391 Data size: 91824480256 Basic stats: COMPLETE Column stats: NONE
            Select Operator  [columns] outputColumnNames:  [columns]  Statistics: Num rows: 479391 Data size: 91824480256 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col394 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col394 (type: bigint)
                Statistics: Num rows: 479391 Data size: 91824480256 Basic stats: COMPLETE Column stats: NONE
                value expressions: [columns]  Reduce Operator Tree:
        Select Operator
          expressions: [columns]  outputColumnNames: [columns] Statistics: Num rows: 479391 Data size: 91824480256 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 479391 Data size: 91824480256 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: ceazip.events_test_hive

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            evtf_first_date_id
          replace: true
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: ceazip.events_test_hive

  Stage: Stage-2
    Stats-Aggr Operator
<code>And the 2nd plan:
<code>STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: events_2017_02_21_11_13_18
            Statistics: Num rows: 468680 Data size: 89772957696 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: [columns] 
outputColumnNames: [columns] 
 Statistics: Num rows: 468680 Data size: 89772957696 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 468680 Data size: 89772957696 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                    output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                    serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                    name: default.events_test1
  Stage: Stage-7
    Conditional Operator
  Stage: Stage-4
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://...../apps/hive/warehouse/events_test1/.hive-staging_hive_2017-03-01_07-35-18_776_4958999242494325333-1/-ext-10000
  Stage: Stage-0
    Move Operator
      tables:
          partition:
            evtf_first_date_id
          replace: true
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.events_test1
  Stage: Stage-2
    Stats-Aggr Operator
  Stage: Stage-3
    Merge File Operator
      Map Operator Tree:
          ORC File Merge Operator
      merge level: stripe
      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
  Stage: Stage-5
    Merge File Operator
      Map Operator Tree:
          ORC File Merge Operator
      merge level: stripe
      input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
  Stage: Stage-6
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://isr-r0-aps-nam-1.lab.il.nice.com:8020/apps/hive/warehouse/events_test1/.hive-staging_hive_2017-03-01_07-35-18_776_4958999242494325333-1/-ext-10000
Thanks,
Lior
1 ACCEPTED SOLUTION

avatar
Super Guru

@Lior Hadaya

CBO (cost based optimizer) and statistics collected on your tables.

You may have the settings mentioned here set to true: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_performance_tuning/content/hive_perf_bes...

As such, the behavior can change over time. You could also force stats on a specific table or even column.

View solution in original post

1 REPLY 1

avatar
Super Guru

@Lior Hadaya

CBO (cost based optimizer) and statistics collected on your tables.

You may have the settings mentioned here set to true: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_performance_tuning/content/hive_perf_bes...

As such, the behavior can change over time. You could also force stats on a specific table or even column.