Hive query extremely slow on very simple data

I've just started learning Hive, and I built a lab with a few VMs on a test server:

  • CPU Logical processors 8
  • Processor type Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
  • Sockets 2 (2 physical processors)
  • Cores per socket 4
  • Hyperthreading No
  • Memory 31.96 GB (2x16GB DDR3)

On this server I virtualized a few machines with the following roles:

(Screenshots: the VMs on the ESXi 6.5 server, and the role assigned to each host. Each node carries one of the roles shown.)

The cluster health is not the best, but I think it's OK for a test lab.

The only service installed separately was Apache Superset (I didn't find anything similar in CDH).

I imported a test CSV with 170k rows and 20 columns, but it was too heavy to build any chart from directly.
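For context, the raw table is just that CSV behind a Hive table, roughly like the sketch below (only cod and total appear in my queries; the remaining 18 columns are omitted, and the exact DDL may differ):

CREATE EXTERNAL TABLE analysis.raw_table (
  cod   VARCHAR(100),
  total DOUBLE
  -- ...plus 18 more columns omitted here
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;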

So I split it into 3 tables, plus some helper tables, such as one holding the totals ordered by value:

 

CREATE TABLE analysis.totalByValue (cod VARCHAR(100), total_value DOUBLE)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");  -- "orc.compress" is the ORC compression property

INSERT INTO analysis.totalByValue
SELECT cod, SUM(total) AS total_value
FROM analysis.raw_table
GROUP BY cod;

CREATE TABLE analysis.top10 (cod VARCHAR(100), total_value DOUBLE)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");


INSERT INTO analysis.top10
SELECT * FROM analysis.totalByValue
ORDER BY total_value DESC
LIMIT 10;

 

OK, now I have the top-10 list: a table with 2 columns and 10 records, and my headache 😉
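For anyone reproducing this, the table metadata can be double-checked with:

DESCRIBE FORMATTED analysis.top10;   -- shows the storage location and, once statistics exist, numRows/totalSize
SHOW TBLPROPERTIES analysis.top10;   -- confirms the ORC table properties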

 

--- Simple select: OK ---
SELECT * FROM analysis.top10;
--- QUERY TIME: 00:00:00.15 ---

--- A count of 10 rows in 1 minute and 45 seconds? (see the note after this block) ---
SELECT COUNT(*) FROM analysis.top10;
--- QUERY TIME: 00:01:45.2 ---

--- 1 extra second just for the aliases ---
SELECT cod AS col1, total_value AS col2 FROM analysis.top10;
--- QUERY TIME: 00:00:01.67 ---

--- More complex query, but the same table with 10 rows and 2 columns ---
SELECT `cod` AS `cod`,
       SUM(`total_value`) AS `SUM(total_value)`
FROM `analysis`.`top10`
GROUP BY `cod`
ORDER BY `SUM(total_value)` DESC
LIMIT 10;
--- QUERY TIME: 00:02:37.45 ---
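What puzzles me most is the COUNT(*): with hive.compute.query.using.stats set to true, I would expect the count to be answered straight from table statistics, yet the explain plans below all report "Column stats: NONE". If the statistics are simply missing, this sketch is what I understand should populate them:

ANALYZE TABLE analysis.top10 COMPUTE STATISTICS;              -- basic table stats: row count, raw data size
ANALYZE TABLE analysis.top10 COMPUTE STATISTICS FOR COLUMNS;  -- column-level stats, used by the CBO
SELECT COUNT(*) FROM analysis.top10;                          -- should then be served from the metastore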

 

The last and most complex query is generated by the "Distribution - Bar Chart" view of Apache Superset, and it takes so long to finish that any chart on this data is impractical.

The timings are the same whether I run the queries from Beeline, the Hive CLI, Hue, or any other client.

Here is the info about the Spark jobs that I found:

(Screenshot: DAG Visualization.) Unfortunately, I could not avoid the GROUP BY and ORDER BY, because that is how Apache Superset generates the chart query; the hand-written equivalent would be much simpler, as sketched below.
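Since analysis.top10 already holds exactly one row per cod, the SUM/GROUP BY is effectively a no-op, and this is the query I would write by hand:

-- logically equivalent for this table: each cod appears once,
-- so SUM(total_value) per cod is just total_value
SELECT cod, total_value
FROM analysis.top10
ORDER BY total_value DESC
LIMIT 10;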

(Screenshot: Event Timeline.) The one thing I noticed in the Event Timeline is an 18-second gap between the driver executor being added and the "Executor added" event completing, then another 11 seconds before the Map stage starts.
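As far as I understand, Hive on Spark launches a Spark application for the session on the first query, so part of that gap may just be spark-submit and executor startup. A quick way to test this theory is to run the same statement twice in one Beeline session:

-- first run pays the Spark session / executor startup cost
SELECT COUNT(*) FROM analysis.top10;
-- second run reuses the already-running Spark application;
-- if it is much faster, the overhead is session startup, not the query itself
SELECT COUNT(*) FROM analysis.top10;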

And stage descriptions:

  • Map 1 Explain Plan:

 

  Map Operator Tree:
      TableScan
        alias: top10
        Statistics: Num rows: 10 Data size: 186 Basic stats: COMPLETE Column stats: NONE
        Select Operator
          expressions: cod (type: string), total_value (type: double)
          outputColumnNames: cod, total_value
          Statistics: Num rows: 10 Data size: 186 Basic stats: COMPLETE Column stats: NONE
          Group By Operator
            aggregations: sum(total_value)
            keys: cod (type: string)
            mode: hash
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 10 Data size: 186 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: _col0 (type: string)
              sort order: +
              Map-reduce partition columns: _col0 (type: string)
              Statistics: Num rows: 10 Data size: 186 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col1 (type: double)

 

Time: 16s

  • Reducer 2 Explain Plan:

 

Execution mode: vectorized
  Reduce Operator Tree:
    Group By Operator
      aggregations: sum(VALUE._col0)
      keys: KEY._col0 (type: string)
      mode: mergepartial
      outputColumnNames: _col0, _col1
      Statistics: Num rows: 5 Data size: 93 Basic stats: COMPLETE Column stats: NONE
      Reduce Output Operator
        key expressions: _col1 (type: double)
        sort order: -
        Statistics: Num rows: 5 Data size: 93 Basic stats: COMPLETE Column stats: NONE
        TopN Hash Memory Usage: 0.1
        value expressions: _col0 (type: string)

 

Time: 1s

  • Reducer 3 Explain Plan:

 

 Execution mode: vectorized
  Reduce Operator Tree:
    Select Operator
      expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: double)
      outputColumnNames: _col0, _col1
      Statistics: Num rows: 5 Data size: 93 Basic stats: COMPLETE Column stats: NONE
      Limit
        Number of rows: 1000
        Statistics: Num rows: 5 Data size: 93 Basic stats: COMPLETE Column stats: NONE
        File Output Operator
          compressed: false
          Statistics: Num rows: 5 Data size: 93 Basic stats: COMPLETE Column stats: NONE
          table:
              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 

Time: 1s

I'm currently using Spark as the execution engine, but the result is no different when I switch to MapReduce; a per-session way to compare is sketched below.
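For anyone reproducing the comparison, the engine can be overridden per session:

SET hive.execution.engine=mr;     -- fall back to MapReduce for this session only
SELECT COUNT(*) FROM analysis.top10;
SET hive.execution.engine=spark;  -- back to Hive on Spark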

These are the non-default settings I have for Hive and Spark:

 

hive.cluster.delegation.token.store.class org.apache.hadoop.hive.thrift.DBTokenStore

hive.execution.engine Spark
hive.spark.job.monitor.timeout 9360s
hive.spark.client.connect.timeout 360000ms
hive.spark.client.server.connect.timeout 360000ms
hive.vectorized.execution true
hive.exec.parallel true
hive.exec.parallel.thread.number 8
hive.driver.parallel.compilation  true
hive.optimize.bucketmapjoin.sortedmerge true
hive.compute.query.using.stats true
hive.cbo.enable true
spark.executor.cores 2
hive.spark.dynamic.partition.pruning.map.join.only true
Java Heap Size of Hive Metastore Server in Bytes: 2 GB (test server with small data)
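One doubt I have is whether these values actually take effect in my sessions; in Beeline, SET with just a property name prints the effective value:

SET hive.execution.engine;              -- expect spark
SET hive.compute.query.using.stats;     -- expect true
SET hive.vectorized.execution.enabled;  -- expect true (note the full property name)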

 

I believe it is not only the hardware causing the slow performance; I'm probably doing something else wrong. The starting VM setup is very simple, so I used it to build more complex scenarios to test and learn.

I'd appreciate any help.
BR
