Created 01-29-2021 08:59 AM
2021-01-29T07:44:33,325 INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 100000
2021-01-29T07:44:35,116 INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 1000000
2021-01-29T07:46:52,194 INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 10000000
The Tez mapper takes more than 2 minutes to read the data when the Parquet file in S3 holds more than 10 million records.
Any suggestions on how to make it faster?
Created 02-16-2021 02:22 AM
@ARP Try increasing fs.s3a.connection.maximum to 1500 and follow this doc for the S3 fine-tuning parameters:
https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hive_on_s3_tuning.html
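As a rough sketch, the setting could be applied at the session level like this (assuming fs.s3a.* keys are not blocked by hive.conf.restricted.list on your cluster; otherwise put the same value in core-site.xml or the cluster's S3A configuration). The second key is just an illustrative companion S3A property, not a value from this thread.
-- Sketch only: raise the S3A connection pool for this session before running the query.
set fs.s3a.connection.maximum=1500;
-- Illustrative companion knob (a real S3A property; the value is only an example):
set fs.s3a.threads.max=100;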
Created on 02-17-2021 09:27 AM - edited 02-17-2021 10:25 AM
Thanks @Prakashcit. It looks like the issue is with the timestamp column. We have an int96-format timestamp column with millisecond precision, and performance is 10 times slower compared to a Parquet file with the same column stored as a string, or even with the milliseconds stripped from the timestamp column. We are still investigating what's causing this behavior.
For example, a Parquet file with 2 columns, date_time and value:
Query with the timestamp column including milliseconds:
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... llap SUCCEEDED 10 10 0 0 0 0
Reducer 2 ...... llap SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 179.45 s
Query with the timestamp column stored as a string, or with the timestamp column without milliseconds:
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... llap SUCCEEDED 9 9 0 0 0 0
Reducer 2 ...... llap SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 13.77 s
----------------------------------------------------------------------------------------------
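For reference, a comparison along these lines can be reproduced with something like the queries below; the table names (events_ts, events_str) are hypothetical stand-ins for the data described above, one with date_time written as an int96 timestamp with milliseconds and one with the same column written as a string.
-- Hypothetical tables illustrating the two cases measured above.
SELECT min(date_time), max(date_time), sum(value) FROM events_ts;   -- int96 timestamp, the slow path (~180 s above)
SELECT min(date_time), max(date_time), sum(value) FROM events_str;  -- string column, the fast path (~14 s above)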
Created 02-22-2021 01:48 AM
@ARP There is a bug: https://issues.apache.org/jira/browse/HIVE-24693
Kindly use these workaround properties and test your jobs:
set hive.parquet.timestamp.time.unit=nanos;
set hive.parquet.write.int64.timestamp=true;
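If I understand these settings correctly, they change how Hive writes Parquet timestamps (int64 with the given time unit instead of int96), so they only help for data rewritten under them. A rough sketch of rewriting an affected table, reusing the hypothetical names from above:
set hive.parquet.timestamp.time.unit=nanos;
set hive.parquet.write.int64.timestamp=true;
-- Rewrite the data so the timestamp column is stored as int64
-- (hypothetical table names; adjust to your own schema and partitioning).
CREATE TABLE events_ts_int64 STORED AS PARQUET AS
SELECT date_time, value FROM events_ts;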