Support Questions

ARP · ‎01-29-2021

2021-01-29T07:44:33,325  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 100000
2021-01-29T07:44:35,116  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 1000000
2021-01-29T07:46:52,194  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 10000000

The Tez mapper takes more than 2 minutes to read the data when the Parquet file in S3 has more than 10 million records.

Any suggestions on how to make it faster?

Prakashcit · ‎02-16-2021

@ARP Try increasing fs.s3a.connection.maximum to 1500, and follow this doc for fine-tuning the other S3 parameters.

https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hive_on_s3_tuning.html
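
For a quick test, the setting can be tried from a Hive session before changing the cluster-wide configuration. This is only a sketch: it assumes your deployment allows fs.s3a.* properties to be overridden at session level. With LLAP the S3A client may already exist in the long-running daemons, so the value may only take effect once it is set in core-site.xml and the services are restarted.

```sql
-- Sketch only: assumes session-level overrides of fs.s3a.* are permitted.
set fs.s3a.connection.maximum=1500;

-- Print the value the session currently sees.
set fs.s3a.connection.maximum;
```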

ARP · ‎02-17-2021

Thanks @Prakashcit. It looks like the issue is with the timestamp column. We have an INT96 timestamp column with millisecond precision, and reading it is about 10 times slower than reading a Parquet file with the same column stored as a string, or even with the milliseconds stripped from the timestamp. We are still investigating what is causing this behavior.

For example, take a Parquet file with 2 columns, date_time and value.

Query with the timestamp column including milliseconds:

----------------------------------------------------------------------------------------------
VERTICES          MODE  STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........  llap  SUCCEEDED     10         10        0        0       0       0
Reducer 2 ......  llap  SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 179.45 s

Query with the timestamp column stored as a string, or as a timestamp without milliseconds:

----------------------------------------------------------------------------------------------
VERTICES          MODE  STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........  llap  SUCCEEDED      9          9        0        0       0       0
Reducer 2 ......  llap  SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 13.77 s
----------------------------------------------------------------------------------------------

Prakashcit · ‎02-22-2021

@ARP There is a bug filed for this: https://issues.apache.org/jira/browse/HIVE-24693

Please use these workaround properties and test your jobs.

set hive.parquet.timestamp.time.unit=nanos;

set hive.parquet.write.int64.timestamp=true;
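
As far as I understand, these two properties only change how Hive writes Parquet timestamps, so files already stored as INT96 would need to be rewritten before reads benefit. A minimal sketch of that, where the table names events_int96 and events_int64 are hypothetical and the column names come from the example earlier in the thread:

```sql
-- Hypothetical tables: events_int96 is the existing INT96 data,
-- events_int64 is a new copy written with INT64 timestamps.
set hive.parquet.timestamp.time.unit=nanos;
set hive.parquet.write.int64.timestamp=true;

CREATE TABLE events_int64
STORED AS PARQUET
AS
SELECT date_time, value
FROM events_int96;
```

After the copy, run the same query against events_int64 and compare the elapsed time with the INT96 table. Note that nanosecond INT64 timestamps cover a narrower date range than INT96; hive.parquet.timestamp.time.unit also accepts micros and millis if that is a concern.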

Cloudera Community

Tez mapper Slow Read from S3