Tez mapper Slow Read from S3

New Contributor
2021-01-29T07:44:33,325  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 100000
2021-01-29T07:44:35,116  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 1000000
2021-01-29T07:46:52,194  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 10000000

The Tez mapper takes more than 2 minutes to read the data when the Parquet file in S3 holds 10 million or more records.
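As a rough check on the numbers, the read rate between consecutive log lines can be worked out from their timestamps. This is only a sketch: the three data points are copied from the log above, the helper names `parse_ts` and `read_rates` are made up for illustration, and it measures one task attempt's log intervals, not S3 throughput as such.

```python
from datetime import datetime

# Timestamps and cumulative "records read" counts copied from the log lines above.
LOG_POINTS = [
    ("2021-01-29T07:44:33,325", 100_000),
    ("2021-01-29T07:44:35,116", 1_000_000),
    ("2021-01-29T07:46:52,194", 10_000_000),
]


def parse_ts(text):
    """Parse a log timestamp such as 2021-01-29T07:44:33,325 (comma before millis)."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S,%f")


def read_rates(points):
    """Return (seconds, records, records_per_second) for each consecutive pair."""
    rates = []
    for (t0, n0), (t1, n1) in zip(points, points[1:]):
        seconds = (parse_ts(t1) - parse_ts(t0)).total_seconds()
        records = n1 - n0
        rates.append((seconds, records, records / seconds))
    return rates


for seconds, records, rate in read_rates(LOG_POINTS):
    print(f"{records:>9,} records in {seconds:8.3f} s -> {rate:12,.0f} records/s")
```

On these three lines, the first 900,000 records arrive in under 2 seconds, while the next 9 million take about 137 seconds, so the per-record rate drops sharply partway through the file.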

 

Any suggestions on how to make this faster?

3 REPLIES

Expert Contributor

@ARP Try increasing fs.s3a.connection.maximum to 1500, and follow this doc for fine-tuning the other S3 parameters.

 

https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hive_on_s3_tuning.html

New Contributor

Thanks @Prakashcit. It looks like the issue is with the timestamp column: we have an INT96 timestamp column with millisecond precision, and reading it is about 10 times slower than reading a Parquet file where the same column is stored as a string, or even as a timestamp with the milliseconds stripped. We are still investigating what is causing this behavior.

 

For example, take a Parquet file with two columns, date_time and value.

Query with the timestamp column including milliseconds:

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... llap SUCCEEDED 10 10 0 0 0 0
Reducer 2 ...... llap SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 179.45 s

 

Query with the timestamp column stored as a string, or as a timestamp without milliseconds:

----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... llap SUCCEEDED 9 9 0 0 0 0
Reducer 2 ...... llap SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 13.77 s
----------------------------------------------------------------------------------------------
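To put a number on the gap between the two runs, the elapsed times can be pulled from the progress lines and divided. This is a small sketch using only the two lines shown above; the function name `elapsed_seconds` is hypothetical.

```python
import re

# Final progress lines copied from the two runs above.
WITH_MILLIS = "VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 179.45 s"
WITHOUT_MILLIS = "VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 13.77 s"


def elapsed_seconds(line):
    """Extract the number of seconds after 'ELAPSED TIME:' from a progress line."""
    match = re.search(r"ELAPSED TIME:\s*([\d.]+)\s*s", line)
    if match is None:
        raise ValueError(f"no elapsed time found in: {line!r}")
    return float(match.group(1))


slowdown = elapsed_seconds(WITH_MILLIS) / elapsed_seconds(WITHOUT_MILLIS)
print(f"slowdown: {slowdown:.1f}x")
```

For this single pair of runs the ratio comes out at roughly 13x, in line with the "about 10 times slower" estimate; a single run of each query is not a benchmark, so treat the figure as indicative only.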

Expert Contributor

@ARP There is a bug tracked for this: https://issues.apache.org/jira/browse/HIVE-24693

 

Please apply these workaround properties and test your jobs again:

 

set hive.parquet.timestamp.time.unit=nanos;

set hive.parquet.write.int64.timestamp=true;