Tez mapper Slow Read from S3

New Contributor
2021-01-29T07:44:33,325  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 100000
2021-01-29T07:44:35,116  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 1000000
2021-01-29T07:46:52,194  INFO [TezTR-849561_1763_2_0_0_0 (1611687849561_1763_2_00_000000_0)] exec.MapOperator: MAP[0]: records read - 10000000

The Tez mapper takes more than 2 minutes to read the data when the Parquet file in S3 holds more than 10 million records.

Any suggestions on how to make it faster?

3 REPLIES

Contributor

@ARP Try increasing fs.s3a.connection.maximum to 1500, and follow this doc for the finer S3 tuning parameters:

 

https://docs.cloudera.com/documentation/enterprise/latest/topics/admin_hive_on_s3_tuning.html
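
As a minimal sketch of this suggestion, the value can be tried from a Hive session before touching cluster-wide configuration. This assumes the cluster lets fs.s3a.* properties be overridden per session. Where the S3A connection pool is created by long-running daemons (LLAP, for example), the property may instead have to go into core-site.xml and need a restart to take effect.

```sql
-- Print the value currently in effect for this session.
SET fs.s3a.connection.maximum;

-- Raise the S3A connection pool size to the value suggested above.
-- Assumption: the override is honored at session level on this cluster.
SET fs.s3a.connection.maximum=1500;
```

Rerunning the slow query in the same session and comparing the elapsed time shows whether the connection pool was the bottleneck at all.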

New Contributor

Thanks @Prakashcit. It looks like the issue is with the timestamp column: we have an INT96-format timestamp column with millisecond precision, and reading it is about 10 times slower than reading a Parquet file with the same column stored as a string, or even as a timestamp with the milliseconds stripped. We are still investigating what is causing this behavior.

 

For example, take a Parquet file with two columns, date_time and value.

Query with the timestamp column with milliseconds:

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      llap     SUCCEEDED     10         10        0        0       0       0
Reducer 2 ......      llap     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 179.45 s

 

Query with the timestamp column stored as a string, or as a timestamp without milliseconds:

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1 ..........      llap     SUCCEEDED      9          9        0        0       0       0
Reducer 2 ......      llap     SUCCEEDED      1          1        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 13.77 s
----------------------------------------------------------------------------------------------
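
For anyone who wants to reproduce the comparison, a rough sketch could look like the following. The table names events_ts_millis and events_ts_string are hypothetical, and the aggregate is only a guess at the query shape; the thread shows neither the real tables nor the real query.

```sql
-- Keep Hive from answering the aggregates from table statistics,
-- so that both queries really scan the Parquet files.
SET hive.compute.query.using.stats=false;

-- Variant 1 (hypothetical table): date_time is a Parquet INT96
-- timestamp with millisecond precision.
SELECT MAX(date_time) FROM events_ts_millis;

-- Variant 2 (hypothetical table): the same rows with date_time
-- stored as a string instead of a timestamp.
CREATE TABLE events_ts_string STORED AS PARQUET AS
SELECT CAST(date_time AS STRING) AS date_time, value
FROM events_ts_millis;

SELECT MAX(date_time) FROM events_ts_string;
```

Both SELECT statements touch the same column over the same number of rows, so a large gap in elapsed time points at timestamp decoding rather than at S3 throughput.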

Contributor

@ARP There is a bug: https://issues.apache.org/jira/browse/HIVE-24693

 

Please use these workaround properties and test your jobs.

 

set hive.parquet.timestamp.time.unit=nanos;

set hive.parquet.write.int64.timestamp=true;
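
A possible way to test the workaround is sketched below. It assumes that both properties only affect Parquet files written after they are set, so the existing INT96 data has to be rewritten before any difference shows up on read. The table names events_ts_millis and events_ts_int64 are hypothetical.

```sql
-- Workaround properties from the reply above.
SET hive.parquet.timestamp.time.unit=nanos;
SET hive.parquet.write.int64.timestamp=true;

-- Assumption: only newly written files pick up the INT64 encoding,
-- so copy the existing data into a new Parquet table.
CREATE TABLE events_ts_int64 STORED AS PARQUET AS
SELECT date_time, value
FROM events_ts_millis;

-- Rerun the slow query against the rewritten table and compare timings.
SELECT MAX(date_time) FROM events_ts_int64;
```

If the rewritten table reads at roughly the speed of the string variant, that suggests the slowdown comes from the INT96 read path.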
