Reply
Explorer
Posts: 11
Registered: ‎06-21-2017

Impala hit-ratio and profile output inconsitent

We have enabled HDFS caching for our impala tables, however the impala-server.io.mgr.cached-file-handles-hit-ratio is Last (of ): 1. Min: , max: , avg: 0.92 which I beleive implies around 92% of requests are coming from the HDFS cachce, however this does not correlate with the profile as the BytesReadDataNodeCache is zero.

 

How can I deteremine if a given query is actually hitting the cache? and as a given table is showing as cached why is the BytesReadDataNodeCache=0 ?

 

Impala:

Impalad version 2.7.0-cdh5.10.0 RELEASE (build 785a073cd07e2540d521ecebb8b38161ccbd2aa2)

 

 

Impala settings:

IMPALA_SERVER_ARGS=" \

    -log_dir=${IMPALA_LOG_DIR} \

    -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \

    -state_store_port=${IMPALA_STATE_STORE_PORT} \

    -use_statestore \

    -state_store_host=${IMPALA_STATE_STORE_HOST} \

    -max_cached_file_handles=10000 \

    -num_threads_per_disk=10 \

    -num_threads_per_core=10 \

    -num_disks=1 \

    -be_port=${IMPALA_BACKEND_PORT}"

 

HDFS settings:

<property>   

   <name>dfs.datanode.max.locked.memory</name>

    <value>1073741824</value>

  </property>

 

Query: show table stats latest_cached

+--------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+

| #Rows  | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                   |

+--------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+

| 474477 | 1      | 6.68MB | 6.68MB       | 1                 | PARQUET | false             | hdfs://namenode.ducolabs.co:9000/user/hive/warehouse/demo.db/latest_cached |

+--------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+

Fetched 1 row(s) in 0.11s

 

root@cff9e8ba4e88:/# sudo -u hdfs hdfs cacheadmin -listDirectives  -stats

Found 12 entries

ID POOL           REPL EXPIRY  PATH                                                                 BYTES_NEEDED  BYTES_CACHED  FILES_NEEDED  FILES_CACHED

 

12 one_gig_pool      1 never   /user/hive/warehouse/demo.db/latest_cached                                7008031       7008031             1             1

 

query:

SELECT

COUNT((CASE WHEN ((`latest_cached`.`status` = 'Partial match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `temp_calculation_792000225389371393__1012567293__0_`,

COUNT(DISTINCT (CASE WHEN ((`latest_cached`.`status` <> 'Match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `temp_calculation_792000225390415875__3425135312__0_`,

COUNT((CASE WHEN (`latest_cached`.`roll_up_relationship` <> 'Constituent') THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `temp_calculation_792000225390415875__3848799198__0_`,

COUNT((CASE WHEN ((`latest_cached`.`status` <> 'Match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `temp_temp_tc___1632483046__0___3041165593__0_`,

COUNT(`latest_cached`.`record_id`) AS `temp_temp_tc___1632483046__0___977950011__0_`,

COUNT((CASE WHEN ((`latest_cached`.`input` IN ('External','AWAY', 'Away', 'ENSO', 'EXT', 'POS CDS CITI ')) AND (`latest_cached`.`roll_up_relationship` <> 'Constituent') AND (`latest_cached`.`status` = 'Unmatched')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `cnt_calculation_548453999699308545_1_ok`,

COUNT((CASE WHEN ((`latest_cached`.`input` IN ('Internal','HOME', 'HOME ', 'Home', 'MAN', 'MAN CITI CDS')) AND (`latest_cached`.`roll_up_relationship` <> 'Constituent') AND (`latest_cached`.`status` = 'Unmatched')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `cnt_calculation_548453999699619842_1_ok`,

COUNT((CASE WHEN ((`latest_cached`.`status` = 'Manual match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`record_id` ELSE CAST(NULL AS INT) END)) AS `cnt_calculation_548453999700000772_1_ok`,

COUNT(DISTINCT (CASE WHEN ((`latest_cached`.`status` <> 'Match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `ctd_calculation_548453999698743296_1_ok`,

`latest_cached`.`process_id` AS `process_id`, TRUNC(`latest_cached`.`update_timestamp`, 'DD') AS `tdy_update_timestamp_ok` FROM `latest_cached` `latest_cached`

GROUP BY 10,

 

Profile:

 

 

           - BytesRead: 4.17 MB (4373047)

           - BytesReadDataNodeCache: 0

      

Cloudera Employee
Posts: 385
Registered: ‎07-29-2015

Re: Impala hit-ratio and profile output inconsitent

impala-server.io.mgr.cached-file-handles* is actually reporting the number of hits in the HDFS file handle cache, which is a different cache from the HDFS data cache. The file handle cache caches metadata like block locations for a file to avoid talking to the HDFS namenode. If you're repeatedly scanning the same files you'll likely get a lot of hits in that cache. 

 

BytesReadDataNodeCache tells you whether it's hitting the data cache. In this case it looks like it isn't. I'm not sure exactly what's going on. It's possible that the file handle cache is interacting with this - if you queried those files earlier, then turned on data caching, then re-queried them, the cached file handles may not have the metadata saying that data caching was enabled for those files.

 

HDFS file handle caching is off by default in Impala 2.7. It looks like you turned it on manually via the command line flag. Is it possible wasn't intentional?

Explorer
Posts: 11
Registered: ‎06-21-2017

Re: Impala hit-ratio and profile output inconsitent

Hi,

 

Thanks so much, they are looking at different things!

 

I turned it on via the -max_cached_file_handles=10000.

 

What I was really trying to do was to have queries hit a file in HDFS cache using the cached in statement however as you can see that BytesReadDataNodeCache is saying it is not hitting the cached data.

 

Any thought on how to identify why impala is not hitting the cached data?

Explorer
Posts: 11
Registered: ‎06-21-2017

Re: Impala hit-ratio and profile output inconsitent

with and without the -max_cached_file_handles=10000 it does not see that impala is hitting the data cache.

 

Any thoughts on how to best trouble shoot why impala does not hit the hdfs cache on queries?

Cloudera Employee
Posts: 385
Registered: ‎07-29-2015

Re: Impala hit-ratio and profile output inconsitent

It may make sense to look at which nodes the fragments are running on
versus the nodes that the blocks are cached on.
Explorer
Posts: 11
Registered: ‎06-21-2017

Re: Impala hit-ratio and profile output inconsitent

HI Tim,

 

I have just paired it down to one name node and one data node and set the cache replication=1, alas any query that is run from impala is still not hitting the cached data although all the table stats say it is cached.

 

Any thoughts welcomed!

Cloudera Employee
Posts: 385
Registered: ‎07-29-2015

Re: Impala hit-ratio and profile output inconsitent

I've been intending to think more about this. It looks like you're doing this right, I'm unsure why it isn't reporting cache reads (there is a probably a reason but it's not apparent to me right now). Just to confirm that the profile is correct, what do you see for the value of impala-server.io-mgr.cached-bytes-read on the metrics page?

Announcements