Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Impala hit-ratio and profile output inconsitent

Highlighted

Impala hit-ratio and profile output inconsitent

Explorer

We have enabled HDFS caching for our impala tables, however the impala-server.io.mgr.cached-file-handles-hit-ratio is Last (of ): 1. Min: , max: , avg: 0.92 which I beleive implies around 92% of requests are coming from the HDFS cachce, however this does not correlate with the profile as the BytesReadDataNodeCache is zero.

 

How can I deteremine if a given query is actually hitting the cache? and as a given table is showing as cached why is the BytesReadDataNodeCache=0 ?

 

Impala:

Impalad version 2.7.0-cdh5.10.0 RELEASE (build 785a073cd07e2540d521ecebb8b38161ccbd2aa2)

 

 

Impala settings:

IMPALA_SERVER_ARGS=" \

    -log_dir=${IMPALA_LOG_DIR} \

    -catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \

    -state_store_port=${IMPALA_STATE_STORE_PORT} \

    -use_statestore \

    -state_store_host=${IMPALA_STATE_STORE_HOST} \

    -max_cached_file_handles=10000 \

    -num_threads_per_disk=10 \

    -num_threads_per_core=10 \

    -num_disks=1 \

    -be_port=${IMPALA_BACKEND_PORT}"

 

HDFS settings:

<property>   

   <name>dfs.datanode.max.locked.memory</name>

    <value>1073741824</value>

  </property>

 

Query: show table stats latest_cached

+--------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+

| #Rows  | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                   |

+--------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+

| 474477 | 1      | 6.68MB | 6.68MB       | 1                 | PARQUET | false             | hdfs://namenode.ducolabs.co:9000/user/hive/warehouse/demo.db/latest_cached |

+--------+--------+--------+--------------+-------------------+---------+-------------------+----------------------------------------------------------------------------+

Fetched 1 row(s) in 0.11s

 

root@cff9e8ba4e88:/# sudo -u hdfs hdfs cacheadmin -listDirectives  -stats

Found 12 entries

ID POOL           REPL EXPIRY  PATH                                                                 BYTES_NEEDED  BYTES_CACHED  FILES_NEEDED  FILES_CACHED

 

12 one_gig_pool      1 never   /user/hive/warehouse/demo.db/latest_cached                                7008031       7008031             1             1

 

query:

SELECT

COUNT((CASE WHEN ((`latest_cached`.`status` = 'Partial match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `temp_calculation_792000225389371393__1012567293__0_`,

COUNT(DISTINCT (CASE WHEN ((`latest_cached`.`status` <> 'Match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `temp_calculation_792000225390415875__3425135312__0_`,

COUNT((CASE WHEN (`latest_cached`.`roll_up_relationship` <> 'Constituent') THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `temp_calculation_792000225390415875__3848799198__0_`,

COUNT((CASE WHEN ((`latest_cached`.`status` <> 'Match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `temp_temp_tc___1632483046__0___3041165593__0_`,

COUNT(`latest_cached`.`record_id`) AS `temp_temp_tc___1632483046__0___977950011__0_`,

COUNT((CASE WHEN ((`latest_cached`.`input` IN ('External','AWAY', 'Away', 'ENSO', 'EXT', 'POS CDS CITI ')) AND (`latest_cached`.`roll_up_relationship` <> 'Constituent') AND (`latest_cached`.`status` = 'Unmatched')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `cnt_calculation_548453999699308545_1_ok`,

COUNT((CASE WHEN ((`latest_cached`.`input` IN ('Internal','HOME', 'HOME ', 'Home', 'MAN', 'MAN CITI CDS')) AND (`latest_cached`.`roll_up_relationship` <> 'Constituent') AND (`latest_cached`.`status` = 'Unmatched')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `cnt_calculation_548453999699619842_1_ok`,

COUNT((CASE WHEN ((`latest_cached`.`status` = 'Manual match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`record_id` ELSE CAST(NULL AS INT) END)) AS `cnt_calculation_548453999700000772_1_ok`,

COUNT(DISTINCT (CASE WHEN ((`latest_cached`.`status` <> 'Match') AND (`latest_cached`.`roll_up_relationship` <> 'Constituent')) THEN `latest_cached`.`result_id` ELSE CAST(NULL AS STRING) END)) AS `ctd_calculation_548453999698743296_1_ok`,

`latest_cached`.`process_id` AS `process_id`, TRUNC(`latest_cached`.`update_timestamp`, 'DD') AS `tdy_update_timestamp_ok` FROM `latest_cached` `latest_cached`

GROUP BY 10,

 

Profile:

 

 

           - BytesRead: 4.17 MB (4373047)

           - BytesReadDataNodeCache: 0

      

6 REPLIES 6

Re: Impala hit-ratio and profile output inconsitent

Master Collaborator

impala-server.io.mgr.cached-file-handles* is actually reporting the number of hits in the HDFS file handle cache, which is a different cache from the HDFS data cache. The file handle cache caches metadata like block locations for a file to avoid talking to the HDFS namenode. If you're repeatedly scanning the same files you'll likely get a lot of hits in that cache. 

 

BytesReadDataNodeCache tells you whether it's hitting the data cache. In this case it looks like it isn't. I'm not sure exactly what's going on. It's possible that the file handle cache is interacting with this - if you queried those files earlier, then turned on data caching, then re-queried them, the cached file handles may not have the metadata saying that data caching was enabled for those files.

 

HDFS file handle caching is off by default in Impala 2.7. It looks like you turned it on manually via the command line flag. Is it possible wasn't intentional?

Re: Impala hit-ratio and profile output inconsitent

Explorer

Hi,

 

Thanks so much, they are looking at different things!

 

I turned it on via the -max_cached_file_handles=10000.

 

What I was really trying to do was to have queries hit a file in HDFS cache using the cached in statement however as you can see that BytesReadDataNodeCache is saying it is not hitting the cached data.

 

Any thought on how to identify why impala is not hitting the cached data?

Re: Impala hit-ratio and profile output inconsitent

Explorer

with and without the -max_cached_file_handles=10000 it does not see that impala is hitting the data cache.

 

Any thoughts on how to best trouble shoot why impala does not hit the hdfs cache on queries?

Re: Impala hit-ratio and profile output inconsitent

Master Collaborator
It may make sense to look at which nodes the fragments are running on
versus the nodes that the blocks are cached on.

Re: Impala hit-ratio and profile output inconsitent

Explorer

HI Tim,

 

I have just paired it down to one name node and one data node and set the cache replication=1, alas any query that is run from impala is still not hitting the cached data although all the table stats say it is cached.

 

Any thoughts welcomed!

Re: Impala hit-ratio and profile output inconsitent

Master Collaborator

I've been intending to think more about this. It looks like you're doing this right, I'm unsure why it isn't reporting cache reads (there is a probably a reason but it's not apparent to me right now). Just to confirm that the profile is correct, what do you see for the value of impala-server.io-mgr.cached-bytes-read on the metrics page?