Created on 06-10-2023 03:30 AM - edited on 06-13-2023 02:05 AM by VidyaSargur
Infra-Solr service exhibits fundamental stability issues after upgrading CDH to CDP.
Sample screenshot of Infra-Solr Health check errors
Sample screenshot of specific Infra-Solr Server Health check errors
The Infra-Solr service hosts the ranger_audits collection, which is used to display cluster audit information within the Ranger Admin UI. Perform a preliminary analysis for a single day using Ranger Admin UI - Audits, as demonstrated below. [NOTE: these sample screenshots were taken after resolving the issues; your audit counts will likely be much higher].
Total audits daily count: 136,553,808
Total Impala audits daily count: 5,901,146
Total hbaseRegional audits daily count: 1,178,831
Total hbaseRegional (access type scannerOpen) audits daily count: 0
(due to the complete exclusion of these events)
Total hdfs audits daily count: 128,681,418
Total hdfs (access type listStatus) audits daily count: 0
(due to the complete exclusion of these events)
Assemble and analyze audit counts. The actual pre-resolution values for this case study were:
Total number of Ranger audits - 705,875,710
Application - Impala - 6,719,878
Application - hbaseRegional - 389,896,166
Application - hbaseRegional; Access Type - scannerOpen - 261,735,436
Application - hdfs - 308,644,209
Application - hdfs; Access Type - listStatus - 212,728,345
The total count of Ranger audits (about 706 million) is excessive. Audit verbosity is a primary contributor to Infra-Solr service instability because Ranger audits are stored in an Infra-Solr collection, ranger_audits, and are served from there to the Ranger Admin UI. At this volume, the ranger_audits collection overwhelms the Infra-Solr servers, leading to Web Server Status Unknown / API Liveness check failures.
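As a quick sanity check on where the volume comes from, the pre-resolution counts listed above can be combined to show the share contributed by the two noisiest access types. This is only an arithmetic sketch using the case-study figures; the variable names are illustrative.

```shell
# Pre-resolution counts from this case study
total=705875710
hbase_scanner_open=261735436
hdfs_list_status=212728345

# Combined share of all audits produced by the two access types, as a percentage
share=$(awk -v t="$total" -v a="$hbase_scanner_open" -v b="$hdfs_list_status" \
  'BEGIN { printf "%.1f", (a + b) / t * 100 }')

echo "scannerOpen + listStatus: ${share}% of all Ranger audits"
```

Substituting your own counts from the Ranger Admin UI shows quickly whether a small number of access types dominate the collection.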
To reduce audit verbosity, identify meaningful and meaningless events using the Infra-Solr API.
URL examples for reference only:
Query by date/time range
Select all: oldest
Select all: newest
Curl examples for reference only:
# Query by date/time range && number of rows to capture (important)
# -g is required to disable globbing of the date range
# This is verbose
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=evtTime:[2022-02-17T00:00:00..." > RangerAuditSolrOutput17Feb22_100000Rows.text
# This is the above query, but narrowed down to fewer fields (those you want to see)
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Cenforcer%2Cagent%2..." > RangerAuditSolrOutput17Feb22_100000Rows.text
# This is the above query, but narrowed down to even fewer fields (those you want to see)
curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows.text
# Select all: oldest
curl --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+asc&rows=10..." > RangerAuditSolrOutput17Feb22.text
# Select all: newest
curl --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+desc&rows=1..." > RangerAuditSolrOutput17Feb22.text
In this case study, 48 curl commands were executed to get a balanced picture over a 24-hour period, pulling 100,000 audit events every 30 minutes.
NOTE: The Infra-Solr server must render the output, and 100,000+ events can easily crash a 30GB Infra-Solr server. Do not pull more than that for a single time interval.
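The 48 half-hourly pulls could be generated rather than typed by hand. The sketch below only prints the curl commands so they can be reviewed and run one at a time. The host name, the output file names, the UTC `Z` timestamps, and the choice of `fl=access,repo` are assumptions for illustration, not the exact commands used in this case study.

```shell
# Hypothetical values: replace with your own Infra-Solr host and day of interest.
SOLR_HOST="infra-solr-host.example.com:18983"
DAY="2022-02-17"
ROWS=100000   # upper bound per pull; see the note above about Solr memory

# Print one curl command per 30-minute window (48 windows in 24 hours).
# Nothing is executed here; the commands are only written to stdout.
print_audit_pulls() {
  local day="$1" host="$2" rows="$3" i start_min end_min start end
  for i in $(seq 0 47); do
    start_min=$(( i * 30 ))
    end_min=$(( start_min + 29 ))
    start=$(printf '%sT%02d:%02d:00Z' "$day" $(( start_min / 60 )) $(( start_min % 60 )))
    end=$(printf '%sT%02d:%02d:59.999Z' "$day" $(( end_min / 60 )) $(( end_min % 60 )))
    # Solr range syntax is [start TO end]; spaces are URL-encoded as %20.
    printf 'curl -g --negotiate -u: "http://%s/solr/ranger_audits/select?fl=access%%2Crepo&q=evtTime:[%s%%20TO%%20%s]&rows=%s" > RangerAudit_%s_%02d.text\n' \
      "$host" "$start" "$end" "$rows" "$day" "$i"
  done
}

print_audit_pulls "$DAY" "$SOLR_HOST" "$ROWS"
```

Printing instead of executing keeps the load on Infra-Solr under your control: run a few windows, confirm the server stays healthy, then continue.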
The 48 output files were simply parsed to ascertain the most frequent Ranger audit access types (see the example below when creating your own):
grep access RangerAuditSolrOutput17Feb22* | sort -rn | uniq -c | sort -rn | awk -F ' ' '{sum+=$1;}END{print sum;}'
grep -E "listStatus|scannerOpen" RangerAuditSolrOutput17Feb22* | sort -rn | uniq -c | sort -rn | awk -F ' ' '{sum+=$1;}END{print sum;}'
This example groups audit types by category to assist you in selecting what is meaningful and what is not:
grep access RangerAuditSolrOutput17Feb22_100000Rows.text | sort -rn | uniq -c | sort -rn
50517 "access":"listStatus",
23782 "access":"scannerOpen",
14559 "access":"get",
5193 "access":"put",
2081 "access":"open",
1884 "access":"delete",
1394 "access":"WRITE",
336 "access":"rename",
126 "access":"contentSummary",
84 "access":"checkAndPut",
26 "access":"mkdirs",
6 "access":"compactSelection",
5 "access":"flush",
4 "access":"getAclStatus",
3 "access":"compact",
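To make the grouping easier to compare across files, the same tally can also print each access type's share of the sample. This is a minimal sketch; the function name `tally_access_types` is illustrative, and it assumes each audit document contains one `"access":"<type>"` pair formatted as in the output above.

```shell
# Count each "access" value in one or more Solr output files and print
# the count, the percentage of all matched events, and the access type.
tally_access_types() {
  grep -oh '"access":"[^"]*"' "$@" \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ count[NR] = $1; name[NR] = $2; total += $1 }
           END { for (i = 1; i <= NR; i++)
                   printf "%8d  %5.1f%%  %s\n", count[i], count[i] / total * 100, name[i] }'
}
```

Call it as `tally_access_types RangerAuditSolrOutput17Feb22*`. For the 100,000-row sample shown above, listStatus accounts for about 50.5% of the events and scannerOpen for about 23.8%.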
In this case study, up to 1B audit events were being recorded per day, with 65-70% coming from HDFS listStatus and HBase scannerOpen. These pure metadata-operation events were meaningless to DevOps; nevertheless, we verified that they were also meaningless to the business before attempting to exclude them. Retain the ‘get’, ‘put’, ‘open’, ‘delete’, and other key audits.
Assess the Infra-Solr and ranger_audits collection design (Infra-Solr server count, shard count, and replica count), which plays an important role in stability. This complementary document covers those assessment steps: Ranger - Rebuild ranger_audits.
Tune Ranger to exclude unwanted event collection.
Edit the cm_hdfs service configuration:
Exclude the ‘listStatus’ audit type from the ‘Audit Filter’ section:
Edit the cm_hbase service configuration:
Exclude the ‘scannerOpen’ audit type from the ‘Audit Filter’ section:
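Once the filters are in place, one way to confirm that the excluded events have stopped arriving is a count-only query against ranger_audits. The sketch below only prints the curl commands. The host name and the one-hour look-back are assumptions; `rows=0` asks Solr to return just the match count (`numFound`) without rendering any documents.

```shell
# Hypothetical host: replace with one of your Infra-Solr servers.
SOLR_HOST="infra-solr-host.example.com:18983"

# Build a count-only query URL for one access type since a given time.
# "since" may use Solr date math, for example NOW-1HOUR.
count_url() {
  local access_type="$1" since="$2"
  printf 'http://%s/solr/ranger_audits/select?q=access:%s%%20AND%%20evtTime:[%s%%20TO%%20NOW]&rows=0\n' \
    "$SOLR_HOST" "$access_type" "$since"
}

# Print (do not execute) one curl command per excluded access type.
for t in listStatus scannerOpen; do
  echo "curl -g --negotiate -u: \"$(count_url "$t" "NOW-1HOUR")\""
done
```

A `numFound` of 0 for the window after the change would be consistent with the zero counts shown in the Ranger Admin UI earlier in this article.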
Excluding meaningless events provided three benefits: