Cloudera Data Analytics (CDA) Articles

VidyaSargur · ‎06-10-2023

Summary

The Infra-Solr service exhibits fundamental stability issues after upgrading from CDH to CDP.

Sample screenshot of Infra-Solr Health check errors

Sample screenshot of specific Infra-Solr Server Health check errors

Investigation

The Infra-Solr service hosts the ranger_audits collection, which is used to display cluster audit information within the Ranger Admin UI. Perform a preliminary analysis using Ranger Admin UI - Audits for a single day, as demonstrated below. [NOTE: These sample screenshots were taken after resolving the issues; your audit counts will likely be much higher.]

Total audits daily count: 136,553,808

Total Impala audits daily count: 5,901,146

Total hbaseRegional audits daily count: 1,178,831

Total hbaseRegional (access type scannerOpen) audits daily count: 0 (due to the complete exclusion of these events)

Total hdfs audits daily count: 128,681,418

Total hdfs (access type listStatus) audits daily count: 0 (due to the complete exclusion of these events)

Assemble and analyze audit counts. The actual pre-resolution values for this case study were:

Total number of Ranger audits - 705,875,710

Application - Impala - 6,719,878

Application - hbaseRegional - 389,896,166

Application - hbaseRegional; Access Type - scannerOpen - 261,735,436

Application - hdfs - 308,644,209

Application - hdfs; Access Type - listStatus - 212,728,345

The total count of Ranger audits (about 706M) is excessive. Audit verbosity is a primary contributor to Infra-Solr service instability because Ranger audits are stored in an Infra-Solr collection, ranger_audits, from which the Ranger Admin UI presents them. The ranger_audits collection overwhelms the Infra-Solr servers, leading to Web Server Status Unknown / API Liveness check failures.
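To put the counts above in proportion, the following minimal sketch computes how much of the total volume comes from the two dominant access types. The figures are copied from this case study; the variable names and the pct helper are illustrative only.

```shell
#!/bin/sh
# Pre-resolution counts copied from the case study figures above.
total=705875710
hbase_scanner_open=261735436
hdfs_list_status=212728345

# Percentage of a part relative to a whole, to one decimal place.
# awk is used because POSIX shell arithmetic is integer-only.
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f", 100 * part / whole }'
}

noisy=$((hbase_scanner_open + hdfs_list_status))
remaining=$((total - noisy))

echo "listStatus + scannerOpen: $noisy ($(pct "$noisy" "$total")% of all audits)"
echo "all other audits:         $remaining ($(pct "$remaining" "$total")% of all audits)"
```

For these figures, the two access types account for about 67.2% of all audits, leaving about 32.8% for everything else.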

To reduce audit verbosity, identify meaningful and meaningless events using the Infra-Solr API.

URL examples for reference only:

Query by date/time range

http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=evtTime:[2022-02-16T00:00:00...

select all: oldest

http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+asc&rows=10...

select all: newest

http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+desc&rows=1...

Curl examples for reference only:

> Query by date/time range and limit the number of rows to capture (important)

> -g is required to disable curl's globbing of the bracketed date range

> This returns every field, so the output is verbose

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=evtTime:[2022-02-17T00:00:00..." > RangerAuditSolrOutput17Feb22_100000Rows.text

> This is the above query, but narrowed down to fewer fields (only those you want to see)

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Cenforcer%2Cagent%2..." > RangerAuditSolrOutput17Feb22_100000Rows.text

> This is the above query, but narrowed down to even fewer fields (only those you want to see)

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows.text

> Select all: oldest

curl --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+asc&rows=10..." > RangerAuditSolrOutput17Feb22.text

> Select all: newest

curl --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+desc&rows=1..." > RangerAuditSolrOutput17Feb22.text
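As a lighter alternative to pulling raw rows, Solr's standard faceting parameters can return per-access-type counts without returning any documents. This is not the method used in this case study; it is a sketch under assumptions. The host name is hypothetical, the time range is an illustrative value, and faceting over a very large collection can itself be expensive, so try it on a narrow time window first. The script only prints the curl command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Hypothetical host; substitute one of your own Infra-Solr servers.
SOLR_BASE="http://solr-host.example.com:18983/solr/ranger_audits"

# Build a facet URL for one time window. rows=0 suppresses the documents
# themselves, so only the counts per access type are returned.
facet_url() {
    start=$1
    end=$2
    printf '%s/select?q=evtTime:[%s+TO+%s]&rows=0&facet=true&facet.field=access&facet.mincount=1&wt=json\n' \
        "$SOLR_BASE" "$start" "$end"
}

# Illustrative half-hour window (an assumption, not taken from the article).
url=$(facet_url 2022-02-17T00:00:00Z 2022-02-17T00:29:59Z)

# Dry run: print the command rather than executing it.
echo "curl -g --negotiate -u: \"$url\""
```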

In this case study, 48 curl commands were executed to get a balanced picture over a 24-hour period, pulling 100,000 audit events every 30 minutes.

NOTE: The Infra-Solr server must render the output, and a request for more than 100,000 events can easily crash a 30GB Infra-Solr Server. Do not pull more than 100,000 events per time interval.

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_2330-2359.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_2300-2329.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_2230-2259.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_2200-2229.text

Repeat the command for each remaining 30-minute interval, adjusting the time range and the output file name each time.

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_0130-0159.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_0100-0129.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_0030-0059.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_0000-0029.text
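Typing 48 commands by hand is error-prone, so the half-hour windows can be generated instead. The sketch below only prints each time range next to the output file name it belongs to, following the naming pattern used above; it does not contact Solr. The function name, the Z-suffixed timestamp format, and the range notation are assumptions for illustration, so adapt them to the query you have already verified by hand.

```shell
#!/bin/sh
# Emit one "START END SUFFIX" line per half-hour window of a given day,
# e.g. "2022-02-17T00:00:00Z 2022-02-17T00:29:59Z 0000-0029".
half_hour_windows() {
    day=$1
    h=0
    while [ "$h" -lt 24 ]; do
        hh=$(printf '%02d' "$h")
        for m in 00 30; do
            # A window starting at :00 ends at :29; one starting at :30 ends at :59.
            if [ "$m" = 00 ]; then e=29; else e=59; fi
            printf '%sT%s:%s:00Z %sT%s:%s:59Z %s%s-%s%s\n' \
                "$day" "$hh" "$m" "$day" "$hh" "$e" "$hh" "$m" "$hh" "$e"
        done
        h=$((h + 1))
    done
}

# Dry run: list each time range and the file its results would be written to.
half_hour_windows 2022-02-17 | while read -r start end suffix; do
    printf 'evtTime:[%s TO %s] -> RangerAuditSolrOutput17Feb22_100000Rows_%s.text\n' \
        "$start" "$end" "$suffix"
done
```

Reviewing the printed list before running anything makes it easy to confirm that the windows neither overlap nor leave gaps.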

The 48 output files were simply parsed to ascertain the most frequent Ranger audit access types (see the example below when creating your own):

> Total number of access lines across all output files

grep -h access RangerAuditSolrOutput17Feb22* | sort | uniq -c | sort -rn | awk '{sum+=$1} END{print sum}'

> Number of those lines that are listStatus or scannerOpen

grep -hE "listStatus|scannerOpen" RangerAuditSolrOutput17Feb22* | sort | uniq -c | sort -rn | awk '{sum+=$1} END{print sum}'

This example groups audit types by category to assist you in selecting what is meaningful and what is not:

grep access RangerAuditSolrOutput17Feb22_100000Rows.text | sort | uniq -c | sort -rn

50517 "access":"listStatus",

23782 "access":"scannerOpen",

14559 "access":"get",

5193 "access":"put",

2081 "access":"open",

1884 "access":"delete",

1394 "access":"WRITE",

336 "access":"rename",

126 "access":"contentSummary",

84 "access":"checkAndPut",

26 "access":"mkdirs",

6 "access":"compactSelection",

5 "access":"flush",

4 "access":"getAclStatus",

3 "access":"compact",
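When comparing many output files, it can help to see each access type's share of the total alongside its count. The following is a minimal sketch that assumes the output contains lines of the form "access":"listStatus", as in the listing above. The function name and the sample data are made up for illustration, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Tally access types from lines such as:  "access":"listStatus",
# and print count, share of total, and name, most frequent first.
access_breakdown() {
    grep -h -o '"access":"[^"]*"' "$@" |
        awk -F'"' '
            { count[$4]++; total++ }
            END {
                for (a in count)
                    printf "%d %.1f%% %s\n", count[a], 100 * count[a] / total, a
            }' |
        sort -rn
}

# Demonstration on a tiny made-up sample; point it at your own output files instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
        "access":"listStatus",
        "repo":"cm_hdfs"},
        "access":"listStatus",
        "access":"scannerOpen",
        "access":"get",
EOF
access_breakdown "$sample"
rm -f "$sample"
```

To run it against the real data, pass the output files as arguments, for example access_breakdown RangerAuditSolrOutput17Feb22*.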

In this case study, up to 1B audit events were being recorded per day, with 65-70% coming from HDFS listStatus and HBase scannerOpen. These pure metadata operations were meaningless to DevOps; nevertheless, we verified that they were also meaningless to the business before attempting to exclude them. Retain the ‘get’, ‘put’, ‘open’, ‘delete’, and other key audits.

Assess the Infra-Solr and ranger_audits collection design (Infra-Solr server count, shard count, and replica count), which plays an important role in stability. This complementary document covers those assessment steps: Ranger - Rebuild ranger_audits.

Resolution

Tune Ranger to exclude unwanted event collection.

Edit the cm_hdfs service configuration:

In the ‘Audit Filter’ section, exclude the ‘listStatus’ access type from auditing:

Edit the cm_hbase service configuration:

In the ‘Audit Filter’ section, exclude the ‘scannerOpen’ access type from auditing:

Excluding these meaningless events provided three benefits:

Stability of Infra-Solr and the ranger_audits collection improved greatly, which also made them easier to manage.
Infra-Solr and the ranger_audits collection required only 30-35% of the resources to perform the same tasks.
Ranger audit history required only 30-35% of the HDFS disk space when writing to /ranger/….


Review and Optimize Ranger Audit Verbosity
