Cloudera Data Analytics (CDA) Articles

Summary

The Infra-Solr service exhibits fundamental stability issues after upgrading from CDH to CDP.

 

Sample screenshot of Infra-Solr Health check errors

MichaelBush_0-1686392301559.png

Sample screenshot of specific Infra-Solr Server Health check errors

MichaelBush_1-1686392301543.png

Investigation

The Infra-Solr service hosts the ranger_audits collection, which is used to display cluster audit information within the Ranger Admin UI. Perform a preliminary analysis in the Ranger Admin UI under Audits, filtered to a single day, as demonstrated below. [NOTE: These sample screenshots were taken after resolving the issues; your audit counts will likely be much higher.]

Total audits daily count: 136,553,808

MichaelBush_2-1686392301577.png


Total Impala audits daily count: 5,901,146

MichaelBush_3-1686392301497.png

 

Total hbaseRegional audits daily count: 1,178,831

MichaelBush_4-1686392301582.png

Total hbaseRegional (access type scannerOpen) audits daily count: 0

(due to the complete exclusion of these events)

MichaelBush_5-1686392301578.png

Total hdfs audits daily count: 128,681,418

MichaelBush_6-1686392301561.png

Total hdfs (access type listStatus) audits daily count: 0

(due to the complete exclusion of these events)

MichaelBush_7-1686392301566.png

Assemble and analyze audit counts. The actual pre-resolution values for this case study were:

Total number of Ranger audits - 705,875,710

Application - Impala - 6,719,878

Application - hbaseRegional - 389,896,166

     Application - hbaseRegional; Access Type - scannerOpen - 261,735,436

Application - hdfs - 308,644,209

     Application - hdfs; Access Type - listStatus - 212,728,345

The total count of Ranger audits (roughly 700M) is excessive. Audit verbosity is a primary contributing factor to Infra-Solr service instability because Ranger audits are stored in an Infra-Solr collection, ranger_audits, which also backs the audit views in the Ranger Admin UI. At this volume, the ranger_audits collection overwhelms the Infra-Solr servers, leading to Web Server Status Unknown and API Liveness check failures.
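As a quick check on where the volume comes from, the combined share of the two access types broken out above can be computed from the listed counts. This is only a sketch: noise_share is a helper name made up for illustration, not a Cloudera tool, and the inputs are the pre-resolution figures from this case study.

```shell
# noise_share TOTAL PART...: print the combined share of the PART counts
# as a percentage of TOTAL. (Hypothetical helper name, not a Cloudera tool.)
noise_share() {
  local total="$1"
  shift
  printf '%s\n' "$@" |
    awk -v total="$total" '{ sum += $1 } END { printf "%.1f%%\n", 100 * sum / total }'
}

# Pre-resolution counts from this case study, in order:
# total audits, hbaseRegional scannerOpen, hdfs listStatus
noise_share 705875710 261735436 212728345
```

For these figures the result is 67.2%, so roughly two thirds of all audits come from just two access types.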

 

To reduce audit verbosity, first use the Infra-Solr API to identify which events are meaningful and which are not.

 

URL examples for reference only:

Query by date/time range

http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=evtTime:[2022-02-16T00:00:00...

select all: oldest

http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+asc&rows=10...

select all: newest

http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+desc&rows=1...

Curl examples for reference only:

> Query by date/time range && number of rows to capture (important)

> -g required to disable globbing of the date range

> This is verbose

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=evtTime:[2022-02-17T00:00:00..." > RangerAuditSolrOutput17Feb22_100000Rows.text

> This is the above query, but returning only the fields that you want to see

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Cenforcer%2Cagent%2..." > RangerAuditSolrOutput17Feb22_100000Rows.text

> This is the above query, but narrowing down to even fewer fields

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows.text

> Select all: oldest

curl --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+asc&rows=10..." > RangerAuditSolrOutput17Feb22.text

> Select all: newest

curl --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?q=*:*&sort=evtTime+desc&rows=1..." > RangerAuditSolrOutput17Feb22.text

 

In this case study, 48 curl commands were executed to get a balanced picture over a 24-hour period: one command per 30-minute window, each pulling up to 100,000 audit events.

 

NOTE: The Infra-Solr server must render the output itself; requesting more than 100,000 events can easily crash a 30GB Infra-Solr server. Do not pull more than that for a single time interval.

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_2330-2359.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_2300-2329.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_2230-2259.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_2200-2229.text

Repeat the command for each remaining 30-minute window, adjusting the time range and the output file name.

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_0130-0159.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_0100-0129.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_0030-0059.text

curl -g --negotiate -u: "http://lannister-005.edh.cloudera.com:18983/solr/ranger_audits/select?fl=access%2Crepo&q=evtTime:[20..." > RangerAuditSolrOutput17Feb22_100000Rows_0000-0029.text
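Typing 48 near-identical commands by hand is error-prone, so the windows can be generated in a loop instead. The sketch below only prints the commands for review. The host name, the DAY and OUT_PREFIX defaults, and the helper name audit_windows are placeholders. The query string is an assumed reconstruction built from standard Solr parameters (fl, rows and a range query on evtTime), not the exact URLs used in this case study.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) one curl command per 30-minute window of a day.
# SOLR_BASE, DAY and OUT_PREFIX are placeholders; the query parameters are an
# assumed reconstruction, not the exact URLs used in this case study.
SOLR_BASE="${SOLR_BASE:-http://solr-host.example.com:18983/solr/ranger_audits}"
DAY="${DAY:-2022-02-17}"
ROWS="${ROWS:-100000}"
OUT_PREFIX="${OUT_PREFIX:-RangerAuditSolrOutput17Feb22_${ROWS}Rows}"

# audit_windows DAY: emit "START END LABEL" for each of the 48 windows.
audit_windows() {
  local day="$1" h m
  for h in $(seq 0 23); do
    for m in 0 30; do
      printf '%sT%02d:%02d:00Z %sT%02d:%02d:59Z %02d%02d-%02d%02d\n' \
        "$day" "$h" "$m" "$day" "$h" "$((m + 29))" \
        "$h" "$m" "$h" "$((m + 29))"
    done
  done
}

audit_windows "$DAY" | while read -r start end label; do
  url="${SOLR_BASE}/select?fl=access,repo&rows=${ROWS}&q=evtTime:[${start}%20TO%20${end}]"
  # Dry run: review the printed commands, then run them one at a time.
  echo "curl -g --negotiate -u: \"${url}\" > ${OUT_PREFIX}_${label}.text"
done
```

Printing rather than executing is deliberate: given the NOTE above about crashing an Infra-Solr server, it is safer to run the generated commands one at a time and watch the server between requests.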

The 48 output files were then parsed to find the most frequent Ranger audit access types (adapt the examples below when creating your own):

> Total number of audit events captured across all output files

grep -h access RangerAuditSolrOutput17Feb22* | sort | uniq -c | sort -rn | awk '{sum+=$1} END{print sum}'

> Number of those events that are listStatus or scannerOpen

grep -hE "listStatus|scannerOpen" RangerAuditSolrOutput17Feb22* | sort | uniq -c | sort -rn | awk '{sum+=$1} END{print sum}'

This example groups audit types by category to assist you in selecting what is meaningful and what is not:

grep access RangerAuditSolrOutput17Feb22_100000Rows.text | sort | uniq -c | sort -rn

50517         "access":"listStatus",

23782         "access":"scannerOpen",

14559         "access":"get",

5193         "access":"put",

2081         "access":"open",

1884         "access":"delete",

1394         "access":"WRITE",

 336         "access":"rename",

 126         "access":"contentSummary",

  84         "access":"checkAndPut",

  26         "access":"mkdirs",

   6         "access":"compactSelection",

   5         "access":"flush",

   4         "access":"getAclStatus",

   3         "access":"compact",
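When comparing categories, a percentage next to each count can make the dominant access types easier to spot. The following is a minimal sketch, assuming the output files contain one "access":"..." pair per event as in the listing above; access_share is a hypothetical helper name.

```shell
# access_share: read Solr output on stdin and print, for each access type,
# its count and its percentage of all "access" entries, most frequent first.
# (Hypothetical helper name; assumes one "access":"..." pair per event.)
access_share() {
  grep -o '"access":"[^"]*"' |
    awk -F'"' '{ n[$4]++; total++ }
         END { for (a in n) printf "%d %.1f%% %s\n", n[a], 100 * n[a] / total, a }' |
    sort -rn
}

# Example: summarize one output file if it exists in the current directory.
f="RangerAuditSolrOutput17Feb22_100000Rows.text"
if [ -f "$f" ]; then
  access_share < "$f"
fi
```

The counts in the listing above sum to exactly 100,000, so listStatus and scannerOpen come to 50.5% and 23.8% of that sample respectively.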

In this case study, up to 1B audit events were being recorded per day, with 65-70% coming from HDFS listStatus and HBase scannerOpen. These pure metadata operations were meaningless to DevOps; nevertheless, we verified that they were also meaningless to the business before attempting to exclude them. Retain the ‘get’, ‘put’, ‘open’, ‘delete’, and other key audits.

 

Assess the design of Infra-Solr and the ranger_audits collection (Infra-Solr server count, shard count, and replica count), which plays an important role in stability. This complementary document covers those assessment steps: Ranger - Rebuild ranger_audits.

Resolution

Tune Ranger to stop collecting the unwanted events.

Edit the cm_hdfs service configuration:

MichaelBush_8-1686392301568.png

Exclude the ‘listStatus’ audit type from the ‘Audit Filter’ section:

MichaelBush_9-1686392301587.png

Edit the cm_hbase service configuration:

MichaelBush_10-1686392301562.png

Exclude the ‘scannerOpen’ audit type from the ‘Audit Filter’ section:

MichaelBush_11-1686392301583.png
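One way to confirm that the exclusions have taken effect is a count-only Solr query per excluded access type, which avoids rendering any rows on the Infra-Solr server. The sketch below only prints the curl commands. It assumes the access and evtTime fields seen in the earlier queries; the host name is a placeholder, count_url is a hypothetical helper, and rows=0 with NOW-1DAY date math is standard Solr syntax rather than something taken from the original case study.

```shell
#!/usr/bin/env bash
# Hypothetical host; replace with one of your Infra-Solr servers.
SOLR_BASE="${SOLR_BASE:-http://solr-host.example.com:18983/solr/ranger_audits}"

# count_url ACCESS: build a count-only query (rows=0) for one access type
# over the last 24 hours. (Hypothetical helper name.)
count_url() {
  printf '%s/select?rows=0&q=access:%s%%20AND%%20evtTime:[NOW-1DAY%%20TO%%20NOW]\n' \
    "$SOLR_BASE" "$1"
}

for access in listStatus scannerOpen; do
  # Dry run: print the command for review; numFound in the response is the count.
  echo "curl -g --negotiate -u: \"$(count_url "$access")\""
done
```

Once the filters have been active for a full day, numFound in each response should be 0, matching the zero daily counts reported for these access types in the Investigation section.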

Excluding these events provided three benefits:

  • Stability of Infra-Solr and the ranger_audits collection improved greatly, which also made them easier to manage.
  • Infra-Solr and the ranger_audits collection required only 30-35% of the resources to perform the same tasks.
  • Ranger audit history required only 30-35% of the HDFS disk space when writing to /ranger/….