Migrating HDP clusters to CDP is a journey many customers are going through these days.

Migrating the Ranger audit and Atlas collections stored in infra-solr to CDP has been a particularly challenging task.

We hope the steps below will simplify your journey.

Preparation:

Sample API calls to get the current status of the collections in infra-solr. This is an important step: it shows how big these collections are and, as a result, gives an idea of how long the migration will take.

Note: In the commands below,

  • -k tells curl to skip certificate verification; it is only needed when the connection is over https
  • change http to https if the connection is secure
  • check that the port is correct for the version of infra-solr you are running, or for any customized port

Infra-Solr API queries

  • Gather the list of collections in infra-solr

 

### To list collections
curl --negotiate -u: -k 'http://solr_host:solr_port/solr/admin/collections?action=LIST'

 

 

  • Get the total number of records in an infra-solr collection, for example ranger_audits

 

### To get the total record count in a collection (rows=0 returns the count without documents)
curl --negotiate -u: -k 'http://solr_host:solr_port/solr/ranger_audits/query?q=*:*&rows=0'
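
The total count is returned in the numFound field of the JSON response. If python3 is available on the host (an assumption, not part of the original steps), a quick way to extract just the number is:

### Extract numFound from the count query (assumes python3 is installed on the host)
curl -s --negotiate -u: -k 'http://solr_host:solr_port/solr/ranger_audits/query?q=*:*&rows=0' | python3 -c "import sys, json; print(json.load(sys.stdin)['response']['numFound'])"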

 

 

  • Get the first and last record of the collection, for example ranger_audits

 

### First record
curl --negotiate -u: -k "http://solr_host:solr_port/solr/ranger_audits/query?q=*:*&rows=1&sort=evtTime%20asc"

### Last record
curl --negotiate -u: -k "http://solr_host:solr_port/solr/ranger_audits/query?q=*:*&rows=1&sort=evtTime%20desc"

 

 

  • Get the number of records per day; this helps estimate the load per day.

 

### Audit count for each day (faceted by evtTime over the last 30 days)
curl --negotiate -u: -k "http://solr_host:solr_port/solr/ranger_audits/query?q=*:*&rows=1&facet.range=evtTime&facet=true&facet.range.start=NOW/DAY-30DAY&facet.range.end=NOW/DAY&facet.range.gap=%2B1DAY"
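
In the response, facet_counts.facet_ranges.evtTime.counts is a flat list that alternates day and count. A small helper to print one line per day (a sketch, assuming python3 is available on the host):

### Print per-day audit counts from the facet response (assumes python3 is installed on the host)
curl -s --negotiate -u: -k "http://solr_host:solr_port/solr/ranger_audits/query?q=*:*&rows=0&facet.range=evtTime&facet=true&facet.range.start=NOW/DAY-30DAY&facet.range.end=NOW/DAY&facet.range.gap=%2B1DAY" | python3 -c "
import sys, json
counts = json.load(sys.stdin)['facet_counts']['facet_ranges']['evtTime']['counts']
for day, count in zip(counts[0::2], counts[1::2]):
    print(day, count)
"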

 

Approach 1: Following the Cloudera HDP to CDP upgrade documentation


  • Pre-HDP upgrade tasks

Backup Ambari Infra Solr

 

### Pick an infra-solr server host
### Upgrade the infra-solr client to the latest version
yum upgrade ambari-infra-solr-client -y
export CONFIG_INI_LOCATION=/root/ambari_solr_migration.ini

### Generate the config file
/usr/bin/python /usr/lib/ambari-infra-solr-client/migrationConfigGenerator.py  --ini-file $CONFIG_INI_LOCATION  --host=ambari_hostname --port=8080  --cluster=hsbcap2  --username=admin  --password=****  --backup-base-path=/root/hdp_solr_backup  --java-home=/usr/lib/jvm/jre-1.8.0-openjdk/

### Back up the infra-solr collections
/usr/lib/ambari-infra-solr-client/ambariSolrMigration.sh --ini-file $CONFIG_INI_LOCATION --mode backup | tee backup_output.txt

### Delete the collections; upgrading the infra-solr clients and servers and restarting Ranger and Atlas will recreate the collections
/usr/lib/ambari-infra-solr-client/ambariSolrMigration.sh --ini-file $CONFIG_INI_LOCATION --mode delete | tee delete_output.txt

 

  • Post-HDP upgrade tasks

Ambari infra-migrate and restore

 

### Export the config file location
export CONFIG_INI_LOCATION=/root/ambari_solr_migration.ini

### Restore the collections
nohup /usr/lib/ambari-infra-solr-client/ambariSolrMigration.sh --ini-file $CONFIG_INI_LOCATION --mode migrate-restore
nohup /usr/lib/ambari-infra-solr-client/ambariSolrMigration.sh --ini-file $CONFIG_INI_LOCATION --mode transport

 

Backup Infra Solr collections

 

### Backing up the Solr collections
### Export the config file location
export CONFIG_INI_LOCATION=/root/ambari_solr_migration-cdp.ini

### Generate the config file
/usr/bin/python /usr/lib/ambari-infra-solr-client/migrationConfigGenerator.py  --ini-file $CONFIG_INI_LOCATION  --host ambari_hostname --port 8080  --cluster clustername --username admin  --password *****  --backup-base-path /root/hdp_solr_backup_new  --java-home /usr/lib/jvm/jre-1.8.0-openjdk/ --hdfs-base-path /opt/solrdata

### Back up the Solr collections
/usr/lib/ambari-infra-solr-client/migrationHelper.py --ini-file $CONFIG_INI_LOCATION --action backup

### Move the backup data to HDFS
/usr/lib/ambari-infra-solr-client/migrationHelper.py --ini-file $CONFIG_INI_LOCATION --action copy-to-hdfs
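
To confirm the copy-to-hdfs step succeeded, listing the HDFS target path (the same --hdfs-base-path passed to the config generator above) is a quick sanity check; this is just a suggested verification, not part of the Cloudera tooling:

### Sanity-check that the backup landed under the --hdfs-base-path used above
hdfs dfs -ls -R /opt/solrdata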

 

  • Post CDP tasks

amb_posttransition_solr

 

### Export the config file location
export CONFIG_INI_LOCATION=/root/ambari_solr_migration-cdp.ini

### Note: make sure the ini file has been adjusted to the CDP Solr URL, znode location, and any other applicable properties

### Change the ownership of the backup data in HDFS
python /root/am2cm/restore_collections.py --ini-file $CONFIG_INI_LOCATION --action change-ownership-in-hdfs

### Delete the newly created (empty) Solr collections
python /root/am2cm/restore_collections.py --ini-file $CONFIG_INI_LOCATION --action delete-new-solr-collections

### Restore the Solr collections
python /root/am2cm/restore_collections.py --ini-file $CONFIG_INI_LOCATION --action full-restore

 

Approach 2: Backing up infra-solr data before your upgrade and restoring it after the CDP upgrade (speed: ~1 million records in 7 minutes)

  • Back up the collections in infra-solr using the solrDataManager.py script.
  • This approach backs up records in blocks of 100,000 (0.1 million) and deletes each block from infra-solr once it is saved, which offloads data from infra-solr as the backup progresses.
  • The script can be run in save, archive, or delete mode (archive saves and then deletes).
  • Adjust END_DATE accordingly if you wish to run the script multiple times.
  • The average backup speed is about 1 million records in 7 minutes.
  • Run it in nohup mode for collections with more than 10 million records (see the sketch after the script in Step 1).


Step 1:

The backup of the infra-solr ranger_audits collection can be taken at any time, even multiple times, before the Ambari upgrade. Set END_DATE accordingly when taking the backup.

 

### Shell script: collection_local.sh
# Init values:
SOLR_URL=http://solr_host:solr_port/solr

END_DATE=2021-06-25T12:00:00.000Z

OLD_COLLECTION=ranger_audits
LOCAL_PATH=/home/solr/backup/ranger/ranger_audits/data

# comma-separated list of fields to exclude; at least _version_ is required
EXCLUDE_FIELDS=_version_

# provide these with the -k and -n options only if Kerberos is enabled for Infra Solr !!!
INFRA_SOLR_KEYTAB=/etc/security/keytabs/ambari-infra-solr.service.keytab
INFRA_SOLR_PRINCIPAL=infra-solr/$(hostname -f)@REALM

DATE_FIELD=evtTime

# -m MODE, --mode=MODE  archive | delete | save
MODE=archive

/usr/lib/ambari-infra-solr-client/solrDataManager.py -m $MODE -v -c $OLD_COLLECTION -s $SOLR_URL -z none -r 100000 -w 100000 -f $DATE_FIELD -e $END_DATE -x $LOCAL_PATH -k $INFRA_SOLR_KEYTAB -n $INFRA_SOLR_PRINCIPAL --exclude-fields $EXCLUDE_FIELDS

 

Note: EXCLUDE_FIELDS is not available in the infra-solr scripts that ship with Ambari 2.6.*. Please upgrade your infra-solr client or remove the exclude option from the script.
## i.e., remove --exclude-fields $EXCLUDE_FIELDS
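
For collections with more than ~10 million records, the script above can be run in the background so a dropped SSH session does not interrupt the backup. A minimal sketch, assuming the script was saved as /root/collection_local.sh (the path and log file name are examples):

### Run collection_local.sh in the background and capture its output
nohup bash /root/collection_local.sh > /root/collection_local.log 2>&1 &
tail -f /root/collection_local.log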

Step 2:

During the Ambari wizard upgrade of HDP to the Ambari-managed interim HDP-7.1.x version of CDP, you do not need to back up or restore collections.

Step 3:

After transitioning to Cloudera Manager running CDP, and once all the services are started, we can trigger the restore script.

  • Ensure the Ranger collection is created before running this script.
  • The restore runs at about 1 million records every 3 minutes.

 

# Restoring the backed-up data into Solr
# Init values:
SOLR_URL=http://solr_host:solr_port/solr

COLLECTION=ranger_audits
DIR_NAME=/home/solr/backup/ranger/ranger_audits/data

# provide these with -k and -n options only if kerberos is enabled for Infra Solr !!!
INFRA_SOLR_KEYTAB=/etc/security/keytabs/ambari-infra-solr.service.keytab
INFRA_SOLR_PRINCIPAL=infra-solr/$(hostname -f)@REALM

for FILE_NAME in $DIR_NAME/*.json
do
	echo "Uploading file to solr - $FILE_NAME"
	curl -k --negotiate -u : -H "Content-type:application/json" "$SOLR_URL/$COLLECTION/update/json/docs?commit=true&wt=json" --data-binary @$FILE_NAME
done
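
After the loop completes, the restored record count can be compared with the count gathered in the Preparation section; the same count query works against the CDP-side Solr (reusing the variables defined above):

### Verify the restored record count matches the pre-upgrade count
curl -k --negotiate -u : "$SOLR_URL/$COLLECTION/query?q=*:*&rows=0"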

 

Approach 3: Dumping the infra-solr collections before the HDP upgrade and restoring them after the CDP upgrade (speed: ~1 million records in 1 minute)

  • Back up the collections in infra-solr using the solrCloudCli.sh script.
  • The average backup speed is roughly 1 million records per minute.
  • Run it in nohup mode for collections with more than 15 million records (or fewer, to make sure the script does not terminate if you lose connectivity).
  • This script only works if your Ambari infra-solr client is version 2.7.x or higher.


Step 1:

Take the dump of the infra-solr collections before starting the Ambari upgrade.

 

 

### Get a Kerberos ticket
kinit -kt /etc/security/keytabs/ambari-infra-solr.service.keytab infra-solr/`hostname -f`@REALM

### Take a dump of your collection, e.g. ranger_audits (speed ~1 million records per minute)
/usr/lib/ambari-infra-solr-client/solrCloudCli.sh --zookeeper-connect-string zookeeper_host:2181/infra-solr --jaas-file /etc/ambari-infra-solr/conf/infra_solr_jaas.conf --dump-documents --collection ranger_audits --output /home/solr/backup/ranger/ranger_audits/data --max-read-block-size 100000 --max-write-block-size 100000

### Run in the background using nohup
nohup /usr/lib/ambari-infra-solr-client/solrCloudCli.sh --zookeeper-connect-string zookeeper_host:2181/infra-solr --jaas-file /etc/ambari-infra-solr/conf/infra_solr_jaas.conf --dump-documents --collection ranger_audits --output /home/solr/backup/ranger/ranger_audits/data --max-read-block-size 100000 --max-write-block-size 100000 > /home/solr/backup/ranger/backup_ranger_audits.log 2>&1 &
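
Once the dump completes, checking the size and file count of the output directory gives a feel for how much data has to be carried over to the post-upgrade cluster; this is a quick sanity check, not part of the Cloudera tooling:

### Check the size and number of files produced by the dump
du -sh /home/solr/backup/ranger/ranger_audits/data
ls /home/solr/backup/ranger/ranger_audits/data | wc -l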

 

 

Note: With this option we take a complete dump of the infra-solr collections, so the old collections and their data can then be removed from infra-solr. A simple restart of Ranger Admin/Atlas will create new, empty collections in infra-solr.
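
If you decide to clear the old data after verifying the dump, a collection can be removed with the Solr Collections API. This is a sketch using the same host/port placeholders as the Preparation section; double-check the collection name from the LIST call before running it:

### Delete a collection after a verified dump (a Ranger Admin/Atlas restart recreates it empty)
curl --negotiate -u: -k 'http://solr_host:solr_port/solr/admin/collections?action=DELETE&name=ranger_audits'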

Step 2:

During the Ambari-managed HDP upgrade steps, we do not need to back up or restore collections.

Step 3:

After migrating to CDP, and once all the services are started, we can trigger the restore script.

  • Make sure the Ranger collection is created before running this script (a quick check is shown below).
  • The restore runs at roughly 1 million records per minute.
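
To confirm the collection exists before starting the upload, the LIST call from the Preparation section can be reused against the CDP-side Solr endpoint (substitute the CDP Solr host and port for the placeholders):

### Confirm ranger_audits exists before restoring
curl --negotiate -u: -k 'http://solr_host:solr_port/solr/admin/collections?action=LIST'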

 

### Get a Kerberos ticket
kinit -kt /etc/security/keytabs/ambari-infra-solr.service.keytab infra-solr/`hostname -f`@REALM

### Restore your collection, e.g. ranger_audits (speed ~1 million records per minute)
/usr/lib/ambari-infra-solr-client/solrCloudCli.sh --zookeeper-connect-string zookeeper_host:2181/solr-infra --jaas-file /etc/ambari-infra-solr/conf/infra_solr_jaas.conf --upload-documents --collection ranger_audits --output /home/solr/backup/ranger/ranger_audits/data --max-read-block-size 100000 --max-write-block-size 100000

### Run in the background using nohup
nohup /usr/lib/ambari-infra-solr-client/solrCloudCli.sh --zookeeper-connect-string zookeeper_host:2181/solr-infra --jaas-file /etc/ambari-infra-solr/conf/infra_solr_jaas.conf --upload-documents --collection ranger_audits --output /home/solr/backup/ranger/ranger_audits/data --max-read-block-size 100000 --max-write-block-size 100000 > /home/solr/backup/ranger/restore_ranger_audits.log 2>&1 &

 

Summary

We have been using:

  • Approach 1 for Development clusters with significantly less audit data in infra-solr
  • Approach 2 for HDP 2.6.5 clusters
  • Approach 3 for HDP 3.1.5 clusters

===========================================================================

We recommend:

  • Approach 1 for clusters with fewer than 10 million records, since it is easy to back up and restore by following the Cloudera upgrade documentation
  • Approach 2 is slow compared to the other approaches (~1 million records in 7 minutes) but useful because:
    • When used in "archive" mode, it cleans up the data it has already backed up
    • You can choose END_DATE and, if required, re-run the script multiple times before the upgrade date
  • Approach 3 for clusters with more than 10 million records, because of its efficient way of dumping the documents and restoring them after upgrading to CDP. The downsides of Approach 3 are:
    • It only works if the ambari-infra-solr client is version 2.7.x or higher
    • If the backup fails, you have to restart the entire script and back up from the start again (i.e. it does not delete from the collections as it goes)

This summary is based on our experience with HDP clusters being migrated to CDP. Please use the development environments in your estate to come up with your own estimates and choose the option that best suits your clusters and SLAs.

Thank you !!
