In this article I'd
like to focus on how to automatically remove
Solr documents from a collection. I ran into a lot of questions and
issues recently that all came back to TTL (time-to-live) of documents, e.g.
nodes running out of space
company policies requiring the removal of old audit logs
Solr provides a new UpdateProcessor called DocExpirationUpdateProcessorFactory, which allows
us to add an expiration date to Solr documents and makes sure expired
documents are removed automatically.
How does it work?
Every document will get an expiration field, which contains the date and time at
which the document will "expire". This field is calculated relative
to the time the document is indexed, using the lifetime configured in a
field called _ttl_ (this is the default
name; you can always rename it if you want). _ttl_ sets the lifetime of
documents as a Solr date math expression, e.g. +10DAYS, +1MONTH, +4HOURS, ...
Solr date math has no week unit, so two weeks is written as +14DAYS.
Current Time is:
_ttl_ is defined as:
This will result in
an expiration value of 2016-10-26 22:14:00
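The calculation described above (index time plus TTL) can be reproduced on the command line. This is only a minimal sketch of the arithmetic, not what Solr runs internally; it assumes GNU date, and the function name expiration_for as well as the sample values are made up for illustration.

```shell
# Sketch: "index time + TTL = expiration", done with GNU date.
# Assumption: GNU coreutils date (the -d option does not exist on BSD/macOS).

# expiration_for <index time in ISO 8601 UTC> <TTL in days>
expiration_for() {
  indexed_at="$1"
  ttl_days="$2"
  # Convert the index time to epoch seconds, add the TTL, convert back.
  epoch=$(date -u -d "$indexed_at" +%s)
  date -u -d "@$((epoch + ttl_days * 86400))" +%Y-%m-%dT%H:%M:%SZ
}

# Hypothetical values: indexed on Jan 1st 2016 at noon, _ttl_ of +10DAYS
expiration_for "2016-01-01T12:00:00Z" 10
# prints: 2016-01-11T12:00:00Z
```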
Expired documents are removed with a deleteByQuery, which is
triggered as often as configured with autoDeletePeriodSeconds. For example, a value of
86400 makes a background thread execute a deleteByQuery on a
daily basis (86400 seconds = 1 day). Each query deletes all documents whose expiration timestamp is in the past (relative to NOW, that is, the time the thread was started).
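To get a feeling for what such a delete looks like, here is a rough manual equivalent. It is a sketch, not the exact query Solr builds internally. It assumes the expiration field is called _expire_at_ (use whatever name you configured), a node on localhost:8983 and the films collection; the helper name expired_delete_payload is made up.

```shell
# Sketch: build the JSON body for a delete-by-query on expired documents.
# Assumption: the expiration field is named _expire_at_.
expired_delete_payload() {
  field="${1:-_expire_at_}"
  # Everything with an expiration timestamp up to NOW is considered expired.
  printf '{"delete":{"query":"%s:[* TO NOW]"}}\n' "$field"
}

expired_delete_payload
# prints: {"delete":{"query":"_expire_at_:[* TO NOW]"}}

# To send it to a live node by hand (host and collection are assumptions):
# curl -s -H 'Content-Type: application/json' \
#   "http://localhost:8983/solr/films/update?softCommit=true" \
#   -d "$(expired_delete_payload)"
```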
If you want
to customize the delete procedure, you can use autoDeleteChainName to configure
your own updateRequestProcessorChain, which is used for all the deletes.
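Put together, the settings mentioned so far could look roughly like the fragment below. The snippet only writes an example chain to a scratch file so you can review it before merging it into your solrconfig.xml. The chain names ttl-chain and ttl-delete-chain and the field name _expire_at_ are assumptions for illustration.

```shell
# Sketch: write an example update chain to a scratch file for review.
# Assumptions: chain name "ttl-chain", expiration field "_expire_at_".
fragment="$(mktemp)"
cat > "$fragment" <<'EOF'
<updateRequestProcessorChain name="ttl-chain" default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <!-- run the background delete once a day -->
    <int name="autoDeletePeriodSeconds">86400</int>
    <str name="ttlFieldName">_ttl_</str>
    <str name="expirationFieldName">_expire_at_</str>
    <!-- optional: send the deletes through your own chain -->
    <!-- <str name="autoDeleteChainName">ttl-delete-chain</str> -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
EOF

# Show the result
cat "$fragment"
```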
Once the removal process is
finished, a soft commit is triggered and the documents won't appear in any
search results.
In the next section
I show some use cases and examples that I have seen more often recently.
First, let's look at
a general example that uses TTL. For this we are going to use the films collection (we
have used it in other articles as well, so I won't go into much detail). Movies will be stored in a Solr collection, but we don't want to keep them
for more than 10 days!
Go to the first node of your Solr Cloud (doesn't really
matter which node, but we need the zkcli client)
Create the initial
Solr collection configuration by using the basic_configs configset,
which is part of every Solr installation.
Solr Audits (Ambari Infra or custom Solr Cloud)
Ranger Audits can be
stored in a custom SolrCloud or the one that is provided by Ambari Infra.
Ambari Infra is a
new service that includes its own Solr instances, e.g. to store Ranger audits or Atlas details.
Since HDP 2.5 we have officially moved away from storing audits in a DB and moved to Solr. Solr (as well as DB) is only short-term storage when it comes to Ranger
audits; basically, it's only used for the audit information displayed in the
Ranger Admin UI. Long-term archival of audits should be stored in HDFS or
By default, the
Ranger Solr Audit Collection comes with a pre-configured TTL, so all the Ranger
Audits in Solr will be deleted after 90 days out of the box.
What happens if you
only want to store audit logs for 30 days or one week? Take a look at the next
sections.
Installation; Solr Audits = disabled
If you haven't used
Solr Audits before and haven't enabled Ranger Audits to Solr via Ambari yet, it will
be easy to adjust the TTL configuration. Go to your Ranger Admin node and execute the
following command.
This will reduce the
time we keep audits in Solr to 30 days:
sed -i 's/+90DAYS/+30DAYS/g' /usr/hdp/2.5.0.0-1245/ranger-admin/contrib/solr_for_audit_setup/conf/solrconfig.xml
Afterwards, you can
go to Ambari and enable Ranger Solr Audits. The collection that is going to be
created will use the new setting.
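If you'd like to see what the sed substitution does before touching the real file, you can try it on a scratch file first. The one-line file below is just a stand-in I made up, not the actual Ranger solrconfig.xml, and the -i flag is used the way GNU sed expects it.

```shell
# Sketch: test the +90DAYS -> +30DAYS substitution on a scratch file.
# The file content is a hypothetical stand-in for the TTL default value.
scratch="$(mktemp)"
printf '<str name="value">+90DAYS</str>\n' > "$scratch"

# Same expression as used on the Ranger config (GNU sed, in-place edit)
sed -i 's/+90DAYS/+30DAYS/g' "$scratch"

cat "$scratch"
# prints: <str name="value">+30DAYS</str>
```

Once the output looks right, run the same expression against the real solrconfig.xml.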
In case you have
already enabled the Ranger Audit logs to Solr and have already collected plenty of
documents in that Solr collection, you can adjust the TTL with the following steps. However, it's
important to keep in mind that this does not affect old documents, only new
ones.
Go to one of the
Ambari Infra nodes that hosts a Solr Instance (again, any node with the zkcli