
In this article I'd like to focus on how to automatically remove Solr documents from a collection. I ran into a lot of questions and issues recently that all came back to TTL (time-to-live) of documents, e.g.

  • nodes running out of space
  • company policies that require removing old audit logs
  • auto-purging etc.

SOLR-5795 introduced a new UpdateProcessor called DocExpirationUpdateProcessorFactory, which allows us to add an expiration date to Solr documents and makes sure expired documents are removed automatically.

How does it work?

Every indexed document gets an expiration field, which contains the date and time at which the document will "expire". This value is calculated relative to the time the document is indexed, using the lifetime provided by a field called _ttl_ (this is the default name; you can always rename it if you want). _ttl_ sets the lifetime of documents as a Solr date math expression, e.g. +10DAYS, +6MONTHS, +4HOURS, ...

For example:

Current Time is: 2016-10-26 20:14:00

_ttl_ is defined as: +2HOURS

This will result in an expiration value of 2016-10-26 22:14:00
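To double-check a TTL before rolling it out, the same arithmetic can be reproduced on the command line. This is only a rough sketch: expire_at_for is a hypothetical helper (not part of Solr), it assumes GNU date, and it only handles a single term with a fixed-length unit (SECONDS, MINUTES, HOURS, DAYS).

```shell
# Hypothetical helper: adds a simple Solr-style TTL to an index time (UTC).
# Runs in a subshell "( ... )" so its variables do not leak into the caller.
expire_at_for() (
  index_time="$1"                                # e.g. "2016-10-26 20:14:00"
  ttl="$2"                                       # e.g. "+2HOURS"
  amount="$(printf '%s' "$ttl" | tr -cd '0-9')"  # digits only  -> 2
  unit="$(printf '%s' "$ttl" | tr -cd 'A-Z')"    # letters only -> HOURS
  case "$unit" in
    SECOND|SECONDS) secs=1 ;;
    MINUTE|MINUTES) secs=60 ;;
    HOUR|HOURS)     secs=3600 ;;
    DAY|DAYS)       secs=86400 ;;
    *) echo "unsupported unit: $unit" >&2; return 1 ;;
  esac
  # Convert to epoch seconds, add the TTL, and format the result again (GNU date).
  epoch="$(date -u -d "$index_time" +%s)"
  date -u -d "@$((epoch + amount * secs))" '+%Y-%m-%d %H:%M:%S'
)

expire_at_for "2016-10-26 20:14:00" "+2HOURS"   # prints 2016-10-26 22:14:00
```

Solr itself evaluates the full date math syntax (including MONTHS and YEARS), so the helper is only meant for quick sanity checks.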

The deleteByQuery is triggered as often as configured with autoDeletePeriodSeconds, e.g. a value of 86400 starts a background thread that executes a deleteByQuery once a day (86400 seconds = 1 day). This query deletes all documents whose expiration timestamp is in the past (relative to NOW, i.e. the time the thread was started).
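If you don't want to wait for the background thread, roughly the same purge can be triggered by hand. The sketch below assumes the expiration field is called expire_at (as in the films example later in this article); expired_delete_body is a hypothetical helper, and the query is my approximation of what the processor sends, not a copy of its internals.

```shell
# Hypothetical helper: builds a delete-by-query body for everything that
# expired before NOW in the given expiration field.
expired_delete_body() {
  printf '<delete><query>%s:[* TO NOW]</query></delete>' "$1"
}

SOLR_URL="http://horton0.example.com:8983/solr/films"   # host and collection used in this article
BODY="$(expired_delete_body expire_at)"
echo "$BODY"

# Uncomment to run the purge manually:
# curl --negotiate -u : "$SOLR_URL/update?commit=true" \
#   -H "Content-Type: text/xml" --data-binary "$BODY"
```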

If you want to customize the delete procedure, you can use autoDeleteChainName to configure your own updateRequestProcessorChain, which is used for all the deletes.

Once the removal process is finished, a soft commit is triggered and the documents won't appear in any search anymore.

In the next section I show some use cases and examples that I have seen more often recently.

Solr in General

First, let's look at a general example that uses TTL. For this we are going to use the films collection (we have used it in other articles as well, so I won't go into much detail). Movies will be stored in a Solr collection, but we don't want to keep them for more than 10 days!

Go to the first node of your Solr Cloud (doesn't really matter which node, but we need the zkcli client)

Create the initial Solr collection configuration by copying the basic_configs configset, which is part of every Solr installation.

mkdir /opt/lucidworks-hdpsearch/solr_collections
mkdir /opt/lucidworks-hdpsearch/solr_collections/films 
chown -R solr:solr /opt/lucidworks-hdpsearch/solr_collections 
cp -R /opt/lucidworks-hdpsearch/solr/server/solr/configsets/basic_configs/conf /opt/lucidworks-hdpsearch/solr_collections/films

Adjust schema.xml (/opt/lucidworks-hdpsearch/solr_collections/films/conf)

Add the following field definitions in the schema.xml file (There are already some base field definitions, simply copy-and-paste the following 4 lines somewhere nearby).

<field name="directed_by" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="initial_release_date" type="string" indexed="true" stored="true"/>
<field name="genre" type="string" indexed="true" stored="true" multiValued="true"/>

We also add the following fields for the auto-purging:

<field name="_ttl_" type="string" indexed="true" multiValued="false" stored="true" />
<field name="expire_at" type="date" multiValued="false" indexed="true" stored="true" />

_ttl_ = Amount of time this document should be kept (e.g. +10DAYS)

expire_at = The calculated expiration date (INDEX_TIME + _ttl_)

Adjust solrconfig.xml

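In order for the expiration date to be calculated, we have to add the new DocExpirationUpdateProcessorFactory to the update chain. The DefaultValueUpdateProcessorFactory in front of it sets a default _ttl_ for every document that does not bring its own: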

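<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
	<!-- Default lifetime for documents that do not set _ttl_ themselves -->
	<processor class="solr.DefaultValueUpdateProcessorFactory">
		<str name="fieldName">_ttl_</str>
		<str name="value">+10DAYS</str>
	</processor>
	<!-- Computes expire_at from _ttl_ and purges expired documents every 300 seconds -->
	<processor class="solr.processor.DocExpirationUpdateProcessorFactory">
		<int name="autoDeletePeriodSeconds">300</int>
		<str name="ttlFieldName">_ttl_</str>
		<str name="expirationFieldName">expire_at</str>
	</processor>
	<processor class="solr.LogUpdateProcessorFactory" />
	<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>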

Also make sure the processor chain is triggered with every update:

<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell">
	<lst name="defaults">
		<str name="df">text</str>
		<str name="update.chain">add-unknown-fields-to-the-schema</str>
	</lst>
</initParams>

Upload the config, create the collection and index the sample data

/opt/lucidworks-hdpsearch/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost horton0.example.com:2181/solr -cmd upconfig -confname films -confdir /opt/lucidworks-hdpsearch/solr_collections/films/conf

curl --negotiate -u : "http://horton0.example.com:8983/solr/admin/collections?action=CREATE&name=films&numShards=1"

curl --negotiate -u : 'http://horton0.example.com:8983/solr/films/update/json?commit=true' --data-binary @/opt/lucidworks-hdpsearch/solr/example/films/films.json -H 'Content-type:application/json'

Select a single document from the Films-Collection

curl --negotiate -u : "http://horton0.example.com:8983/solr/films/select?q=*&start=0&rows=1&wt=json"

Result:

{
   "id":"/en/45_2006",
   "directed_by":[
      "Gary Lennon"
   ],
   "initial_release_date":"2006-11-30",
   "genre":[
      "Black comedy",
      "Thriller",
      "Psychological thriller",
      "Indie film",
      "Action Film",
      "Crime Thriller",
      "Crime Fiction",
      "Drama"
   ],
   "name":".45",
   "_ttl_":"+10DAYS",
   "expire_at":"2016-11-06T05:46:46.565Z",
   "_version_":1549320539674247200
}
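The +10DAYS above comes from the configured default. Since the default value is only applied when a document has no _ttl_ of its own, a single document can also carry a different lifetime. A small sketch of that: film_doc_with_ttl is a hypothetical helper, and the id and name values are made up for illustration.

```shell
# Hypothetical helper: builds a one-document JSON payload with its own _ttl_.
film_doc_with_ttl() {
  # $1 = id, $2 = name, $3 = TTL as Solr date math (e.g. +2DAYS)
  printf '[{"id":"%s","name":"%s","_ttl_":"%s"}]' "$1" "$2" "$3"
}

DOC="$(film_doc_with_ttl "/en/example_film" "Example Film" "+2DAYS")"
echo "$DOC"

# Uncomment to index the document; it should then expire after 2 days
# instead of the default lifetime:
# curl --negotiate -u : 'http://horton0.example.com:8983/solr/films/update/json?commit=true' \
#   -H 'Content-type:application/json' --data-binary "$DOC"
```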

Ranger Solr Audits (Ambari Infra or custom Solr Cloud)

Ranger Audits can be stored in a custom SolrCloud or the one that is provided by Ambari Infra.

Ambari Infra is a new service that includes its own Solr instances, e.g. to store Ranger audits or Atlas details. Since HDP 2.5 we have officially moved away from audits to DB and switched to Solr. Solr (just like the DB before) is only short-term storage when it comes to Ranger audits; basically it's only used for the audit information displayed in the Ranger Admin UI. Audits that need long-term archival should be stored in HDFS or something similar.

By default, the Ranger Solr Audit Collection comes with a pre-configured TTL, so all the Ranger Audits in Solr will be deleted after 90 days out of the box.

What happens if you only want to store audit logs for 30 days or one week? Take a look at the next paragraphs :)

New Installation; Solr Audits = disabled

If you haven't used Solr Audits before and haven't enabled Ranger Audits to Solr via Ambari yet, it is easy to adjust the TTL configuration. Go to your Ranger Admin node and execute the following command, which reduces the time we keep audits in Solr to 30 days:

sed -i 's/+90DAYS/+30DAYS/g' /usr/hdp/2.5.0.0-1245/ranger-admin/contrib/solr_for_audit_setup/conf/solrconfig.xml

Afterwards, you can go to Ambari and enable Ranger Solr Audits. The collection that is going to be created will use the new setting.
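If you prefer to see what sed is going to change before running it with -i, the substitution can be previewed first. This is just a sketch: preview_ttl_change is a hypothetical helper, and the sample line in the scratch file is my assumption of how the default _ttl_ appears in solrconfig.xml.

```shell
# Hypothetical helper: prints only the lines sed would rewrite, without
# modifying the file.
preview_ttl_change() {
  # $1 = file, $2 = old TTL (e.g. +90DAYS), $3 = new TTL (e.g. +30DAYS)
  sed -n "s/$2/$3/gp" "$1"
}

# Demo on a scratch file instead of the real solrconfig.xml.
tmp="$(mktemp)"
printf '<str name="fieldName">_ttl_</str>\n<str name="value">+90DAYS</str>\n' > "$tmp"
preview_ttl_change "$tmp" "+90DAYS" "+30DAYS"
rm -f "$tmp"
```

Pointing the helper at the real solrconfig.xml shows every line that the sed -i command would touch.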

Sample Audit Log

{
   "id":"5519e650-440b-4c14-ace5-c1b79ee9f3d5-47734",
   "access":"READ_EXECUTE",
   "enforcer":"hadoop-acl",
   "repo":"bigdata_hadoop",
   "reqUser":"mapred",
   "resource":"/mr-history/tmp",
   "cliIP":"127.0.0.1",
   "logType":"RangerAudit",
   "result":1,
   "policy":-1,
   "repoType":1,
   "resType":"path",
   "reason":"/mr-history/tmp",
   "action":"read",
   "evtTime":"2016-10-26T05:14:21.686Z",
   "seq_num":71556,
   "event_count":1,
   "event_dur_ms":0,
   "_ttl_":"+30DAYS",
   "_expire_at_":"2016-11-25T05:14:23.107Z",
   "_version_":1549227904852820000
}

As you can see, the new _ttl_ is +30DAYS and _expire_at_ lies 30 days after the document was indexed.

Old Installation; Solr Audits = enabled

In case you have already enabled Ranger audit logs to Solr and have already collected plenty of documents in that Solr collection, you can adjust the TTL with the following steps. However, it's important to keep in mind that this does not affect old documents, only new ones.

Go to one of the Ambari Infra nodes that hosts a Solr Instance (again, any node with the zkcli client)

Download the solrconfig.xml from ZooKeeper

/usr/lib/ambari-infra-solr/server/scripts/cloud-scripts/zkcli.sh --zkhost horton0.example.com:2181 -cmd getfile /infra-solr/configs/ranger_audits/solrconfig.xml solrconfig.xml

Edit the file or use sed to replace the 90 Days in the solrconfig.xml

sed -i 's/+90DAYS/+14DAYS/g' solrconfig.xml

Upload the config back to ZooKeeper

/usr/lib/ambari-infra-solr/server/scripts/cloud-scripts/zkcli.sh --zkhost horton0.example.com:2181 -cmd putfile /infra-solr/configs/ranger_audits/solrconfig.xml solrconfig.xml

Reload the config

curl -v --negotiate -u : "http://horton0.example.com:8983/solr/admin/collections?action=RELOAD&name=ranger_audits"

Sample Audit Log

{
   "id":"5519e650-440b-4c14-ace5-c1b79ee9f3d5-47742",
   "access":"READ_EXECUTE",
   "enforcer":"hadoop-acl",
   "repo":"bigdata_hadoop",
   "reqUser":"mapred",
   "resource":"/mr-history/tmp",
   "cliIP":"127.0.0.1",
   "logType":"RangerAudit",
   "result":1,
   "policy":-1,
   "repoType":1,
   "resType":"path",
   "reason":"/mr-history/tmp",
   "action":"read",
   "evtTime":"2016-10-26T05:16:21.674Z",
   "seq_num":71568,
   "event_count":1,
   "event_dur_ms":0,
   "_ttl_":"+14DAYS",
   "_expire_at_":"2016-11-09T05:16:23.118Z",
   "_version_":1549228030682988500
}
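Documents that were indexed before the change keep their original expiration date. If they have to go earlier, one option is a manual delete-by-query on the event time. This is only a sketch under assumptions: old_audits_delete_body is a hypothetical helper, and it assumes evtTime (see the sample above) is indexed as a date field in your ranger_audits schema.

```shell
# Hypothetical helper: builds a delete-by-query body for audits whose
# evtTime is older than the given number of days.
old_audits_delete_body() {
  # $1 = number of days to keep, e.g. 14
  printf '<delete><query>evtTime:[* TO NOW-%sDAYS]</query></delete>' "$1"
}

BODY="$(old_audits_delete_body 14)"
echo "$BODY"

# Uncomment to actually delete the old audits (this cannot be undone):
# curl -v --negotiate -u : "http://horton0.example.com:8983/solr/ranger_audits/update?commit=true" \
#   -H "Content-Type: text/xml" --data-binary "$BODY"
```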

The details and examples above are really the major pieces when it comes to TTL in Solr :)

Remove all documents from a Collection

In case you want to remove all documents from a Solr Collection, the following command might be helpful:

via Curl

curl -v --negotiate -u : "http://horton0.example.com:8983/solr/films/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"

via Browser

Alternatively, open the following URL in your browser

http://horton0.example.com:8983/solr/films/update?commit=true&stream.body=<delete><query>*:*</query>...;

Useful Links

https://lucene.apache.org/solr/5_3_0/solr-core/org/apache/solr/update/processor/DocExpirationUpdateP...

https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors

Looking forward to your feedback

Jonas

Comments

@Jonas Straub

The solrconfig.xml portion you provide doesn't work when I try to create a collection using the data driven schema configs. The problem seems to be the <processor> entries. I get null pointer exceptions.

The first processor entry should be <processor class="solr.DefaultValueUpdateProcessorFactory">. The second processor entry should be <processor class="solr.processor.DocExpirationUpdateProcessorFactory">.

Cloudera Employee

Thank you for writing this up @Jonas Straub and thanks for the extra details @Michael Young.

In case it helps anyone else, here is the entire "add-unknown-fields-to-the-schema" section. Not only did I have to apply the classes as @Michael Young has pointed out, but I also had to ensure that both these new processors were added at the top of the updateRequestProcessorChain.

This was using a stand-alone SolrCloud instance that was installed as part of Ambari 2.2.2 with HDP 2.4.2.

For that install, I found that I had to re-run the SolrCloud install steps per the HDP Ranger Audits to SolrCloud documentation, consisting of:

  1. Running add_ranger_audits_conf_to_zk.sh
  2. Starting solr (start_solr.sh)
  3. Running create_ranger_audits_collection.sh

  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
    <processor class="solr.DefaultValueUpdateProcessorFactory">
      <str name="fieldName">_ttl_</str>
      <str name="value">+14DAYS</str>
    </processor>
    <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
      <int name="autoDeletePeriodSeconds">300</int>
      <str name="ttlFieldName">_ttl_</str>
      <str name="expirationFieldName">expire_at</str>
    </processor>
    <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
    <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
    <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
    <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
    <processor class="solr.ParseDateFieldUpdateProcessorFactory">
      <arr name="format">
        <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
        <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
        <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
        <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
        <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
        <str>yyyy-MM-dd'T'HH:mm:ss</str>
        <str>yyyy-MM-dd'T'HH:mmZ</str>
        <str>yyyy-MM-dd'T'HH:mm</str>
        <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
        <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
        <str>yyyy-MM-dd HH:mm:ss.SSS</str>
        <str>yyyy-MM-dd HH:mm:ss,SSS</str>
        <str>yyyy-MM-dd HH:mm:ssZ</str>
        <str>yyyy-MM-dd HH:mm:ss</str>
        <str>yyyy-MM-dd HH:mmZ</str>
        <str>yyyy-MM-dd HH:mm</str>
        <str>yyyy-MM-dd</str>
      </arr>
    </processor>
    <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
      <str name="defaultFieldType">text_general</str>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Boolean</str>
        <str name="fieldType">booleans</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.util.Date</str>
        <str name="fieldType">tdates</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Long</str>
        <str name="valueClass">java.lang.Integer</str>
        <str name="fieldType">tlongs</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Number</str>
        <str name="fieldType">tdoubles</str>
      </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

New Contributor

Thank you so much!!

Contributor

When using Ambari-managed Infra Solr, we can also change the TTL value from the Ambari web UI, which is an easy solution.

## Change Retention/TTL value of the ranger_audits collection in Ambari UI
Ambari UI->Ranger->configs->advanced->advanced ranger solr configuration ->Max Retention Days

## Save & Restart required services

Last update: 10-27-2016 06:22 PM (revision 1 of 1)