Member since: 09-15-2015

Posts: 457
Kudos Received: 507
Solutions: 90

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 16919 | 11-01-2016 08:16 AM |
| | 12560 | 11-01-2016 07:45 AM |
| | 11693 | 10-25-2016 09:50 AM |
| | 2490 | 10-21-2016 03:50 AM |
| | 5256 | 10-14-2016 03:12 PM |
10-27-2016 06:22 PM
14 Kudos
In this article I'd like to focus on how to automatically remove Solr documents from a collection. I ran into a lot of questions and issues recently that all came back to TTL (time-to-live) of documents, e.g.:

- nodes running out of space
- company policies requiring the removal of old audit logs
- auto-purging
- etc.

SOLR-5795 introduced a new UpdateProcessor called DocExpirationUpdateProcessorFactory, which allows us to add an expiration date to Solr documents and makes sure expired documents are removed automatically.
How does it work?

Every indexed document gets an expiration field, which contains the date and time at which the document will "expire". This field is calculated relative to the time the document is indexed, using the configuration provided by a field called _ttl_ (this is the default name, you can always rename it if you want). _ttl_ sets the lifetime of documents, e.g. +10DAYS, +2WEEKS, +4HOURS, ...

For example:

Current time: 2016-10-26 20:14:00
_ttl_ is defined as: +2HOURS
This results in an expiration value of: 2016-10-26 22:14:00
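To make this concrete, indexing a document with an explicit _ttl_ could look roughly like this (a sketch using the films collection and host from the example later in this article; the document itself is made up):

curl --negotiate -u : 'http://horton0.example.com:8983/solr/films/update/json?commit=true' -H 'Content-type:application/json' --data-binary '[{"id":"ttl-demo-1","name":"TTL demo movie","_ttl_":"+2HOURS"}]'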
The deleteByQuery is triggered as often as configured via autoDeletePeriodSeconds, e.g. a value of 86400 triggers a background thread that executes deleteByQueries on a daily basis (86400 seconds = 1 day). These queries delete all documents whose expiration timestamp is in the past (relative to NOW, i.e. the time the thread run starts).
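For reference, the periodic cleanup is roughly equivalent to running a deleteByQuery against the expiration field yourself (a sketch, assuming the expire_at field and films collection used below):

curl --negotiate -u : "http://horton0.example.com:8983/solr/films/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>expire_at:[* TO NOW]</query></delete>"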
If you want to customize the delete procedure, you can use autoDeleteChainName to configure your own updateRequestProcessorChain, which is then used for all the deletes.

Once the removal process is finished, a soft commit is triggered and the documents won't appear in any search anymore.
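A minimal sketch of what such a custom delete chain could look like (the chain name "expire-deletes" is made up; the parameter names come from DocExpirationUpdateProcessorFactory):

<processor class="solr.DocExpirationUpdateProcessorFactory">
	<int name="autoDeletePeriodSeconds">86400</int>
	<str name="ttlFieldName">_ttl_</str>
	<str name="expirationFieldName">expire_at</str>
	<str name="autoDeleteChainName">expire-deletes</str>
</processor>

<updateRequestProcessorChain name="expire-deletes">
	<processor class="solr.LogUpdateProcessorFactory" />
	<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>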
In the next sections I show some use cases and examples that I have seen more often recently.

Solr in General
First, let's look at a general example that uses TTL. For this we are going to use the films collection (we have used that in other articles as well, so I won't go into much detail). Movies will be stored in a Solr collection, however we don't want to keep movies for more than 10 days!

Go to the first node of your Solr Cloud (it doesn't really matter which node, but we need the zkcli client).

Create the initial Solr collection configuration by using the basic_configs set, which is part of every Solr installation:

mkdir /opt/lucidworks-hdpsearch/solr_collections
mkdir /opt/lucidworks-hdpsearch/solr_collections/films
chown -R solr:solr /opt/lucidworks-hdpsearch/solr_collections
cp -R /opt/lucidworks-hdpsearch/solr/server/solr/configsets/basic_configs/conf /opt/lucidworks-hdpsearch/solr_collections/films

Adjust schema.xml (/opt/lucidworks-hdpsearch/solr_collections/films/conf)
Add the following field definitions in the schema.xml file (there are already some base field definitions, simply copy-and-paste the following 4 lines nearby):

<field name="directed_by" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="initial_release_date" type="string" indexed="true" stored="true"/>
<field name="genre" type="string" indexed="true" stored="true" multiValued="true"/>

We also add the following fields for the auto-purging:

<field name="_ttl_" type="string" indexed="true" multiValued="false" stored="true" />
<field name="expire_at" type="date" multiValued="false" indexed="true" stored="true" />

_ttl_ = the amount of time this document should be kept (e.g. +10DAYS)
expire_at = the calculated expiration date (index time + _ttl_)
Adjust solrconfig.xml

In order for the expiration date to be calculated, we have to add the new DocExpirationUpdateProcessorFactory to the update request processor chain:

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
	<processor class="solr.DefaultValueUpdateProcessorFactory">
		<str name="fieldName">_ttl_</str>
		<str name="value">+30DAYS</str>
	</processor>
	<processor class="solr.DocExpirationUpdateProcessorFactory">
		<int name="autoDeletePeriodSeconds">300</int>
		<str name="ttlFieldName">_ttl_</str>
		<str name="expirationFieldName">expire_at</str>
	</processor>
	<processor class="solr.LogUpdateProcessorFactory" />
	<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Also make sure the processor chain is triggered with every update:

<initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell">
	<lst name="defaults">
		<str name="df">text</str>
		<str name="update.chain">add-unknown-fields-to-the-schema</str>
	</lst>
</initParams>

Upload the config, create the collection and index the sample data:

/opt/lucidworks-hdpsearch/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost horton0.example.com:2181/solr -cmd upconfig -confname films -confdir /opt/lucidworks-hdpsearch/solr_collections/films/conf
curl --negotiate -u : "http://horton0.example.com:8983/solr/admin/collections?action=CREATE&name=films&numShards=1"
curl --negotiate -u : 'http://horton0.example.com:8983/solr/films/update/json?commit=true' --data-binary @/opt/lucidworks-hdpsearch/solr/example/films/films.json -H 'Content-type:application/json'

Select a single document from the films collection:

curl --negotiate -u : "http://horton0.example.com:8983/solr/films/select?q=*&start=0&rows=1&wt=json"

Result:

{
   "id":"/en/45_2006",
   "directed_by":[
      "Gary Lennon"
   ],
   "initial_release_date":"2006-11-30",
   "genre":[
      "Black comedy",
      "Thriller",
      "Psychological thriller",
      "Indie film",
      "Action Film",
      "Crime Thriller",
      "Crime Fiction",
      "Drama"
   ],
   "name":".45",
   "_ttl_":"+10DAYS",
   "expire_at":"2016-11-06T05:46:46.565Z",
   "_version_":1549320539674247200
}

Ranger Solr Audits (Ambari Infra or custom Solr Cloud)
Ranger audits can be stored in a custom SolrCloud or in the one that is provided by Ambari Infra.

Ambari Infra is a new service that includes its own Solr instances, e.g. to store Ranger audits or Atlas details. Since HDP 2.5 we have officially moved away from "Audit to DB" and moved to Solr. Solr (as well as DB) is only short-term storage when it comes to Ranger audits; basically it's only used for the audit information displayed in the Ranger Admin UI. Long-term archival of audits should go to HDFS or something similar.
By default, the Ranger Solr audit collection comes with a pre-configured TTL, so all Ranger audits in Solr will be deleted after 90 days out of the box.
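If you want to see where that default comes from, you can grep the solrconfig.xml template that ships with Ranger (a quick check; the path matches the sed command further below and depends on your HDP version):

grep -n "90DAYS" /usr/hdp/2.5.0.0-1245/ranger-admin/contrib/solr_for_audit_setup/conf/solrconfig.xml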
What happens if you only want to store audit logs for 30 days or one week? Take a look at the next paragraphs 🙂

New Installation; Solr Audits = disabled
If you haven't used Solr audits before and haven't enabled Ranger audits to Solr via Ambari yet, it is easy to adjust the TTL configuration. Go to your Ranger Admin node and execute the following command, which reduces the time we keep audits in Solr to 30 days:

sed -i 's/+90DAYS/+30DAYS/g' /usr/hdp/2.5.0.0-1245/ranger-admin/contrib/solr_for_audit_setup/conf/solrconfig.xml

Afterwards, you can go to Ambari and enable Ranger Solr audits; the collection that is going to be created will use the new setting.
Sample audit log:

{
   "id":"5519e650-440b-4c14-ace5-c1b79ee9f3d5-47734",
   "access":"READ_EXECUTE",
   "enforcer":"hadoop-acl",
   "repo":"bigdata_hadoop",
   "reqUser":"mapred",
   "resource":"/mr-history/tmp",
   "cliIP":"127.0.0.1",
   "logType":"RangerAudit",
   "result":1,
   "policy":-1,
   "repoType":1,
   "resType":"path",
   "reason":"/mr-history/tmp",
   "action":"read",
   "evtTime":"2016-10-26T05:14:21.686Z",
   "seq_num":71556,
   "event_count":1,
   "event_dur_ms":0,
   "_ttl_":"+30DAYS",
   "_expire_at_":"2016-11-25T05:14:23.107Z",
   "_version_":1549227904852820000
}

As you can see, the new _ttl_ is +30DAYS.

Old Installation; Solr Audits = enabled
In case you have already enabled Ranger audit logs to Solr and have already collected plenty of documents in that Solr collection, you can adjust the TTL with the following steps. However, it's important to keep in mind that this does not affect old documents, only new ones.

Go to one of the Ambari Infra nodes that hosts a Solr instance (again, any node with the zkcli client).
Download the solrconfig.xml from ZooKeeper:

/usr/lib/ambari-infra-solr/server/scripts/cloud-scripts/zkcli.sh --zkhost horton0.example.com:2181 -cmd getfile /infra-solr/configs/ranger_audits/solrconfig.xml solrconfig.xml

Edit the file or use sed to replace the 90 days in the solrconfig.xml:

sed -i 's/+90DAYS/+14DAYS/g' solrconfig.xml

Upload the config back to ZooKeeper:

/usr/lib/ambari-infra-solr/server/scripts/cloud-scripts/zkcli.sh --zkhost horton0.example.com:2181 -cmd putfile /infra-solr/configs/ranger_audits/solrconfig.xml solrconfig.xml

Reload the config:

curl -v --negotiate -u : "http://horton0.example.com:8983/solr/admin/collections?action=RELOAD&name=ranger_audits"
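To verify that newly indexed audit events pick up the new TTL, you can query the most recent document and check its _ttl_ field (a quick sanity check; adjust host and authentication to your environment):

curl --negotiate -u : "http://horton0.example.com:8983/solr/ranger_audits/select?q=*:*&sort=evtTime+desc&rows=1&wt=json"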
Sample audit log:

{
   "id":"5519e650-440b-4c14-ace5-c1b79ee9f3d5-47742",
   "access":"READ_EXECUTE",
   "enforcer":"hadoop-acl",
   "repo":"bigdata_hadoop",
   "reqUser":"mapred",
   "resource":"/mr-history/tmp",
   "cliIP":"127.0.0.1",
   "logType":"RangerAudit",
   "result":1,
   "policy":-1,
   "repoType":1,
   "resType":"path",
   "reason":"/mr-history/tmp",
   "action":"read",
   "evtTime":"2016-10-26T05:16:21.674Z",
   "seq_num":71568,
   "event_count":1,
   "event_dur_ms":0,
   "_ttl_":"+14DAYS",
   "_expire_at_":"2016-11-09T05:16:23.118Z",
   "_version_":1549228030682988500
}

The above details and examples are really the major pieces when it comes to TTL in Solr 🙂

Remove all documents from a Collection

In case you want to remove all documents from a Solr collection, the following commands might be helpful.

Via curl:

curl -v --negotiate -u : "http://horton0.example.com:8983/solr/films/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"

Via browser:

Alternatively, open the following URL in your browser:

http://horton0.example.com:8983/solr/films/update?commit=true&stream.body=<delete><query>*:*</query></delete>

Useful Links

https://lucene.apache.org/solr/5_3_0/solr-core/org/apache/solr/update/processor/DocExpirationUpdateProcessorFactory.html
https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors

Looking forward to your feedback,
Jonas
10-26-2016 03:40 AM
I assume your Solr instance is running under the solr user? If yes, make sure all the Ranger files and the "classes" directory are owned by that user. Does that Solr home directory, "/opt/solr_8001/data", exist? Also, is it owned by the user that is running the Solr instances?
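A rough sketch of the checks I mean, assuming the instance runs as the solr user and lives under /opt/solr_8001 (the webapp path is an assumption based on a default Solr layout, adjust to your install):

# does the Solr home directory exist, and who owns it?
ls -ld /opt/solr_8001/data
# check ownership of the Ranger plugin files inside the webapp
ls -al /opt/solr_8001/server/solr-webapp/webapp/WEB-INF/classes
# if needed, hand everything over to the solr user
chown -R solr:solr /opt/solr_8001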
						
					
10-25-2016 10:20 AM
Sorry, I probably should have been more explicit: the Ranger plugin script will copy all jars and XMLs to the locations I mentioned above, you don't have to copy anything on your own. Can you run an "ls -al" on the two directories and post the result? Also, can you upload the Ranger XML files inside the "classes" directory? What does your solr.in.sh look like?
10-25-2016 09:50 AM
Did you enable the Ranger Solr plugin using the enable-ranger-plugin.sh script? What version of Solr and Ranger is this? You might want to enable the Ranger plugin again and make sure that all Ranger jars/XMLs have been copied to .../solr/server/solr-webapp/webapp/WEB-INF/classes and .../solr/server/solr-webapp/webapp/WEB-INF/libs (validate the paths, I'm not sure if they are 100% correct).
10-21-2016 03:50 AM
4 Kudos
							  @Hajime Basically, Ambari Infra is only a wrapper, a service like HDFS or YARN, which deploys/manages different components. At the moment, the only component is Solr, which is open source.  As Constantin pointed out, the Ambari Infra Stack can be found here: https://github.com/apache/ambari/tree/2ad42074f1633c5c6f56cf979bdaa49440457566/ambari-server/src/main/resources/common-services/AMBARI_INFRA/0.1.0 
						
					
10-17-2016 06:05 PM
The error messages are definitely related. ZooKeeper is not able to authenticate itself and it looks like it's shutting down because of that. Go to Ambari -> HDFS service and filter/search for "Kerberos", "Principal" and "Keytab", then do the same for the actual configurations under /etc/hadoop/conf and compare the values.
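For the second part, a quick way to scan the client configs on the command line (just a sketch; the path is the standard Hadoop client config directory):

grep -rilE "keytab|principal" /etc/hadoop/conf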
						
					
10-14-2016 03:12 PM
Looks like ZKFC is still trying to authenticate itself using Kerberos. Could you please check if all Kerberos configurations have been removed from the Hadoop configuration and Kerberos is disabled?

ERROR client.ZooKeeperSaslClient (ZooKeeperSaslClient.java:respondToServer(323)) - SASL authentication failed using login context 'Client'.

Also, could you please check if this ZNode is still available: /hadoop-ha/dshdp-rlab
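To check the ZNode, you can use the ZooKeeper CLI (a sketch; replace the server with one of your ZooKeeper hosts):

zookeeper-client -server <zk-host>:2181
ls /hadoop-ha/dshdp-rlab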
						
					
10-14-2016 10:48 AM
Awesome! Glad it worked 🙂 Yes, the solrconfig for Ranger audits is configured to use local storage. You can check the solrconfig.xml (either through ZooKeeper or via the Solr UI -> select the collection in the dropdown -> Files -> solrconfig.xml) for the exact path.
10-14-2016 04:25 AM
6 Kudos
Hi @Edgar Daeds, I looked at your log file and it seems that your Solr schema is broken or not valid:

at http://myhostname:8886/solr/ranger_audits: sort param field can't be found: evtTime, retry

Could you please delete the collection and its configuration? Afterwards, let Ranger re-create the collection and its configuration.

Delete collections: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api6

Delete the configuration (delete the collection first!):

1. Log into ZooKeeper:

zookeeper-client -server <zk server & port>

2. Check what configurations are available:

ls /infra-solr/configs

3. Delete the configurations related to Ranger audits (including the ones you have created). For example:

rmr /infra-solr/configs/ranger_audits

Now let Ranger re-create the audit collection 🙂
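For the collection delete itself, the Collections API call looks roughly like this (a sketch, using the host and port from your error message; add --negotiate -u : if the cluster is kerberized):

curl "http://myhostname:8886/solr/admin/collections?action=DELETE&name=ranger_audits"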
						
					
10-13-2016 03:55 PM
@Edgar Daeds That is actually correct. Ranger is creating a Solr collection, but you are looking at a single shard of that collection in the UI. If you open another Ambari Infra instance UI, you will see that the URL changes to ..../ranger_audits_shard2_replica1.... Is this a kerberized environment? Can you copy-paste the ranger.audit.solr.zookeepers configuration value?