We are trying to set up Solr (HDP Search) on our Hortonworks cluster.
As part of this, I have installed lucidworks-hdpsearch, started Solr in cloud mode, and indexed sample documents using lucidworks-hadoop-job-2.0.3.jar with the DirectoryIngestMapper.
As I understand it, DirectoryIngestMapper indexes all files under a given directory using Apache Tika. In our setup, new files keep landing in an HDFS directory, and we want to index only those new files rather than re-indexing the whole directory.
Please let me know if there is a mapper/indexing method (preferably Tika-based) that can be run on a list of HDFS files rather than an entire directory.
@Anil Ekambram You can use lucidworks-hadoop-job-2.0.3.jar to ingest specific files as well: point the -i option at a file path instead of a directory. For example:
hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar com.lucidworks.hadoop.ingest.IngestJob \
  -Dlww.commit.on.close=true \
  -Dlww.jaas.file=/opt/lucidworks-hdpsearch/solr/bin/jaas.conf \
  -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
  --collection MyCollection \
  -i hdfs://hortoncluster/data/my_file.txt \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  --zkConnect horton01:2181/solr
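For the "only new files" part of the question, one option is to keep a local record of already-indexed paths and diff each fresh HDFS listing against it, then feed the new paths to -i (Hadoop input paths generally accept comma-separated lists). Below is a minimal sketch of that bookkeeping; the file names (`indexed_files.txt`, `current_listing.txt`), the demo paths, and the assumption that -i takes a comma-separated list are mine, not from the Lucidworks docs, so verify against your version before relying on it:

```shell
#!/bin/sh
# Hypothetical sketch: track already-indexed HDFS paths in a local file
# and compute only the paths that have not been indexed yet.
cd "$(mktemp -d)"

# new_files PROCESSED LISTING -> comma-separated paths not yet in PROCESSED.
# grep -vxF drops lines already present (whole-line, fixed-string match);
# paste joins the survivors with commas for the job's -i argument.
new_files() {
    grep -vxF -f "$1" "$2" | paste -sd, -
}

# Demo with sample data; in a real run the listing would come from e.g.
#   hdfs dfs -ls /data/landing | awk '{print $NF}' > current_listing.txt
printf '/data/a.txt\n' > indexed_files.txt
printf '/data/a.txt\n/data/b.txt\n/data/c.txt\n' > current_listing.txt

NEW_FILES=$(new_files indexed_files.txt current_listing.txt)
echo "$NEW_FILES"   # /data/b.txt,/data/c.txt

if [ -n "$NEW_FILES" ]; then
    : # hadoop jar ... com.lucidworks.hadoop.ingest.IngestJob ... -i "$NEW_FILES" ...
    # On a successful run, record the new paths as processed.
    printf '%s\n' "$NEW_FILES" | tr ',' '\n' >> indexed_files.txt
fi
```

This is just scheduling glue around the same IngestJob command shown above; a cron job or Oozie coordinator could invoke it each time new files land.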