
HDP Search - Best way to index list of HDFS files

Hi All,

We are trying to set up Solr (HDP Search) on our Hortonworks cluster.

As part of this, I have installed lucidworks-hdpsearch, started Solr in cloud mode, and indexed sample documents using the DirectoryIngestMapper from lucidworks-hadoop-job-2.0.3.jar.

As I understand it, DirectoryIngestMapper indexes every file under a given directory, using Apache Tika to parse them. In our setup, new files keep landing in an HDFS directory, and we want to index only those new files rather than the whole directory.

Please let me know if there is a mapper or indexing method (preferably Tika-based) that can be run on a list of HDFS files rather than on an entire directory.

Regards,

Anil

Re: HDP Search - Best way to index list of HDFS files

@Anil Ekambram You can use lucidworks-hadoop-job-2.0.3.jar to ingest specific files as well: keep DirectoryIngestMapper and point -i at a single file instead of a directory.

For example:

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar \
  com.lucidworks.hadoop.ingest.IngestJob \
  -Dlww.commit.on.close=true \
  -Dlww.jaas.file=/opt/lucidworks-hdpsearch/solr/bin/jaas.conf \
  -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
  --collection MyCollection \
  -i hdfs://hortoncluster/data/my_file.txt \
  -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
  --zkConnect horton01:2181/solr
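
To cover a list of files rather than one, the same command could be wrapped in a small loop. The following is only a rough sketch, not something taken from the Lucidworks documentation: the function names ingest_file and ingest_list, the list file (one HDFS path per line), and the DRY_RUN switch are all hypothetical, and the jar path, collection, JAAS file and ZooKeeper address are simply copied from the example above.

```shell
#!/usr/bin/env bash
# Sketch: run the ingest job once per HDFS path listed in a text file.
# Values below are copied from the example command; adjust for your cluster.
JOB_JAR=/opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar
JAAS_FILE=/opt/lucidworks-hdpsearch/solr/bin/jaas.conf
COLLECTION=MyCollection
ZK_CONNECT=horton01:2181/solr

# Hypothetical helper: ingest a single HDFS file.
# With DRY_RUN=1 the command is printed instead of executed.
ingest_file() {
  local path="$1"
  local cmd=(hadoop jar "$JOB_JAR"
    com.lucidworks.hadoop.ingest.IngestJob
    -Dlww.commit.on.close=true
    -Dlww.jaas.file="$JAAS_FILE"
    -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper
    --collection "$COLLECTION"
    -i "$path"
    -of com.lucidworks.hadoop.io.LWMapRedOutputFormat
    --zkConnect "$ZK_CONNECT")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Hypothetical helper: read one HDFS path per line and ingest each.
# The list is read on file descriptor 3 so the job cannot swallow it via stdin.
ingest_list() {
  local list="$1" path
  while IFS= read -r path <&3; do
    [ -z "$path" ] && continue          # skip blank lines
    ingest_file "$path" || echo "ingest failed: $path" >&2
  done 3< "$list"
}
```

After sourcing the script, something like `DRY_RUN=1 ingest_list new_files.txt` prints the commands without running them, which is a cheap way to check the paths first. Note that this launches one MapReduce job per file, so for a large batch it may be cheaper to move the new files into a staging directory and index that directory in a single run.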