Support Questions


Solr indexing from a folder in HDFS

Expert Contributor

Hi,

I tried to index the files in a folder on HDFS; my Solr configuration is the following:

./solr start -cloud -s ../server/solr -p 8983 -z 10.0.2.15:2181 -Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.data.dir=hdfs://10.0.2.15:8020/user/solr -Dsolr.updatelog=hdfs://10.0.2.15:8020/user/solr
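
In case it is relevant, the input folder and the data directory can be checked with plain HDFS commands, for example:

# list the input folder the ingest job should read from
hdfs dfs -ls /user/solr/documents
# list the directory used as solr.data.dir
hdfs dfs -ls /user/solr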

When I launch:

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper -c Collezione -i /user/solr/documents -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -zk 10.0.2.15:2181/solr

I get the following error:

Solr server not available on: http://10.0.2.15:2181/solr
Make sure that collection [Collezione] exists

The collection exists and is valid, but it looks like the job is not able to contact the server.
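
One way to double-check that the collection is really registered is to list the collections through the Collections API (8983 is the port Solr was started on above, so adjust it if yours differs):

curl "http://10.0.2.15:8983/solr/admin/collections?action=LIST&wt=json"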

I'd really appreciate some help in solving this problem.

Davide

1 ACCEPTED SOLUTION


@Davide Isoardi I was able to fix your problem; please try the following solution:

1) Create a JAAS file called jaas.conf

This file can be empty; its contents don't really matter since your environment is not kerberized.
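
For example (the commented-out Client section is only an illustration of what a kerberized setup would need; the keytab path and principal are made-up values):

# a completely empty file is fine on a non-kerberized cluster
touch jaas.conf
# on a kerberized cluster the file would instead contain a Client login section, roughly like:
# Client {
#   com.sun.security.auth.module.Krb5LoginModule required
#   useKeyTab=true
#   keyTab="/etc/security/keytabs/solr.service.keytab"
#   principal="solr/your.host@EXAMPLE.COM";
# };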

2) Start your job with the following command:

hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dlww.jaas.file=jaas.conf -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper --collection test -i file:///data/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat --zkConnect horton01.example.com:2181,horton02.example.com:2181,horton03.example.com:2181/solr

The order of the parameters needs to be the same as in the above command, otherwise the job might not work.

I believe this is a bug; could you please report this issue to Lucidworks? Thanks.


12 REPLIES

New Contributor

Hello everyone!

I'm struggling with the same problem. I've installed Hortonworks 2.3 on 3 machines using the Installation Guide, and after that I installed HDP Search according to the docs as well, so the current state of my configs is pretty much out of the box. I can run all the previous steps properly, but I'm failing at this last one.

The collection exists, my cluster is not kerberized, I'm using all the ZooKeeper instances, and I've tried to run it without the /solr chroot, but nothing works.

Update:

I've also followed the recommended practice to clean up and chroot my SolrCloud, following the post Best Practice: 'chroot' your Solr Cloud in ZooKeeper.
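
(If it helps anyone, the chroot itself can be created with the zkcli script that ships with Solr; the install path below is the HDP Search default, so adjust it to your layout.)

/opt/lucidworks-hdpsearch/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost 10.1.0.4:2181,10.1.0.5:2181,10.1.0.6:2181 -cmd makepath /solr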

Still having the same issue when trying to index with the DirectoryIngestMapper:

Solr server not available on: 10.1.0.4:2181,10.1.0.5:2181,10.1.0.6:2181/solr
Make sure that collection [boletines_cba] exists

Does anyone have any insight into how to solve this issue?

Best regards,

New Contributor

I've been able to sort out this problem. I had a wrong field in the initParams definition in the solrconfig.xml file; I spotted the error in Solr's logs.

After fixing it, the MapReduce job started working. I wonder why this impacts the DirectoryIngestMapper, because I was using that Solr config in other environments for testing and was able to index without problems through other requestHandlers. It seems that the Mapper class depends on that config at some point.
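
(For anyone chasing the same thing: the relevant message shows up when grepping the Solr log for initParams; the log path below assumes the HDP Search default install location, so adjust it to yours.)

grep -i "initParams" /opt/lucidworks-hdpsearch/solr/server/logs/solr.log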

Regards,

New Contributor

Hi,

The above command is working fine for me, thank you.

But if more documents land in the same HDFS directory every hour, what would be the best solution to index only the new documents in HDFS?