Support Questions

Indexing files on HDFS using solrcloud on cloudera and searching with hue


New Contributor

Hi Team, 

 

I'm trying to create an index for a set of .xml files stored on HDFS. Below are the steps I followed to create the collection and index the data on HDFS.

 

I'm running Solr in SolrCloud mode, which means the "Cloud" tab is visible in the Solr Admin UI.

 

1. Created an instance directory using the "solrctl" utility. I followed the first two steps (initiating collections) from the URL below; the commands I ran were roughly as sketched after the link.

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_validate_deploy_...
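
For reference, those first two steps were along these lines (sketched from memory; the config directory path and collection name here are just examples from my setup):

solrctl instancedir --generate $HOME/solr_configs
solrctl instancedir --create collection1 $HOME/solr_configs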

 

When I try to perform the 3rd step (from initiating collections), I get the error below:

 

Error: A call to SolrCloud WEB APIs failed: HTTP/1.1 400 Bad Request
Server: Apache-Coyote/1.1
Content-Type: application/xml;charset=UTF-8
Transfer-Encoding: chunked
Date: Mon, 10 Aug 2015 23:29:34 GMT
Connection: close

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">153</int>
  </lst>
  <str name="Operation createcollection caused exception:">org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Cannot create collection collection3. Value of maxShardsPerNode is 1, and the number of live nodes is 1. This allows a maximum of 1 to be created. Value of numShards is 2 and value of replicationFactor is 1. This requires 2 shards to be created (higher than the allowed number)</str>
  <lst name="exception">
    <str name="msg">Cannot create collection collection3. Value of maxShardsPerNode is 1, and the number of live nodes is 1. This allows a maximum of 1 to be created. Value of numShards is 2 and value of replicationFactor is 1. This requires 2 shards to be created (higher than the allowed number)</str>
    <int name="rspCode">400</int>
  </lst>
  <lst name="error">
    <str name="msg">Cannot create collection collection3. Value of maxShardsPerNode is 1, and the number of live nodes is 1. This allows a maximum of 1 to be created. Value of numShards is 2 and value of replicationFactor is 1. This requires 2 shards to be created (higher than the allowed number)</str>
    <int name="code">400</int>
  </lst>
</response>

 

To get around this error, I created the collection using the Solr Collections API instead:

 

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=2&replicationFactor=1&maxShardsPerNode=2'

 

Using curl, I'm able to create the collection.
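
As a sanity check (assuming Solr is running locally on the default port), I can confirm the collection now exists via the Collections API:

curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'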

 

2. After this, I manually created two directories on HDFS as described here (rough commands below the two paths):

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_data_index_prepa...

 

/user/hadoop/indir - where all my .xml data resides

/user/hadoop/outdir - the index will be written here.
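
The commands I used were along these lines (assuming the .xml files sit on the local filesystem of the node I ran this from):

hdfs dfs -mkdir -p /user/hadoop/indir /user/hadoop/outdir
hdfs dfs -put *.xml /user/hadoop/indir/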

 

3. I followed these steps to index the data on HDFS:

 

http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/search_load_index_data....

 

Based on the above URL, I customized my script and ran the command below to create an index for the .xml files in indir:

 

hadoop \
jar \
/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
-D 'mapred.child.java.opts=-Xmx500m' \
--morphline-file /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/share/doc/search-1.0.0+cdh5.4.2+0/quickstart/morphlines.conf \
--output-dir hdfs://10.54.8.183/user/hadoop/outdir \
--verbose \
--go-live \
--zk-host 10.0.8.184:2181,10.0.8.185:2181,10.0.8.186:2181/solr \
--collection dicomcollection \
hdfs://10.54.8.183/user/hadoop/indir

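
As an extra check after --go-live, I can also query the collection directly (assuming Solr runs locally on the default port; adjust the host and collection name as needed). If numFound comes back as 0, nothing was loaded:

curl 'http://localhost:8983/solr/dicomcollection/select?q=*:*&rows=0&wt=json'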

 

Some MapReduce job runs and something is written to outdir. But when I search the collection through Hue, nothing is retrieved. Can anyone help me figure out where I'm going wrong?

 

 

Thanks 

Karthik Vadla


Re: Indexing files on HDFS using solrcloud on cloudera and searching with hue

Cloudera Employee

Hi Karthik,

 

It seems that you have two problems here. You've gotten around the first one, but I want to make sure that you (as well as others who see this) understand what caused it.

 

 

The instructions for creating the collection in the documentation are just an example, which you might need to modify slightly based on how your particular cluster is set up. In this case, the command was given as "solrctl collection --create collection1 -s 2 -c collection1", which creates a collection named "collection1" with two shards. Using multiple shards can improve scalability by splitting the collection across multiple machines, which lets you exceed the capacity of a single machine and can improve performance, because the machines holding those shards search the data in parallel. However, multiple shards mainly make sense when you have multiple nodes running Solr, and the error message "Value of maxShardsPerNode is 1, and the number of live nodes is 1. This allows a maximum of 1 to be created. Value of numShards is 2 and value of replicationFactor is 1" is telling you that you're trying to create more shards (2) than you have machines. By default this is not allowed, but you can override it by raising the "maxShardsPerNode" parameter, as you did with the curl command. Hopefully that explanation makes sense.
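
As an aside, on a single-node cluster you can also simply create the collection with one shard, which avoids having to override maxShardsPerNode at all. Roughly (the collection name here is just an example):

solrctl collection --create collection1 -s 1

or the equivalent Collections API call:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=1'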

 

As to why there's no data when you search the collection in Hue, the most likely cause is that the indexing job completed successfully but no data was actually loaded into Solr. There are a lot of reasons why that could happen, such as pointing to an input directory without any data, filtering out all the records in your morphline configuration, or omitting the loadSolr command from the morphline. I'd recommend adding some log statements to your morphlines.conf file, running your job again, and then looking at the logs from that job (e.g., in Hue) so you can see what's happening.
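
To illustrate what I mean, the logging and loadSolr commands live in the commands list of your morphline. This is only a sketch (your parsing commands, collection name, and ZooKeeper quorum will differ), but the shape is roughly:

SOLR_LOCATOR : {
  collection : dicomcollection
  zkHost : "10.0.8.184:2181,10.0.8.185:2181,10.0.8.186:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # ... your existing commands that parse/extract fields from the .xml files ...

      # log each record so it shows up in the MapReduce task logs
      { logDebug { format : "record after parsing: {}", args : ["@{}"] } }

      # without loadSolr, nothing is ever written to the index
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]

If every record reaches the logDebug statement but the index is still empty, the problem is after that point; if nothing reaches it, the records are being dropped earlier in the morphline.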

 

BTW, I should also point out that Cloudera offers a training class on Cloudera Search. It covers all of these topics and more: everything from designing your schema, setting up collections, and writing morphline files to transform and load data, through debugging failures and building a search UI with Hue. You can find out more about the course here (http://cloudera.com/content/cloudera/en/training/course-listing.html?course=search&loc=all).
