
Make Solr use HDFS

Expert Contributor

Hello,

I have installed Cloudera Manager 5 and, using it, I installed the Solr, ZooKeeper, HDFS, and YARN services.

I am trying to do the following:

1. Load data into HDFS
2. Access the HDFS data using Solr

Please suggest steps to achieve this.

 

Thanks

Bala

1 ACCEPTED SOLUTION

Explorer

Bala,

 

 

  Follow these steps, from "Create a local Solr project directory and schema" through viewing the results.

This will have you set up a Solr index in HDFS. You can use any CSV file for sample data.

 


18 REPLIES

Explorer

Hi Bala,

 

  Can you please send me an email offline and I will send you a quick Solr example guide (I can't attach files here)? My contact information is kevin@cloudera.com

Expert Contributor
Hello Kevin,

I read the Solr example you sent, but I am a bit confused. I saw that Hue is being used. Hue is a web application, but I actually need to create my own web app that uses Solr to query data from HDFS.

Thanks
Bala

Expert Contributor
Hi Kevin,

Awaiting your response.

Thanks
Bala

Explorer

Hi Bala,

 

  You will have to create your own custom webapp then. We don't have a tutorial on that readily available.

Expert Contributor
Hi Kevin,

How can I use Solr over HDFS? Once this part is done, I will be able to complete the webapp UI.

Thanks
Bala

Explorer

Bala,

 

 

  Follow these steps, from "Create a local Solr project directory and schema" through viewing the results.

This will have you set up a Solr index in HDFS. You can use any CSV file for sample data.

 

Expert Contributor
Kevin, can you please elaborate? Currently I am able to load data into HDFS using the hdfs put command. Now can you tell me how to make it available for Solr to query?
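
For reference, the load I am doing looks roughly like this (the file name and target path are just examples):

# Illustrative only: copy a local, tab-separated file into HDFS with the put command
$ hdfs dfs -mkdir -p /user/hive/warehouse/sample_07
$ hdfs dfs -put sample_07.csv /user/hive/warehouse/sample_07/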
Thanks
Bala

Explorer

Bala,

 

Create a local Solr project directory and schema
Execute the following commands to create a project directory. You can specify whatever directory you like; I will use ~/sample07:
$ export PROJECT_HOME=~/sample07
$ solrctl instancedir --generate $PROJECT_HOME
This will create the $PROJECT_HOME/conf directory, which contains a number of files, including a default schema.xml.
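
To sanity-check what was generated, you can simply list the directory; exact file names may vary by CDH version, but schema.xml and solrconfig.xml should be among them.

# List the generated instance directory configuration
$ ls $PROJECT_HOME/conf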

 

Replace the <fields>, <uniqueKey> and <copyField> elements in $PROJECT_HOME/conf/schema.xml with values that match your data:
<fields>
  <field name="code" type="string" indexed="true" stored="true"/>
  <field name="description" type="string" indexed="true" stored="true"/>
  <field name="salary" type="int" indexed="true" stored="true"/>
  <field name="total_emp" type="int" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>code</uniqueKey>
<copyField source="code" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="salary" dest="text"/>
<copyField source="total_emp" dest="text"/>

 

You will use the fields that match your own data.

 

Create a morphline
Create a file named "morphline1.conf" in the $PROJECT_HOME directory with the text below, which will parse the data file into records and fields, fix some non-numeric data, and load the records into Solr.
Make sure to replace the hostname in the zkHost field with the hostname of a ZooKeeper server. Don't use "localhost", because the ZooKeeper hostname will be used on the data nodes during the MapReduce-based batch indexing process.

 

# morphline1.conf
SOLR_LOCATOR : {
  collection : Sample-07-Collection
  # ZooKeeper ensemble -- set this to your cluster's ZooKeeper hostname(s)
  zkHost : "ZK_HOST:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # Read the tab-separated CSV data
      {
        readCSV {
          separator : "\t"
          columns : ["code","description","total_emp","salary"]
          ignoreFirstLine : false
          trim : false
          charset : UTF-8
        }
      }

      # Drop record fields that are not declared in the Solr schema
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # Load the record into a Solr server or MapReduce reducer
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]

Note also that the column names are set in readCSV; they should match the fields in schema.xml.
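
Optional sanity check, assuming your sample_07 data is already in HDFS at the path used in the indexer command below: peek at the first rows to confirm the tab-separated column order matches the columns list above.

# Illustrative: inspect the first rows of the tab-separated input data in HDFS
$ hdfs dfs -cat /user/hive/warehouse/sample_07/* | head -3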

 

Grab a log4j.properties file
Copy the example log4j.properties file into the $PROJECT_HOME directory:
$ cp /opt/cloudera/parcels/CDH/share/doc/search-1.0.0+cdh5.0.0+0/examples/solr-nrt/log4j.properties $PROJECT_HOME
It may be useful for debugging later to set log4j.logger.org.kitesdk.morphline=TRACE.
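
One way to turn that on, assuming the log4j.properties file copied above:

# Enable verbose morphline tracing for debugging
$ echo "log4j.logger.org.kitesdk.morphline=TRACE" >> $PROJECT_HOME/log4j.properties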

 

Create the Solr instance dir
Execute this command to create the Solr instance directory:
$ solrctl --zk localhost:2181/solr instancedir --create Sample-07-Collection $PROJECT_HOME

 

Create the Solr Collection
Execute this command to create the Solr Collection. Note the "-s" argument defines the number of shards, which should correspond to the
number of Solr Server instances you have. In my case I have Solr Servers deployed on two nodes:
$ solrctl --zk localhost:2181/solr collection --create Sample-07-Collection -s 2

 

Connect to the Solr WebUI on either node to see that the Collection has been created with two shards.

 

Perform a "dry-run" to test your morphline
Run the batch-indexing process (as a user with write permissions in HDFS) as follows, including the --dry-run argument (make sure to replace all
the host names in the command with the correct values).
$ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' \
    --log4j $PROJECT_HOME/log4j.properties --morphline-file $PROJECT_HOME/morphline1.conf \
    --output-dir hdfs://mbrooks0:8020/user/mark/sample07/ --verbose --go-live \
    --zk-host mbrooks0:2181/solr --collection Sample-07-Collection --dry-run \
    hdfs://mbrooks0:8020/user/hive/warehouse/sample_07
See the Cloudera Search documentation for details on the MapReduceIndexerTool.
If the dry run completes without errors, you should see output like this at the end of the log:

3362 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - ... files in dryrun mode took 0.431 secs
3362 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Program took 3.404 secs. Goodbye.

 

Run the batch-index job for real
Once your dry-run looks good, run the same command as the dry-run test above without the --dry-run argument. I ran my test using YARN/MR2.
Make sure the machine you are running the job on has the appropriate Gateway roles provisioned (in my case YARN Gateway role) and the latest client configs.

 

Using the Solr WebUI we can see X records in shard 1
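
If you want to check the record count without the WebUI, a query like the following works too; the hostname and the default Solr port 8983 are assumptions about your deployment.

# numFound in the response should match the number of indexed records
$ curl "http://SOLR_HOST:8983/solr/Sample-07-Collection/select?q=*:*&rows=0&wt=json"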

 

Expert Contributor
Thanks a lot for the explanation, Kevin. I am on it. But my data is completely unstructured. How can I define fields for it?
Thanks
Bala

Explorer

Hi Bala,

 

  I have found that data is very rarely truly unstructured. What kind of data is it? Typically, there is some form of structure to it. Can you send me a sample file at kevin@cloudera.com?

Expert Contributor
Kevin, the data consists of rich documents (txt, pdf, doc files). It does not hold any particular structure. Is it possible to extract the data out of this format?
Thanks
Bala

Explorer

Bala,

 

  It absolutely is. I was just giving you a sample set of instructions so you could play with a CSV file ingest. You will be looking to use Apache Tika. The good news is there is a morphline to help you with that. The bad news is you will have to write that morphline. I would recommend starting here: https://github.com/cloudera/search#cdk-morphlines-solr-cell
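
Before writing that morphline, it can help to confirm the morphlines-solr-cell module (which provides the Tika-based solrCell command) is present in your parcel; the search path below is an assumption based on the parcel location used earlier in this thread.

# Illustrative: locate the morphlines-solr-cell jar shipped with CDH
$ find /opt/cloudera/parcels/CDH -name '*morphlines-solr-cell*.jar'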

Expert Contributor
Kevin, in the earlier briefing you mentioned a morphline. So should I proceed with the earlier steps you asked me to follow, or should I go through this first? https://github.com/cloudera/search#cdk-morphlines-solr-cell
Thanks
Bala

Explorer

You can follow the same steps I sent you, but you will need to switch to the cdk-morphlines-solr-cell morphline (https://github.com/cloudera/search#cdk-morphlines-solr-cell) instead of the CSV one in the example.

Expert Contributor
Kevin, how do I use the CDK?
Thanks
Bala

Expert Contributor
Hello Kevin,

I am still not able to figure out how to use the CDK you mentioned 😞 Need help.

Thanks
Bala

Expert Contributor

Kevin, I followed the steps. It is working as expected in the dry run, but when I run without the --dry-run argument, it stops at this step 😞 😞

 

770  [main] INFO  org.apache.solr.cloud.ZkController  – Write file /tmp/1404354031741-0/velocity/facet_fields.vm
771  [main] INFO  org.apache.solr.cloud.ZkController  – Write file /tmp/1404354031741-0/elevate.xml
773  [main] INFO  org.apache.solr.cloud.ZkController  – Write file /tmp/1404354031741-0/admin-extra.menu-bottom.html
774  [main] INFO  org.apache.solr.cloud.ZkController  – Write file /tmp/1404354031741-0/schema.xml
897  [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  – Indexing 1 files using 1 real mappers into 1 reducers

 

It stops at the 897 line itself. I restarted and tried again, still the same.

 

Any help?

 

Thanks

Bala

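
A general way to check whether the indexing job actually reached YARN, and to inspect its logs (the application id below is a placeholder):

# Illustrative: list submitted/running YARN applications and fetch logs for one of them
$ yarn application -list
$ yarn logs -applicationId <application_id>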

New Contributor

Is an incremental load to Solr possible? Meaning that if the dataset being loaded into Solr has some unique keys (with or without updates in other fields of the record) that are already present in the Solr collection, I want the existing records to get updated and the new records to get inserted into the Solr collection. Could you please let me know whether this is possible in Solr or not? If yes, please advise on how to achieve it.