Member since: 07-29-2013
Posts: 20
Kudos Received: 7
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 13580 | 06-26-2014 10:05 AM |
| | 7362 | 06-25-2014 10:35 AM |
| | 13020 | 04-09-2014 11:07 AM |
| | 2883 | 04-01-2014 12:01 PM |
11-12-2014
06:45 AM
I agree with Clint: bulk loading into HBase every 3 minutes is too frequent and will cause a ton of compactions. To remedy the splits, you should have an overall understanding of what your data will look like 6 months to 1 year from now and pre-split the table upon creation (see the sketch below). This should give you enough regions to load all of your data without having to split every time. This is a best practice for puts as well. Also, with regard to bulk loading, early versions of CDH 4 had some issues with sequence numbers, so I would advise moving to CDH 5.1.3.
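For reference, here is a minimal sketch of pre-splitting from the HBase shell; the table name, column family, and split points are hypothetical and should be derived from your expected row-key distribution:

```
# Create a table pre-split into 5 regions using explicit split points
# ('mytable', 'cf', and the keys below are illustrative only).
create 'mytable', 'cf', {SPLITS => ['20000000', '40000000', '60000000', '80000000']}

# Or read split points from a file, one key per line:
create 'mytable', 'cf', {SPLITS_FILE => '/tmp/splits.txt'}
```

Split points that mirror your key distribution keep each bulk load spread evenly across regions, so no single region absorbs every HFile.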
06-27-2014
07:02 AM
1 Kudo
You can follow the same steps I sent you, but you will need to switch to the cdk-morphlines-solr-cell morphline (https://github.com/cloudera/search#cdk-morphlines-solr-cell) instead of the CSV one in the example; a sketch of the swap is below.
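As a rough illustration (not a drop-in config), the readCSV command from my earlier example would be swapped for a solrCell command that runs the content through Tika; the AutoDetectParser shown here is one common choice, not the only one:

```
commands : [
  # Parse the raw file with Tika via solrCell instead of readCSV.
  {
    solrCell {
      solrLocator : ${SOLR_LOCATOR}
      # Illustrative parser; pick parsers for your actual file formats.
      parsers : [
        { parser : org.apache.tika.parser.AutoDetectParser }
      ]
    }
  }
  { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
  { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
]
```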
06-27-2014
06:53 AM
1 Kudo
Bala, it absolutely is. I was just giving you a sample set of instructions so you could play with a CSV file ingest. You will be looking to use Apache Tika. The good news is that there is a morphline to help you with that; the bad news is that you will have to write that morphline. I would recommend starting here: https://github.com/cloudera/search#cdk-morphlines-solr-cell
06-27-2014
06:44 AM
Hi Bala, I have found that data is very rarely truly unstructured. What kind of data is it? Typically, there is some form of structure to it. Can you send me a sample file? kevin@cloudera.com
06-27-2014
06:36 AM
2 Kudos
Bala,

**Create a local Solr project directory and schema**

Execute the following commands to create a project directory. You can specify whatever directory you like; I will use ~/sample07:

```
$ export PROJECT_HOME=~/sample07
$ solrctl instancedir --generate $PROJECT_HOME
```

This creates the $PROJECT_HOME/conf directory, which contains a number of files, including a default schema.xml. Replace the <fields>, <uniqueKey>, and <copyField> elements in $PROJECT_HOME/conf/schema.xml with values that match your data:

```
<fields>
  <field name="code" type="string" indexed="true" stored="true"/>
  <field name="description" type="string" indexed="true" stored="true"/>
  <field name="salary" type="int" indexed="true" stored="true"/>
  <field name="total_emp" type="int" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

<uniqueKey>code</uniqueKey>

<copyField source="code" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="salary" dest="text"/>
<copyField source="total_emp" dest="text"/>
```

**You will use the fields that match your data.**

**Create a morphline**

Create a file named "morphline1.conf" in the $PROJECT_HOME directory with the text below, which parses the data file into records and fields, fixes some non-numeric data, and loads the records into Solr. Make sure to replace the hostname in the zkHost field with the hostname of a ZooKeeper server. Don't use "localhost", as the ZooKeeper hostname will be used on data nodes during the MapReduce-based batch indexing process.

```
# morphline1.conf
SOLR_LOCATOR : {
  collection : Sample-07-Collection
  # ZooKeeper ensemble -- set this to your cluster's ZooKeeper hostname(s)
  zkHost : "ZK_HOST:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # Read the CSV data
      {
        readCSV {
          separator : "\t"
          columns : ["code","description","total_emp","salary"]
          ignoreFirstLine : false
          trim : false
          charset : UTF-8
        }
      }
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      # Load the record into a Solr server or MapReduce Reducer.
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
```

Note also that the column names are set here; they should match the fields in schema.xml.

**Grab a log4j.properties file**

Copy the example log4j.properties file into the $PROJECT_HOME directory:

```
$ cp /opt/cloudera/parcels/CDH/share/doc/search-1.0.0+cdh5.0.0+0/examples/solr-nrt/log4j.properties $PROJECT_HOME
```

It may be useful for debugging later to set log4j.logger.org.kitesdk.morphline=TRACE.

**Create the Solr instance dir**

Execute this command to create the Solr instance directory:

```
$ solrctl --zk localhost:2181/solr instancedir --create Sample-07-Collection $PROJECT_HOME
```

**Create the Solr collection**

Execute this command to create the Solr collection. Note that the "-s" argument defines the number of shards, which should correspond to the number of Solr Server instances you have. In my case, I have Solr Servers deployed on two nodes:

```
$ solrctl --zk localhost:2181/solr collection --create Sample-07-Collection -s 2
```

Connect to the Solr WebUI on either node to see that the collection has been created with two shards.

**Perform a "dry run" to test your morphline**

Run the batch-indexing process (as a user with write permissions in HDFS) as follows, including the --dry-run argument (make sure to replace all the hostnames in the command with the correct values).
```
$ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    -D 'mapred.child.java.opts=-Xmx500m' \
    --log4j $PROJECT_HOME/log4j.properties \
    --morphline-file $PROJECT_HOME/morphline1.conf \
    --output-dir hdfs://mbrooks0:8020/user/mark/sample07/ \
    --verbose --go-live \
    --zk-host mbrooks0:2181/solr \
    --collection Sample-07-Collection \
    --dry-run \
    hdfs://mbrooks0:8020/user/hive/warehouse/sample_07
```

See the docs for details on the MapReduceIndexerTool. If the dry run completes without errors, you should see output like this at the end of the log:

```
3362 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool  files in dryrun mode took 0.431 secs
3362 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool  Program took 3.404 secs. Goodbye.
```

**Run the batch-index job for real**

Once your dry run looks good, run the same command as the dry-run test above without the --dry-run argument. I ran my test using YARN/MR2. Make sure the machine you are running the job on has the appropriate Gateway roles provisioned (in my case, a YARN Gateway role) and the latest client configs. Using the Solr WebUI, we can see X records in shard 1; a command-line check is sketched below.
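For a quick check outside the WebUI, you can also query the collection directly; the host and port below are assumptions (8983 is Solr's default), so adjust them to one of your Solr Server nodes:

```
# Hypothetical host/port; the numFound value in the JSON response
# should match the record count shown in the Solr WebUI.
$ curl 'http://mbrooks0:8983/solr/Sample-07-Collection/select?q=*:*&rows=0&wt=json'
```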
06-26-2014
10:05 AM
2 Kudos
Under your browser settings, I would make sure you don't have any proxy information set up there.
06-25-2014
10:35 AM
Bala, follow the steps from "Create a local Solr project directory and schema" through "Viewing the results". This will have you set up a Solr index in HDFS. You can use any CSV file for sample data; a hypothetical example follows below.
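For instance, here is a minimal sketch of staging a sample file in HDFS; the rows and paths below are made up for illustration, and the separator must match what your morphline's readCSV command expects:

```
# Create a small tab-separated sample file (values are illustrative).
$ printf '00-0000\tAll Occupations\t134354250\t40690\n' > sample07.tsv
$ printf '11-0000\tManagement occupations\t6003930\t96150\n' >> sample07.tsv

# Stage it in HDFS where the indexer job will read it.
$ hdfs dfs -mkdir -p /user/hive/warehouse/sample_07
$ hdfs dfs -put sample07.tsv /user/hive/warehouse/sample_07/
```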
06-25-2014
07:59 AM
Hi Bala, you will have to create your own custom webapp then. We don't have a tutorial on that readily available.
06-24-2014
06:41 AM
1 Kudo
Hi Bala, can you please send me an email offline and I will send you a quick Solr example guide (I can't attach files here)? My contact information is kevin@cloudera.com
04-09-2014
11:12 AM
It could. Can you connect to the HBase cluster from one of the nodes in the cluster (i.e., not from Windows or your IDE)? A quick check is sketched below.
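A quick way to test, assuming you can SSH to a cluster node that has the HBase client configuration deployed:

```
# From a node inside the cluster, open the HBase shell and check the cluster.
$ hbase shell
hbase> status
hbase> list
```

If status and list work there but your Windows/IDE client still fails, the problem is likely name resolution or network access from outside the cluster.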