Member since 02-06-2014 | 3 Posts | 0 Kudos Received | 0 Solutions
02-07-2014
07:30 AM
That's good to know. I believe I have the most recent QuickStart VM (4.4.0-1). Are updated versions of the VMs made available regularly? Or do you know if this is something that can be "patched" within the VM? (I'd like to demo something based on the search functionality, with a view to requesting that our enterprise cluster have Search enabled. It runs Cloudera, but I'm not sure which CDH version yet.)
02-07-2014
06:07 AM
A small additional piece of information: by exploring the contents of the Solr index via the Solr Admin web UI, I can see that certain fields do indeed seem to be indexed with the "org.apache.avro.util.Utf8:" prefix on the original strings. The fields in question are: user_screen_name, user_location, text, user_name, source.

From the batch_tweets.sh script I can see how it invokes the MapReduceIndexerTool, pointing it at the batch_tweets_indir location in HDFS (which contains the input data in Avro format). From what I understand, the morphline may be key to processing the input data in HDFS and passing it on to the indexer. Does anybody know if that's a good place to dig further, or should I look into the source code for MapReduceIndexerTool?
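As a stopgap for a demo while the root cause is tracked down, the prefix could be stripped on the client side before results are displayed. The following is only a minimal Python sketch under that assumption: the helper names `strip_avro_prefix` and `clean_doc` and the sample document are hypothetical and not part of Hue, Solr, or the QuickStart VM. It does not fix what is stored in the index.

```python
# The prefix observed on the indexed string fields.
AVRO_UTF8_PREFIX = "org.apache.avro.util.Utf8:"


def strip_avro_prefix(value):
    """Return value without the leading Avro Utf8 class-name prefix, if present."""
    if isinstance(value, str) and value.startswith(AVRO_UTF8_PREFIX):
        return value[len(AVRO_UTF8_PREFIX):]
    # Non-string values and unprefixed strings pass through unchanged.
    return value


def clean_doc(doc):
    """Apply strip_avro_prefix to every field of a result document (a dict)."""
    return {field: strip_avro_prefix(value) for field, value in doc.items()}


# Hypothetical document shaped like the values seen in the results preview.
doc = {
    "text": "org.apache.avro.util.Utf8:tweet text 10782",
    "user_name": "org.apache.avro.util.Utf8:fake user10782",
    "id": 10782,
}
print(clean_doc(doc))
```

Because the prefix is part of the indexed value, searches on these fields would still see it, so this only tidies the display.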
02-06-2014
01:43 PM
I'm trying out the Cloudera QuickStart VM and found it pretty straightforward to run the basic WordCount M/R example and Hive queries on sample CSVs. I was most eager to try out Cloudera Search, so I followed the steps from the blog here.

The ~/datasets/batch-tweets.sh script seemed to run fine: the MapReduceIndexer took 3 to 4 minutes and the jobs seemed to succeed. I could see what looks like a Lucene index in HDFS under /solr/batch_tweets/core_node1/data/index. So far so good.

I fired up the Hue Solr Search tool and tried customizing how search results are formatted. This works partially, but each field in a set of results is preceded by what looks like Avro type information. For example, if the template looks like:

{{text}} {{user_name}}

the results preview shows the following:

org.apache.avro.util.Utf8:tweet text 10782 org.apache.avro.util.Utf8:fake user10782

I also tried using avro-tools to read the sample data that the batch-tweets script pulls in for indexing:

java -jar ~/avro-tools-1.7.3.jar tojson /usr/share/doc/search-1.0.0/examples/test-documents/sample-statuses-20120906-141433-medium.avro | less

The Avro files seemed to read just fine. Is it possible that there's been some change to the QuickStart VM since the blog was posted last summer? Any suggestions welcome.
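To confirm that the prefix is not already present in the source data, the JSON emitted by `avro-tools tojson` could be scanned mechanically rather than by eye. This is a rough Python sketch, assuming one JSON record per line and checking only top-level string values (nested or union-wrapped values are not inspected). The function name `prefixed_fields` and the sample lines are made up for illustration.

```python
import json

# The prefix that shows up in the Hue results preview.
AVRO_UTF8_PREFIX = "org.apache.avro.util.Utf8:"


def prefixed_fields(json_lines):
    """Yield (record_number, field) for top-level string values carrying the prefix."""
    for number, line in enumerate(json_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for field, value in record.items():
            if isinstance(value, str) and value.startswith(AVRO_UTF8_PREFIX):
                yield number, field


# Made-up sample lines: the first is clean, the second carries the prefix.
sample = [
    '{"text": "tweet text 10782", "user_name": "fake user10782"}',
    '{"text": "org.apache.avro.util.Utf8:tweet text 10783", "user_name": "fake user10783"}',
]
print(list(prefixed_fields(sample)))
```

In practice the avro-tools output would be piped into a script that calls `prefixed_fields(sys.stdin)`. An empty result would suggest the prefix is introduced somewhere between reading the Avro input and writing to the index, not in the sample files themselves.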