Avro type information shown with indexed fields when trying out Cloudera Search using QuickStart VM.

New Contributor

I'm trying out the Cloudera QuickStart VM and found it pretty straightforward to run the basic WordCount M/R example and Hive queries on sample CSVs. I was most eager to try out Cloudera Search.

 

I followed the steps from the blog here. The ~/datasets/batch-tweets.sh script seemed to run fine: the MapReduceIndexerTool run took 3 to 4 minutes and the jobs seemed to succeed. I could see what looks like a Lucene index in HDFS under /solr/batch_tweets/core_node1/data/index. So far so good.

I then fired up the Hue Solr Search tool and tried customizing how search results are formatted. This works partially, but each field in a set of results is preceded by what looks like Avro type information. For example, if the template is {{text}} {{user_name}}, the results preview shows the following:

org.apache.avro.util.Utf8:tweet text 10782 org.apache.avro.util.Utf8:fake user10782

 

I also tried using avro-tools to read the sample data that the batch-tweets script pulls in for indexing:

java -jar ~/avro-tools-1.7.3.jar tojson  /usr/share/doc/search-1.0.0/examples/test-documents/sample-statuses-20120906-141433-medium.avro | less

 

The Avro files seemed to read just fine.

 

Is it possible that there's been some change to the QuickStart VM since the blog was posted last summer? Any suggestions welcome.

1 ACCEPTED SOLUTION

Super Collaborator
Make sure to run search-1.1.0.

4 REPLIES

New Contributor

A small additional piece of information: by exploring the contents of the Solr index via the Solr Admin web UI, I can see that certain fields do indeed seem to be indexed with the "org.apache.avro.util.Utf8:" prefix on the original strings. The fields in question are:

  • user_screen_name
  • user_location
  • text
  • user_name
  • source

From the batch_tweets.sh script I can see how it invokes the MapReduceIndexerTool, pointing it at the batch_tweets_indir location in HDFS (which contains the input data in Avro format). From what I understand, the morphline may be key to processing the input data in HDFS and passing it on to the indexer. Does anybody know if that's a good place to dig further, or should I look into the source code for MapReduceIndexerTool?

Super Collaborator
I think this has been fixed in more recent versions of Cloudera Search.

New Contributor

That's good to know. I believe I have the most recent QuickStart VM (4.4.0-1). Are updated versions of the VMs made available regularly? Or do you know if this is something that can be "patched" within the VM? (I'd like to demo something based on the search functionality, with a view to requesting that Search be enabled on our enterprise cluster, which runs Cloudera, though I'm not sure which CDH version yet.)

 

Super Collaborator
Make sure to run search-1.1.0.