
Solr indexing on hive table


New Contributor

Team,

 

We are planning to index Hive tables in Cloudera Solr so that we can find related tables through data search. We could not find any documentation on the Cloudera site for this setup. The link below gives a generic description of how to index Hive tables using Solr, but it requires building the JAR with a third-party tool (Gradle), and we are not sure whether it supports Cloudera Solr.

https://github.com/lucidworks/hive-solr

Could you please guide me on how to index Hive tables in Cloudera Solr? Thanks.

13 REPLIES 13

Re: Solr indexing on hive table

Super Collaborator
What format are your Hive tables in? Depending on that format, the MapReduceIndexerTool/morphlines can read the Hive table files and import them into Solr.

-pd

Re: Solr indexing on hive table

Explorer
I am trying to create a Solr index on Hive views. How can I achieve this? If there is any documentation on it, please share it. I went through this document (https://www.cloudera.com/documentation/enterprise/5-8-x/topics/search_batch_index_use_mapreduce.html), but it does not help in my case.

Re: Solr indexing on hive table

Rising Star

Did anyone find a way to index a Hive table?

Re: Solr indexing on hive table

Contributor

Hi,

 

Has anyone found a working solution so far?

 

Regards,

MG

Re: Solr indexing on hive table

Super Collaborator

You have basically two options:

- Either the file format is simple enough and you can index it directly using the MapReduceIndexerTool, as suggested by pdvorak (you access the files directly).

- Or the file format is too complicated (or dynamic), and then you need to code your own indexer that runs the query on Hive, gets the result, and pushes it to Solr.
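
To make the second option concrete, here is a minimal sketch of such a custom indexer in Python. It is only an illustration, not something tested against a CDH cluster. It assumes you already have a DB-API style cursor connected to Hive (for example from a client library such as PyHive or impyla), that the query returns a non-null column usable as the Solr unique key, and that the Solr collection accepts JSON documents on its standard `/update` endpoint. All function names and parameters are made up for this example, and authentication (e.g. Kerberos) is not handled.

```python
import json
import urllib.request
from itertools import islice


def rows_to_docs(columns, rows, id_column):
    """Turn result rows into Solr documents, one dict per row.

    NULL values are dropped; id_column (assumed non-null) becomes the Solr "id".
    """
    for row in rows:
        doc = {col: val for col, val in zip(columns, row) if val is not None}
        doc["id"] = str(doc[id_column])
        yield doc


def batched(iterable, size):
    """Yield lists of at most `size` items so documents are sent in batches."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_to_solr(update_url, docs):
    """POST a JSON array of documents to a Solr update URL (assumed reachable)."""
    req = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_query(cursor, sql, id_column, update_url, batch_size=500, post=post_to_solr):
    """Run `sql` on the given cursor and push the result to Solr in batches."""
    cursor.execute(sql)
    columns = [d[0] for d in cursor.description]
    rows = iter(cursor.fetchone, None)  # stream rows until fetchone() returns None
    total = 0
    for chunk in batched(rows_to_docs(columns, rows, id_column), batch_size):
        post(update_url, chunk)
        total += len(chunk)
    return total
```

You would call `index_query(cursor, "SELECT ...", "your_key_column", "http://<solr-host>:<port>/solr/<collection>/update?commit=true")`, with the host, port, and collection replaced by your own values. Note that everything goes through a single client process, so for large tables this has the same single-node bottleneck as any hand-rolled loader.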

Re: Solr indexing on hive table

Contributor

So, the takeaway is that there isn't an official indexer for Hive tables (just like for MySQL).

Is it likely to appear in the future? Or does it even make sense?

 

I mean, I see a clear use case behind that. If Solr can index Hive tables, it would become so easy to make your Hadoop data searchable.

 

Regards,

MG


Re: Solr indexing on hive table

Super Collaborator

You should look at this: https://chimpler.wordpress.com/2013/03/20/playing-with-apache-hive-and-solr/

 

The content seems to be what you are looking for.

I have not tested it myself.

 

regards,

Mathieu

Re: Solr indexing on hive table

Contributor

Hi Mathieu,

 

Thanks for sharing that link; I will test with it.

 

However, I am a little skeptical about deploying it in production even if it works.

 

Does Cloudera have any plans to develop and release such a connector/handler/library?

 

As I mentioned previously, this seems to be a valid use case for allowing users to be able to search through Hive tables.

 

Regards,

MG

Re: Solr indexing on hive table

Super Collaborator
We generally discourage the use of DataImportHandlers (DIH) with CDH, for the following reasons:

1> It's a black box: you simply cannot debug what goes wrong.
2> It's not SolrCloud aware.
3> It loads a single server with the entire workload of doing the DIH thing (possibly Tika extraction, database connectivity, etc.), creating a bottleneck.
3a> The ingestion process is constrained by the time it takes to run the entire data set through a single node running a single-threaded import process.
4> It encourages indexing tables and then treating Solr like an RDBMS, not a search engine.
5> It is rigid; its model dates from the very old Solr single-core (not even sharded!) days. And if the people who put work into it didn't anticipate your needs, you have no recourse.
6> The configuration is arcane. You'll spend as much or more time trying to understand the configuration process as you'd spend with Flume/MRIT/Morphlines.
6a> If you run into issues with DIH, the next solution is to use morphlines anyway.
7> For complex queries you often hit OOM issues _or_ the import process is terribly slow; its ability to cache sub-queries is limited.
8> It doesn't scale. DIH runs on the Solr nodes. It was never written/supported to, say, run simultaneously on N Solr servers and distribute the load.
9> Cloudera doesn't support it as an ingestion process.
10> The ability to modify the Solr documents is extremely limited, with little/no real chance of making it better.
11> It doesn't understand HDFS, so importing files (as opposed to simple tables) isn't likely to work in a CDH installation.

Our engineering team outlined these points to explain why the MapReduceIndexerTool (MRIT) is preferred over DIH.

One recommendation: have the process that creates the Hive tables also index the data into Solr at the same time (via the Flume MorphlineSolrSink or MRIT). The data will then be available in Hive and searchable in Solr.

Hue can also be used to index data into Solr from files in HDFS (as of Hue 3.11, in CDH 5.9 and above). Please see the following tutorial:
http://gethue.com/easy-indexing-of-data-into-solr/

This leverages YARN and MRIT to index files in HDFS.

As noted in the blog post, the following are supported:

- CSV files
- Hue log files
- Combined Apache log files
- Ruby log files
- Syslog

Beyond files, metastore tables and Hive SQL queries are also supported.

-pd