Cloudera Employee
Posts: 177
Registered: 01-09-2014

Re: Solr indexing on hive table

We generally discourage the use of DataImportHandlers with CDH, for the following reasons:

1> It's a black box; when something goes wrong, you have no practical way to debug it.
2> It's not SolrCloud-aware.
3> It loads a single server with the entire workload: running DIH, possibly Tika extraction, database connectivity, etc., creating a bottleneck.
3a> The ingestion process is constrained to the time it takes to push the entire data set through a single node, effectively a single-threaded import.
4> It encourages indexing tables and then treating Solr like an RDBMS rather than a search engine.
5> It is rigid; its model dates from the very old single-core (not even sharded!) Solr days. If the people who built it didn't anticipate your needs, you have no recourse.
6> The configuration is arcane. You'll spend as much or more time understanding the configuration process as you'd spend with Flume/MRIT/Morphlines.
6a> If you run into issues with DIH, the next solution is to use Morphlines anyway.
7> For complex queries you often hit OOM issues, or the import process is terribly slow; its ability to cache sub-queries is limited.
8> It doesn't scale. DIH runs on the Solr nodes. It was never written or supported to, say, run simultaneously on N Solr servers and distribute the load.
9> Cloudera doesn't support it as an ingestion process.
10> The ability to modify the Solr documents is extremely limited with little/no real chance of making it better.
11> It doesn't understand HDFS so importing files as opposed to simple tables isn't likely in a CDH installation.

These points were outlined by our engineering team as the reasons MRIT is the preferred method instead of DIH.

One recommendation: whatever processing is creating the Hive tables could also index the data into Solr at the same time (via the Flume MorphlineSolrSink or MRIT). You then have the data available in Hive and searchable in Solr.

Hue can also be used to index data into Solr from files in HDFS (as of Hue 3.11 in CDH 5.9 and above); please see the following tutorial for doing so:

This leverages yarn and MRIT to index files in hdfs.
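For reference, an MRIT run from the command line looks roughly like the sketch below. The jar path, ZooKeeper quorum, collection name, and HDFS paths are placeholders you would substitute for your own environment:

```shell
# Sketch of a MapReduceIndexerTool invocation (all paths/hosts are
# placeholders for your cluster). --go-live merges the index shards
# built by the MapReduce job into the live SolrCloud collection.
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://nameservice1/tmp/solr-output \
  --zk-host zk1.example.com:2181/solr \
  --collection my_collection \
  --go-live \
  hdfs://nameservice1/user/hive/warehouse/my_table
```

Because the indexing runs as a MapReduce job, the work is spread across the cluster rather than concentrated on a single Solr node, which is exactly the bottleneck DIH suffers from.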

As noted in the blog post, the following are supported:

CSV Files
Hue Log Files
Combined Apache Log Files
Ruby Log File
Beyond files, metastore tables and Hive SQL queries are also supported.

Posts: 11
Topics: 0
Kudos: 1
Solutions: 0
Registered: 06-14-2017

Re: Solr indexing on hive table

pdvorak wrote:
Beyond files, metastore tables and Hive SQL queries are also supported.

Does that part of your answer suggest that MRIT supports Hive queries as a data source for Solr indexing?

If yes, how?

Posts: 19
Registered: 02-15-2016

Re: Solr indexing on hive table

Thanks for the detailed explanation of the issues with the DIH approach.


I agree that it's better to send data to Solr while ingesting it into HDFS/Hive tables, but what about data that is already sitting in Hive tables for a different type of use case?


Assume a scenario where the initial use case is to bring RDBMS data from two different sources into Hive tables and mash them up. In this case the data will be in some container format such as Parquet. After the initial use case is proved, i.e. the data mashup is done for both sources and an ongoing processing pipeline is defined, another use case comes up where you need to search through that data. How do you think that would be achieved?

Cloudera Employee
Posts: 177
Registered: 01-09-2014

Re: Solr indexing on hive table

What is the format of your Hive tables in HDFS? You can use MRIT [1] to index files in HDFS with the appropriate morphline read commands. If they are CSV, you would just need the readCSV command; for Avro or Parquet files, you could use readAvro or readAvroParquetFile.
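As a rough illustration, a morphline for CSV-backed table files might look like the following. The collection name, ZooKeeper quorum, separator, and column names are all assumptions you would adapt to your schema:

```
# morphline.conf — sketch only; collection, zkHost, and columns are placeholders
SOLR_LOCATOR : {
  collection : my_collection            # assumed collection name
  zkHost : "zk1.example.com:2181/solr"  # assumed ZooKeeper quorum
}
morphlines : [
  {
    id : csvToSolr
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          columns : [id, name, amount]  # assumed column names
          trim : true
          charset : UTF-8
        }
      }
      # Drop any fields not present in the Solr schema.
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      # Load each record into Solr.
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

For Parquet-backed tables the readCSV command would be swapped for readAvroParquetFile, with field extraction adjusted accordingly.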