New Contributor
Posts: 1
Registered: ‎09-08-2016

MapReduceIndexerTool erroring with max_array_length

Hello, 

I am trying to use the MapReduceIndexerTool to index data in a Hive table into SolrCloud / Cloudera Search.

The tool is failing the job with the following error 

 

1799 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 1 files using 1 real mappers into 10 reducers
Error: MAX_ARRAY_LENGTH
Error: MAX_ARRAY_LENGTH
Error: MAX_ARRAY_LENGTH
36962 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool  - Job failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper, jobId: job_1473161870114_0339

 

The error stack trace is 

2016-09-08 10:39:20,128 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: MAX_ARRAY_LENGTH
	at org.apache.lucene.codecs.memory.DirectDocValuesFormat.<clinit>(DirectDocValuesFormat.java:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at java.lang.Class.newInstance(Class.java:374)
	at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:67)
	at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:47)
	at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:37)
	at org.apache.lucene.codecs.DocValuesFormat.<clinit>(DocValuesFormat.java:43)
	at org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:205)
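
For what it's worth, a NoSuchFieldError thrown while a Lucene codec class initializes normally means the class was compiled against a lucene-core that has the field while an older lucene-core actually gets loaded in the mapper, i.e. two Lucene versions end up on the task classpath. A rough way to look for duplicate lucene-core jars on a node is something like the following; the parcel and package paths are only assumptions about where conflicting jars might live, so adjust them for the actual cluster:

# Assumed locations only; multiple lucene-core versions reachable from the
# MR task classpath can trigger a NoSuchFieldError during codec initialization.
find /opt/cloudera/parcels /usr/lib /usr/share -name 'lucene-core-*.jar' 2>/dev/null | sort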

 

 

My Schema.xml looks like

 

<fields>
   <field name="dataset_id" type="string" indexed="true" stored="true" required="true" multiValued="false" docValues="true" />
   <field name="search_string" type="string" indexed="true" stored="true" docValues="true"/>
   <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

 

 

<!-- Field to use to determine and enforce document uniqueness.
     Unless this field is marked with required="false", it will be a required field
  -->
<uniqueKey>dataset_id</uniqueKey>

 

 

I am otherwise able to post documents using the Solr APIs / upload methods. Only the MapReduceIndexerTool is failing.
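
For example, a direct update along the following lines goes through fine; SOLRHOST and the sample document are placeholders, and the collection is the one from my command below:

# SOLRHOST and the document body are placeholders; this just posts one JSON
# document to the collection's update handler and commits it.
curl 'http://SOLRHOST:8983/solr/dataCatalog_search_index/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"dataset_id": "example-1", "search_string": "example text"}]'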

 

The command I am using is 

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool -D 'mapred.child.java.opts=-Xmx500m' --log4j /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/share/doc/search-1.0.0+cdh5.7.0+0/examples/solr-nrt/log4j.properties --morphline-file /home/$USER/morphline2.conf --output-dir hdfs://NNHOST:8020/user/$USER/outdir --verbose --zk-host ZKHOST:2181/solr1 --collection dataCatalog_search_index hdfs://NNHOST:8020/user/hive/warehouse/name.db/concatenated_index4/;

 

My morphline config looks like

 

SOLR_LOCATOR : {
  # Name of solr collection
  collection : search_index

  # ZooKeeper ensemble
  zkHost : "ZKHOST:2181/solr1"
}

 

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more (potentially
# nested) commands. A morphline is a way to consume records (e.g. Flume events,
# HDFS files or blocks), turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.

morphlines : [
  {
    id : search_index
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        readCSV {
          separator : ","
          columns : [dataset_id,search_string]
          ignoreFirstLine : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # Command that deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that isn't specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]

 

 

Please let me know if I am doing anything wrong.

 

Expert Contributor
Posts: 173
Registered: ‎09-29-2014

Re: MapReduceIndexerTool erroring with max_array_length

If it fails while building Hive data into Solr, please try building the data from HBase into Solr instead.

To do that, first create a Hive external table that is linked to an HBase table, then insert your Hive data into that external table; a sketch of this step follows below.
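
Here is a minimal sketch of that step, assuming a source Hive table named my_hive_table with the two columns from your schema and an existing HBase table named my_hbase_table with column family cf; all of these names are placeholders:

# All table and column-family names are placeholders; the HBase table
# (my_hbase_table, column family cf) must already exist for the EXTERNAL table.
hive -e "
CREATE EXTERNAL TABLE hbase_search_index (dataset_id STRING, search_string STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:search_string')
TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table');

INSERT OVERWRITE TABLE hbase_search_index
SELECT dataset_id, search_string FROM my_hive_table;
"

After that, run the HBase indexer MR job against the HBase table with a morphline-hbase-mapper.xml, as in the command below.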

 

I have loaded HBase data into Solr this way many times without any problem.

 

hadoop --config /etc/hadoop/conf \
  jar /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml -D 'mapred.child.java.opts=-Xmx500m' \
  --hbase-indexer-file /opt/hbase-indexers/saic_sms_flow/morphline-hbase-mapper.xml \
  --zk-host jq-zk03.hadoop,jq-zk02.hadoop,jq-zk01.hadoop/solr --collection saic_sms_flow --reducers 0