
Morphlines MapReduceIndexerTool - CSV to Solr - Java heap space - GC overhead limit exceeded

Hi.

I have a working Morphlines file, which I've verified with a standalone regular Java class.

It works: the morphline just reads CSV records and loads them into Solr, fairly simple.
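For reference, the morphline is essentially the standard readCSV-then-loadSolr pipeline from the Kite Morphlines docs (the column names below are placeholders, not my actual schema):

```
morphlines : [
  {
    id : morphline1
    # Import the Kite and Solr morphline command packages
    importCommands : ["org.kitesdk.**", "org.apache.solr.morphlines.**"]
    commands : [
      {
        readCSV {
          separator : ","
          # Placeholder column names; my real file has its own fields
          columns : [id, text]
          ignoreFirstLine : true
          charset : UTF-8
        }
      }
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
```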

But when I use MapReduceIndexerTool with the same Morphlines file, it fails with a heap memory error (see output below).

By the way, the test file I'm using is less than 1 KB, so input size can't be the cause.

 

This is how I'm running it through MapReduceIndexerTool:

 

 sudo hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
  /usr/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.8.0-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --log4j /usr/share/doc/search-1.0.0+cdh5.8.0+0/examples/solr-nrt/log4j.properties \
  --morphline-file   /path/test_morphlines.conf \
  --output-dir hdfs://localhost:8020/hdfspath/outdir --verbose --go-live \
  --zk-host 127.0.0.1:2181/solr --collection test-memory  \
    $1

(where $1 is the HDFS file). 

This is the output (some lines omitted):

 

[cloudera@quickstart solr_test_diego]$ sh launcher-MRIT.sh <hdfs-file-path>
0 [main] INFO org.apache.solr.common.cloud.SolrZkClient - Using default ZkCredentialsProvider
70 [main] INFO org.apache.solr.common.cloud.ConnectionManager - Waiting for client to connect to ZooKeeper
114 [main-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager - Watcher org.apache.solr.common.cloud.ConnectionManager@6aeca5f5 name:ZooKeeperConnection Watcher:127.0.0.1:2181/solr got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
124 [main] INFO org.apache.solr.common.cloud.ConnectionManager - Client is connected to ZooKeeper
124 [main] INFO org.apache.solr.common.cloud.SolrZkClient - Using default ZkACLProvider
165 [main] INFO org.apache.solr.common.cloud.ZkStateReader - Updating cluster state from ZooKeeper...
868 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Cluster reports 10 mapper slots
1581 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Cluster reports 2 reduce slots
1582 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Using these parameters: numFiles: 1, mappers: 80, realMappers: 1, reducers: 1, shards: 1, fanout: 2147483647, maxSegments: 1
...
1790 [main] INFO org.apache.solr.common.cloud.ConnectionManager - Client is connected to ZooKeeper
1790 [main] INFO org.apache.solr.common.cloud.SolrZkClient - Using default ZkACLProvider
1799 [main] INFO org.apache.solr.hadoop.ZooKeeperInspector - Load collection config from:/collections/test-memory
1864 [main] INFO org.apache.solr.cloud.ZkController - Write file /tmp/1486564786933-0/admin-extra.menu-top.html
1897 [main] INFO org.apache.solr.cloud.ZkController - Write file /tmp/1486564786933-0/currency.xml
...
[main] INFO org.apache.solr.cloud.ZkController - Write file /tmp/1486564786933-0/clustering/carrot2/kmeans-attributes.xml
2934 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 1 files using 1 real mappers into 1 reducers
Error: GC overhead limit exceeded
Error: Java heap space
Error: Java heap space
202839 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool - Job failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper, jobId: job_1486138592906_0008

  

I have set map memory to high values, to no effect: 

 

-D 'mapred.child.java.opts=-Xmx4096m mapred.map.child.java.opts=-Xmx2048m \
mapred.reduce.child.java.opts=-Xmx2048m' \
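I also wonder whether the problem is that I crammed several properties into a single -D (so perhaps only the first one is applied) and used the deprecated MR1 mapred.* names, which YARN may ignore. A corrected attempt might look like this (untested sketch on my part, other flags as in the command above):

```shell
# Each property gets its own -D, using the YARN/MR2 property names
# (mapred.child.java.opts etc. are the deprecated MR1 names).
sudo hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
  /usr/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.8.0-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapreduce.map.java.opts=-Xmx2048m' \
  -D 'mapreduce.reduce.java.opts=-Xmx2048m' \
  ...
```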

Could this be caused by some YARN setting? I have also increased these settings in the Cloudera Manager configuration for YARN:

 

 

mapreduce.map.memory.mb 1G
mapreduce.reduce.memory.mb 1G
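One thing I'm unsure about: I believe mapreduce.map.memory.mb takes a plain integer in megabytes (so 1 GB would be 1024), and the container size needs to be larger than the -Xmx heap requested for the mapper JVM. Something like this is what I mean (the numbers are guesses):

```shell
# Container size in MB; heap (-Xmx) should stay below it
# to leave headroom for the JVM's non-heap memory.
-D 'mapreduce.map.memory.mb=4096' \
-D 'mapreduce.map.java.opts=-Xmx3276m' \
```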

 

I'm using the CDH 5.8 sandbox, virtualized with 30 GB of memory (peak memory consumption reaches about 11 GB of the 30 GB, so the physical memory available does not seem to be the issue). I previously had memory problems with Solr itself, but solved them by increasing the heap memory and direct buffer memory.

 

As a side note, I have also noticed the following lines in the trace above:

 

868  [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Cluster reports 10 mapper slots
1581 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Cluster reports 2 reduce slots
1582 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Using these parameters: numFiles: 1, mappers: 80, realMappers: 1, reducers: 1, shards: 1, fanout: 2147483647, maxSegments: 1

I don't really understand these numbers. What does "mappers: 80" mean here? I guess that realMappers (1) is the effective value, but where do these 80 come from? Does this have anything to do with my problem?
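To make my guess concrete, here is the arithmetic I think the tool is doing (pure speculation on my part, written in Python just to show the numbers):

```python
# Speculative reading of the log line
# "numFiles: 1, mappers: 80, realMappers: 1, ... fanout: 2147483647":
mappers = 80    # some cluster-derived mapper budget?
num_files = 1   # numFiles from the log

# Presumably the tool can't use more mappers than there are input files:
real_mappers = min(mappers, num_files)
print(real_mappers)  # 1, matching "realMappers: 1"

# fanout = 2147483647 is Java's Integer.MAX_VALUE,
# which reads to me like "effectively unlimited":
print(2**31 - 1)  # 2147483647
```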

 

And what does the fanout parameter mean here?

 

Thanks! 
