
readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Solved

Contributor

I am using the MapReduceIndexerTool to index some tweets with the readJsonTweets.conf morphline. The job completes successfully, but no documents are created in Solr Search. With Log4j set to TRACE, I can see the Twitter data parsed successfully against the schema.xml I defined. I would welcome any help as to why no index data is stored.

 

Thanks

Shailesh


9 REPLIES

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Contributor

Hi Shailesh, are you using the --go-live option on the IndexerTool? Mike

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Contributor

Yes, this is the command line I run:

 

HADOOP_OPTS="-Djava.security.auth.login.config=/home/shailesh/jaas-client.conf" \
hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
-D 'mapred.child.java.opts=-Xmx500m' \
--log4j /opt/cloudera/parcels/CDH/share/doc/search*/examples/solr-nrt/log4j.properties \
--morphline-file /home/shailesh/twitter/morphlines/readJsonTweets.conf \
--output-dir hdfs://cloudman.sunnydale3.com:8020/user/shailesh/outdir \
--verbose --go-live \
--zk-host cloudman.sunnydale3.com:2181/solr \
--collection twitter_data \
hdfs://cloudman.sunnydale3.com:8020/user/shailesh/indir

 

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Contributor

 

Do you have access to the task logs for your Indexer job?

 

Can you confirm that you are seeing lines that look like the following (the numbers and dates will likely be different)?

 

2016-03-06 17:09:07,953 INFO [main] org.apache.solr.hadoop.SolrRecordWriter: docsWritten: 20

2016-03-06 17:09:13,984 INFO [main] org.apache.solr.hadoop.SolrRecordWriter: docsWritten: 880

 

Are there any errors that you are seeing in the map or reduce task attempt logs?

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Contributor

I have docsWritten = 0 in the reducer log below, but I don't know why. By the way, I am trying this with a single tweet at the moment.

 

This is what I see on Mapper task:

Log Type: stdout
Log Upload Time: Fri Mar 11 23:43:11 +0000 2016
Log Length: 1843
3247 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Using this unpacked directory as solr home: /yarn/nm/usercache/shailesh/appcache/application_1457642318324_0008/container_1457642318324_0008_01_000002/70457759-3833-4f68-b75c-4456e6f9f0db.solr.zip
3249 [main] INFO org.apache.solr.hadoop.HeartBeater - Heart beat reporting class is org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context
3249 [Thread-11] INFO org.apache.solr.hadoop.HeartBeater - HeartBeat thread running
3251 [Thread-11] INFO org.apache.solr.hadoop.HeartBeater - heartbeat skipped count 0
3295 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/yarn/nm/usercache/shailesh/appcache/application_1457642318324_0008/container_1457642318324_0008_01_000002/70457759-3833-4f68-b75c-4456e6f9f0db.solr.zip/'
3830 [main] INFO org.apache.solr.update.SolrIndexConfig - IndexWriter infoStream solr logging is enabled
3837 [main] INFO org.apache.solr.core.SolrConfig - Using Lucene MatchVersion: 4.10.3
3969 [main] INFO org.apache.solr.core.Config - Loaded SolrConfig: solrconfig.xml
3981 [main] INFO org.apache.solr.schema.IndexSchema - Reading Solr Schema from /yarn/nm/usercache/shailesh/appcache/application_1457642318324_0008/container_1457642318324_0008_01_000002/70457759-3833-4f68-b75c-4456e6f9f0db.solr.zip/conf/schema.xml
3995 [main] INFO org.apache.solr.schema.IndexSchema - [null] Schema name=example
4219 [main] INFO org.apache.solr.schema.IndexSchema - unique key field: id
4443 [main] INFO org.kitesdk.morphline.api.MorphlineContext - Importing commands
6557 [main] INFO org.kitesdk.morphline.api.MorphlineContext - Done importing commands
6793 [main] INFO org.apache.solr.hadoop.morphline.MorphlineMapRunner - Processing file hdfs://cloudman.sunnydale3.com:8020/user/shailesh/indir/tiny.data

 

And on the reducer, it does indeed say docsWritten is 0:

5895 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: flush at getReader
5895 [main] INFO org.apache.solr.update.LoggingInfoStream - [DW][main]: startFullFlush
5895 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: apply all deletes during flush
5895 [main] INFO org.apache.solr.update.LoggingInfoStream - [BD][main]: prune sis=segments_1: minGen=9223372036854775807 packetCount=0
5897 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: return reader version=1 reader=StandardDirectoryReader(segments_1:1:nrt)
5898 [main] INFO org.apache.solr.update.LoggingInfoStream - [DW][main]: main finishFullFlush success=true
5898 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: getReader took 3 msec
5914 [main] WARN org.apache.solr.rest.ManagedResourceStorage - Cannot write to config directory /yarn/nm/usercache/shailesh/appcache/application_1457642318324_0008/container_1457642318324_0008_01_000003/70457759-3833-4f68-b75c-4456e6f9f0db.solr.zip/conf; switching to use InMemory storage instead.
5915 [main] INFO org.apache.solr.rest.RestManager - Initializing RestManager with initArgs: {}
5932 [main] INFO org.apache.solr.rest.ManagedResourceStorage - Reading _rest_managed.json using InMemoryStorage
5932 [main] WARN org.apache.solr.rest.ManagedResource - No stored data found for /rest/managed
5937 [main] INFO org.apache.solr.rest.ManagedResourceStorage - Saved JSON object to path _rest_managed.json using InMemoryStorage
5937 [main] INFO org.apache.solr.rest.RestManager - Initializing 0 registered ManagedResources
5958 [main] INFO org.apache.solr.core.CoreContainer - registering core: core1
5977 [main] INFO org.apache.solr.hadoop.HeartBeater - Heart beat reporting class is org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context
5977 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - docsWritten: 0
5978 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - HeartBeat thread running
5979 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - heartbeat skipped count 0
5999 [main] INFO org.apache.solr.update.UpdateHandler - start commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
6000 [main] INFO org.apache.solr.update.UpdateHandler - No uncommitted changes. Skipping IW.commit.
6003 [main] INFO org.apache.solr.update.UpdateHandler - end_commit_flush
6011 [main] INFO org.apache.solr.hadoop.BatchWriter - Optimizing Solr: forcing merge down to 1 segments
6011 [main] INFO org.apache.solr.update.UpdateHandler - start commit{,optimize=true,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: forceMerge: index now
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: now flush at forceMerge
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: start flush: applyAllDeletes=true
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: index before flush
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [DW][main]: startFullFlush
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [DW][main]: main finishFullFlush success=true
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: apply all deletes during flush
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [BD][main]: prune sis=segments_1: minGen=9223372036854775807 packetCount=0
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [TMP][main]: findForcedMerges maxSegmentCount=1 infos= segmentsToMerge={}
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [CMS][main]: now merge
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [CMS][main]: index:
6012 [main] INFO org.apache.solr.update.LoggingInfoStream - [CMS][main]: no more merges pending; now return
6012 [main] INFO org.apache.solr.update.UpdateHandler - No uncommitted changes. Skipping IW.commit.
6017 [main] INFO org.apache.solr.update.UpdateHandler - end_commit_flush

 

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Contributor

How did you generate the morphline file?

 

Have you had a chance to look at our tutorials? Specifically, http://www.cloudera.com/documentation/enterprise/latest/topics/search_batch_index_use_mapreduce.html gives an example morphline file -- can you compare?

 

Also, you can try running with --dry-run for faster test iterations and to better debug what is going on.

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Contributor

I followed the tutorial to set up the MapReduceIndexerTool run, and I used the example readJsonTweets.conf provided with the examples in CDH 5.6:

 

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        readJsonTestTweets {
          isLengthDelimited : false
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

The only thing that is missing is the SOLR_LOCATOR, which I specify on the MapReduceIndexerTool command line. Does that make a difference?

 

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Contributor

Looking at the reducer logs, I see the following error, which refers to collection1, a core I don't have anywhere in my Solr setup. Does this imply ZooKeeper config corruption? I tried to reinitialise the Solr config in ZooKeeper using CM's Solr Initialize, solrctl init --force, and zookeeper-client rmr /solr, but none of these successfully completed the initialisation. I also removed the usercache on all the NodeManager/DataNode nodes, but that didn't cure the problem either.

 

6588 [main] INFO org.apache.solr.hadoop.HeartBeater - Heart beat reporting class is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
6591 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Using this unpacked directory as solr home: /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip
6592 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Creating embedded Solr server with solrHomeDir: /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip, fs: DFS[DFSClient[clientName=DFSClient_attempt_1457801342867_0014_r_000000_0_1776230373_1, ugi=shailesh (auth:SIMPLE)]], outputShardDir: hdfs://cloudman.sunnydale3.com:8020/user/shailesh/outdir/reducers/_temporary/1/_temporary/attempt_1457801342867_0014_r_000000_0/part-r-00000
6590 [Thread-22] INFO org.apache.solr.hadoop.HeartBeater - HeartBeat thread running
6609 [Thread-22] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
6625 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/'
6896 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Constructed instance information solr.home /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip (/yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip), instance dir /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/, conf dir /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/conf/, writing index to solr.data.dir hdfs://cloudman.sunnydale3.com:8020/user/shailesh/outdir/reducers/_temporary/1/_temporary/attempt_1457801342867_0014_r_000000_0/part-r-00000/data, with permdir hdfs://cloudman.sunnydale3.com:8020/user/shailesh/outdir/reducers/_temporary/1/_temporary/attempt_1457801342867_0014_r_000000_0/part-r-00000
6908 [main] INFO org.apache.solr.core.ConfigSolr - Loading container configuration from /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/solr.xml
6912 [main] INFO org.apache.solr.core.ConfigSolr - /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/solr.xml does not exist, using default configuration
7184 [main] INFO org.apache.solr.core.CoreContainer - New CoreContainer 1025566832
7184 [main] INFO org.apache.solr.core.CoreContainer - Loading cores into CoreContainer [instanceDir=/yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/]
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting socketTimeout to: 0
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting urlScheme to: null
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting connTimeout to: 0
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maxConnectionsPerHost to: 20
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting corePoolSize to: 0
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maximumPoolSize to: 2147483647
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maxThreadIdleTime to: 5
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting sizeOfQueue to: -1
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting fairnessPolicy to: false
7198 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting useRetries to: false
7362 [main] INFO org.apache.solr.logging.LogWatcher - SLF4J impl is org.slf4j.impl.Log4jLoggerFactory
7363 [main] INFO org.apache.solr.logging.LogWatcher - Registering Log Listener [Log4j (org.slf4j.impl.Log4jLoggerFactory)]
7364 [main] INFO org.apache.solr.core.CoreContainer - Host Name:
7424 [coreLoadExecutor-5-thread-1] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/collection1/'
7461 [coreLoadExecutor-5-thread-1] ERROR org.apache.solr.core.CoreContainer - Error creating core [collection1]: Could not load conf for core collection1: Error loading solr config from /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/collection1/conf/solrconfig.xml
org.apache.solr.common.SolrException: Could not load conf for core collection1: Error loading solr config from /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/collection1/conf/solrconfig.xml
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:68)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:496)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:262)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:256)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: Error loading solr config from /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/collection1/conf/solrconfig.xml
at org.apache.solr.core.SolrConfig.readFromResourceLoader(SolrConfig.java:154)
at org.apache.solr.core.ConfigSetService.createSolrConfig(ConfigSetService.java:82)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
... 7 more
Caused by: org.apache.solr.core.SolrResourceNotFoundException: Can't find resource 'solrconfig.xml' in classpath or '/yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/collection1/conf'
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:362)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:308)
at org.apache.solr.core.Config.<init>(Config.java:117)
at org.apache.solr.core.Config.<init>(Config.java:87)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:167)
at org.apache.solr.core.SolrConfig.readFromResourceLoader(SolrConfig.java:145)
... 9 more
7466 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/'
7511 [main] INFO org.apache.solr.update.SolrIndexConfig - IndexWriter infoStream solr logging is enabled
7517 [main] INFO org.apache.solr.core.SolrConfig - Using Lucene MatchVersion: 4.10.3
7619 [main] INFO org.apache.solr.core.Config - Loaded SolrConfig: solrconfig.xml
7634 [main] INFO org.apache.solr.schema.IndexSchema - Reading Solr Schema from /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/conf/schema.xml
7647 [main] INFO org.apache.solr.schema.IndexSchema - [core1] Schema name=example
7865 [main] INFO org.apache.solr.schema.IndexSchema - unique key field: id
8063 [main] INFO org.apache.solr.core.ConfigSetProperties - Did not find ConfigSet properties, assuming default properties: Can't find resource 'configsetprops.json' in classpath or '/yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/conf'
8064 [main] INFO org.apache.solr.core.CoreContainer - Creating SolrCore 'core1' using configuration from instancedir /yarn/nm/usercache/shailesh/appcache/application_1457801342867_0014/container_1457801342867_0014_01_000020/39534f41-0b65-4b38-9d02-52c443e85da4.solr.zip/
8120 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - Solr Kerberos Authentication disabled

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Expert Contributor

Looks like you are missing a loadSolr command in your morphline; see the example at http://www.cloudera.com/documentation/enterprise/latest/topics/search_batch_index_use_mapreduce.html...

 

(FYI, with MapReduceIndexerTool the SOLR_LOCATOR is substituted from whatever is specified on the CLI with the --zk-host option.)
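
Concretely, a minimal sketch of what the corrected morphline could look like, assuming the fix is just to append a loadSolr command (and optionally sanitizeUnknownSolrFields, to drop record fields that are not declared in your schema.xml). The SOLR_LOCATOR values below are placeholders; with MapReduceIndexerTool they are overridden by what you pass via --zk-host and --collection:

```
# Placeholder locator; MapReduceIndexerTool substitutes these values
# from the --zk-host and --collection CLI options.
SOLR_LOCATOR : {
  collection : twitter_data
  zkHost : "cloudman.sunnydale3.com:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      { readJsonTestTweets { isLengthDelimited : false } }

      # Optional: drop fields that schema.xml does not declare, so
      # indexing does not fail on unknown fields.
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # The missing piece: actually send each record to Solr.
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

Without the final loadSolr step, each record is parsed and logged but then discarded, which matches the docsWritten: 0 you are seeing in the reducer.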

Re: readJsonTweets Morphline using the MapReduceIndexerTool Results in No Documents in Solr

Contributor

Excellent, that solved the problem, along with some unknown fields in the schema.xml, for which I have now used sanitizeUnknownSolrFields in the morphline. Thanks for your help.