Expert Contributor
Posts: 88
Registered: ‎09-17-2014
Accepted Solution

Solr index not all dataset

Hi dear experts!

Recently I ran into some very strange Solr behavior: it does not index the whole dataset.

I am trying to index CSV files (around 0.5 TB in total).

But in the results I can see that not all rows are indexed (to be clear, only one copy of the data ends up in the index)...

 

Example:

Impala query:

select * from dpi2
where
msisdn=9851031305

 

 

returns:

   msisdn      sgsn             local_ip        local_port  external_ip   external_port  translated_ip   translated_port  site      uri                   _c10                  _c11                  bytes
0  9851031305  NameOfSomeNode2  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  MXAXWVMLKRMM9NLYQGFF  2014-11-30T03:11:18Z  2014-11-30T03:17:17Z  8073
1  9851031305  NameOfSomeNode8  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  WOOHQAFFWTQUNO93XNDH  2014-11-30T03:11:18Z  2014-11-30T03:12:42Z  445
2  9851031305  NameOfSomeNode2  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  XKSLQUY2ROYYD1YMPVOI  2014-11-30T03:11:18Z  2014-11-30T03:17:21Z  2065

 

...and thousands more rows.

But a Solr search (msisdn:9851031305) returns only one row:

{"showDetails":false,"session_time":"[u'2014-11-30T03:11:18Z']","event_time":"[u'2014-11-30T03:15:58Z']","local_ip":"[u'192.168.11.187']","msisdn":["9851031305"],"sgsn":"[u'NameOfSomeNode3']","translated_ip":"[u'213.10.126.195']","bytes":"[4866]","site":"[u'bing.com']","url":"[u'RHHIAU0AIZCSXDDVLUU0']","translated_port":"[u'4041']","external_ip":"[u'10.62.63.170']","local_port":"[u'6455']","_version_":1490445936214671400,"external_port":"[u'143']","id":"aec90d3e-2ecd-4be9-8f4a-572819c1a127","details":[]}
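
A quick way to double-check how many documents Solr really holds for this msisdn is to ask for the match count only (rows=0) and read numFound from the response. Below is a minimal Python sketch; the host, port, and the collection name "dpi" are assumptions, and the sample response body is made up purely to show the shape of Solr's JSON reply.

```python
import json
from urllib.parse import urlencode


def count_url(base, collection, query):
    """Build a Solr select URL that returns only the match count (rows=0)."""
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{base}/solr/{collection}/select?{params}"


def num_found(body):
    """Extract numFound from a Solr JSON response body."""
    return json.loads(body)["response"]["numFound"]


# Hypothetical host and collection name; adjust to your cluster.
url = count_url("http://localhost:8983", "dpi", "msisdn:9851031305")

# Made-up sample body, only illustrating the response structure.
sample = '{"response": {"numFound": 1, "start": 0, "docs": []}}'

print(url)
print(num_found(sample))
```

If numFound is far below the row count that Impala reports for the same filter, documents are being dropped or overwritten during indexing rather than hidden by the search UI.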

 

schema.xml file:

 

<fields>
   <field name="msisdn"          type="string" indexed="true" stored="true" multiValued="true" />
   <field name="sgsn"            type="string" indexed="true" stored="true" multiValued="true" />
   <field name="local_ip"        type="string" indexed="true" stored="true" multiValued="true" />
   <field name="local_port"      type="string" indexed="true" stored="true" multiValued="true" />
   <field name="external_ip"     type="string" indexed="true" stored="true" multiValued="true" />
   <field name="external_port"   type="string" indexed="true" stored="true" multiValued="true" />
   <field name="translated_ip"   type="string" indexed="true" stored="true" multiValued="true" />
   <field name="translated_port" type="string" indexed="true" stored="true" multiValued="true" />
   <field name="site"            type="string" indexed="true" stored="true" multiValued="true" />
   <field name="url"             type="string" indexed="true" stored="true" multiValued="true" />
   <field name="session_time"    type="date"   indexed="true" stored="true" multiValued="true" />
   <field name="event_time"      type="date"   indexed="true" stored="true" multiValued="true" />
   <field name="bytes"           type="tint"   indexed="true" stored="true" multiValued="true" />
   <field name="_version_"       type="long"   indexed="true" stored="true" multiValued="false" />
   <field name="id"              type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <dynamicField name="ignored_*" type="ignored"/>
</fields>

<uniqueKey>id</uniqueKey>

 

Any help is much appreciated!

 

Cloudera Employee
Posts: 13
Registered: ‎10-16-2013

Re: Solr index not all dataset

I see your schema is using "id" as the unique key.  What values do you populate that field with?

Expert Contributor
Posts: 88
Registered: ‎09-17-2014

Re: Solr index not all dataset

Hi!

 

There is no hidden meaning. I just don't have a unique ID, so I use a randomly generated one for that...

Cloudera Employee
Posts: 13
Registered: ‎10-16-2013

Re: Solr index not all dataset

Are you using a morphline for indexing? If so, can you post your file here?

Cloudera Employee
Posts: 25
Registered: ‎08-22-2014

Re: Solr index not all dataset

In your morphline file (reviews.conf), change the order of the commands: move { generateUUID { field : id } } down so that it comes after readCSV:

morphlines : [
  {
    id : dpi1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          columns : [msisdn,sgsn,local_ip,local_port,external_ip,external_port,translated_ip,translated_port,site,url,session_time,event_time,bytes]
          quoteChar : "\""
          charset : UTF-8
        }
      }
      { generateUUID { field : id } }
      {
        if {
          conditions : [ { equals { id : [] } } ]
          then : [ { logDebug { format : "output record: {}", args : ["@{}"] } } ]
        }
      }
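
To see why the order matters, here is a rough Python simulation (not morphline code). It assumes that readCSV emits one record per CSV line and copies any fields already present on the incoming record, so an id generated before readCSV would be shared by every line of the file. Solr replaces an existing document when a new one arrives with the same uniqueKey, which is modeled here with a plain dict. All function names are hypothetical.

```python
import uuid


def index(docs, store):
    """Mimic Solr's uniqueKey handling: a doc with an existing id replaces the old one."""
    for doc in docs:
        store[doc["id"]] = doc
    return store


def uuid_before_split(lines):
    # generateUUID runs once per file, before the file is split into lines,
    # so every per-line record inherits the same id.
    file_id = str(uuid.uuid4())
    return [{"id": file_id, "line": line} for line in lines]


def uuid_after_split(lines):
    # generateUUID runs after the split, once per line, so every record gets its own id.
    return [{"id": str(uuid.uuid4()), "line": line} for line in lines]


lines = ["row-a", "row-b", "row-c"]
before = index(uuid_before_split(lines), {})
after = index(uuid_after_split(lines), {})

print(len(before), len(after))  # 1 3
```

With the id generated first, only the last line of each file survives in the index, which matches the symptom of thousands of rows in Impala but a single hit in Solr.
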
Expert Contributor
Posts: 88
Registered: ‎09-17-2014

Re: Solr index not all dataset

Thanks it works!