Solr does not index the whole dataset

Rising Star

Hi dear experts!

Recently I ran into some very strange Solr behavior: it does not index the whole dataset.

I am trying to index CSV files (around 0.5 TB in total).

But in the results I can see that not all records are indexed (to be clear, only one copy of the data is indexed)...

 

Example:

Impala query:

select * from dpi2
where msisdn = 9851031305

returns:

   msisdn      sgsn             local_ip        local_port  external_ip   external_port  translated_ip   translated_port  site      uri                   _c10                  _c11                  bytes
0  9851031305  NameOfSomeNode2  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  MXAXWVMLKRMM9NLYQGFF  2014-11-30T03:11:18Z  2014-11-30T03:17:17Z  8073
1  9851031305  NameOfSomeNode8  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  WOOHQAFFWTQUNO93XNDH  2014-11-30T03:11:18Z  2014-11-30T03:12:42Z  445
2  9851031305  NameOfSomeNode2  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  XKSLQUY2ROYYD1YMPVOI  2014-11-30T03:11:18Z  2014-11-30T03:17:21Z  2065

 

...and thousands more rows.

But a Solr search (msisdn:9851031305) returns only one row:

{"showDetails":false,"session_time":"[u'2014-11-30T03:11:18Z']","event_time":"[u'2014-11-30T03:15:58Z']","local_ip":"[u'192.168.11.187']","msisdn":["9851031305"],"sgsn":"[u'NameOfSomeNode3']","translated_ip":"[u'213.10.126.195']","bytes":"[4866]","site":"[u'bing.com']","url":"[u'RHHIAU0AIZCSXDDVLUU0']","translated_port":"[u'4041']","external_ip":"[u'10.62.63.170']","local_port":"[u'6455']","_version_":1490445936214671400,"external_port":"[u'143']","id":"aec90d3e-2ecd-4be9-8f4a-572819c1a127","details":[]}
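One way to quantify the gap is to ask Solr only for the number of matching documents and compare it with `select count(*) from dpi2 where msisdn = 9851031305` in Impala. The following is a minimal sketch that only builds the request URL; the host, port, and the collection name `dpi_collection` are assumptions and do not come from the post. It relies on the standard `q`, `rows`, and `wt` query parameters of Solr's `/select` handler.

```python
from urllib.parse import urlencode


def solr_count_url(base_url, collection, query):
    """Build a /select URL that returns no documents, only the match count."""
    # rows=0 suppresses the document list; the count is reported as
    # response.numFound in the JSON reply.
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


# Hypothetical host and collection name, for illustration only.
url = solr_count_url("http://localhost:8983/solr", "dpi_collection", "msisdn:9851031305")
print(url)
```

If `numFound` in the reply is far below the Impala count, documents are being lost or overwritten at index time rather than hidden by the search UI.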

 

The schema.xml file:

 

 <fields>
   <field name="msisdn"          type="string" indexed="true" stored="true" multiValued="true" />
   <field name="sgsn"            type="string" indexed="true" stored="true" multiValued="true" />
   <field name="local_ip"        type="string" indexed="true" stored="true" multiValued="true" />
   <field name="local_port"      type="string" indexed="true" stored="true" multiValued="true" />
   <field name="external_ip"     type="string" indexed="true" stored="true" multiValued="true" />
   <field name="external_port"   type="string" indexed="true" stored="true" multiValued="true" />
   <field name="translated_ip"   type="string" indexed="true" stored="true" multiValued="true" />
   <field name="translated_port" type="string" indexed="true" stored="true" multiValued="true" />
   <field name="site"            type="string" indexed="true" stored="true" multiValued="true" />
   <field name="url"             type="string" indexed="true" stored="true" multiValued="true" />
   <field name="session_time"    type="date"   indexed="true" stored="true" multiValued="true" />
   <field name="event_time"      type="date"   indexed="true" stored="true" multiValued="true" />
   <field name="bytes"           type="tint"   indexed="true" stored="true" multiValued="true" />
   <field name="_version_"       type="long"   indexed="true" stored="true" multiValued="false" />
   <field name="id"              type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <dynamicField name="ignored_*" type="ignored" />
</fields>

<uniqueKey>id</uniqueKey>

 

Any support is much appreciated!

 

1 ACCEPTED SOLUTION

Rising Star
In your morphline file (reviews.conf), change the order of the commands: move { generateUUID { field : id } } down so that it comes after readCSV.

morphlines : [
  {
    id : dpi1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          columns : [msisdn,sgsn,local_ip,local_port,external_ip,external_port,translated_ip,translated_port,site,url,session_time,event_time,bytes]
          quoteChar : "\""
          charset : UTF-8
        }
      }
      { generateUUID { field : id } }
      {
        if {
          conditions : [ { equals { id : [] } } ]
          then : [ { logDebug { format : "output record: {}", args : ["@{}"] } } ]
        }
      }
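Why the order matters can be illustrated with a small simulation. This is only a sketch, not Kite or Solr code: it assumes that readCSV emits one record per CSV line and that each emitted record inherits the fields already present on the incoming record, and it models the Solr index as a dictionary keyed by the uniqueKey (`id`), so a later document with the same id replaces the earlier one. The function names and the three-column sample data are made up for illustration.

```python
import uuid


def read_csv(record, lines, columns):
    """Emit one record per CSV line; each starts as a copy of the input record."""
    for line in lines:
        out = dict(record)  # assumption: existing fields (e.g. id) are inherited
        out.update(zip(columns, line.split(",")))
        yield out


def generate_uuid(record):
    """Return a copy of the record with a fresh random id."""
    out = dict(record)
    out["id"] = str(uuid.uuid4())
    return out


def index(records):
    """Model a Solr index: documents sharing a uniqueKey overwrite each other."""
    docs = {}
    for rec in records:
        docs[rec["id"]] = rec
    return docs


COLUMNS = ["msisdn", "sgsn", "bytes"]
LINES = [
    "9851031305,NameOfSomeNode2,8073",
    "9851031305,NameOfSomeNode8,445",
    "9851031305,NameOfSomeNode2,2065",
]

# generateUUID before readCSV: one id for the whole file, shared by every row.
uuid_first = index(read_csv(generate_uuid({}), LINES, COLUMNS))

# generateUUID after readCSV: every row gets its own id.
csv_first = index(generate_uuid(rec) for rec in read_csv({}, LINES, COLUMNS))

print(len(uuid_first), len(csv_first))  # prints: 1 3
```

Under these assumptions, the original ordering leaves a single surviving document per input file, which matches the symptom of one row being returned for a query that should match thousands.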


5 REPLIES

Cloudera Employee

I see your schema is using "id" as the unique key.  What values do you populate that field with?

Rising Star

Hi!

There is no hidden meaning to it. I just don't have any unique id, so I use a random one for that...

Cloudera Employee
Are you using Morphlines for indexing? If yes, can you post your morphline file here?


Rising Star
Thanks, it works!