Solr index not all dataset

Rising Star

Hi dear experts!

Recently I ran into some very strange Solr behavior: it does not index the whole dataset.

I am trying to index CSV files (around 0.5 TB in total).

But in the results I can see that not all rows are indexed (to be clear, only one copy of the data is indexed)...

 

Example:

Impala query:

select * from dpi2
where
msisdn=9851031305

returns:

   msisdn      sgsn             local_ip        local_port  external_ip   external_port  translated_ip   translated_port  site      uri                   _c10                  _c11                  bytes
0  9851031305  NameOfSomeNode2  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  MXAXWVMLKRMM9NLYQGFF  2014-11-30T03:11:18Z  2014-11-30T03:17:17Z  8073
1  9851031305  NameOfSomeNode8  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  WOOHQAFFWTQUNO93XNDH  2014-11-30T03:11:18Z  2014-11-30T03:12:42Z  445
2  9851031305  NameOfSomeNode2  192.168.11.187  6455        10.62.63.170  143            213.10.126.195  4041             bing.com  XKSLQUY2ROYYD1YMPVOI  2014-11-30T03:11:18Z  2014-11-30T03:17:21Z  2065

...and thousands more rows.

But the Solr search (msisdn:9851031305) returns only one row...

{"showDetails":false,"session_time":"[u'2014-11-30T03:11:18Z']","event_time":"[u'2014-11-30T03:15:58Z']","local_ip":"[u'192.168.11.187']","msisdn":["9851031305"],"sgsn":"[u'NameOfSomeNode3']","translated_ip":"[u'213.10.126.195']","bytes":"[4866]","site":"[u'bing.com']","url":"[u'RHHIAU0AIZCSXDDVLUU0']","translated_port":"[u'4041']","external_ip":"[u'10.62.63.170']","local_port":"[u'6455']","_version_":1490445936214671400,"external_port":"[u'143']","id":"aec90d3e-2ecd-4be9-8f4a-572819c1a127","details":[]}

 

schema.xml file:

 

<fields>
   <field name="msisdn"          type="string" indexed="true" stored="true" multiValued="true" />
   <field name="sgsn"            type="string" indexed="true" stored="true" multiValued="true" />
   <field name="local_ip"        type="string" indexed="true" stored="true" multiValued="true" />
   <field name="local_port"      type="string" indexed="true" stored="true" multiValued="true" />
   <field name="external_ip"     type="string" indexed="true" stored="true" multiValued="true" />
   <field name="external_port"   type="string" indexed="true" stored="true" multiValued="true" />
   <field name="translated_ip"   type="string" indexed="true" stored="true" multiValued="true" />
   <field name="translated_port" type="string" indexed="true" stored="true" multiValued="true" />
   <field name="site"            type="string" indexed="true" stored="true" multiValued="true" />
   <field name="url"             type="string" indexed="true" stored="true" multiValued="true" />
   <field name="session_time"    type="date"   indexed="true" stored="true" multiValued="true" />
   <field name="event_time"      type="date"   indexed="true" stored="true" multiValued="true" />
   <field name="bytes"           type="tint"   indexed="true" stored="true" multiValued="true" />
   <field name="_version_"       type="long"   indexed="true" stored="true" multiValued="false" />
   <field name="id"              type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <dynamicField name="ignored_*" type="ignored" />
</fields>

<uniqueKey>id</uniqueKey>

 

Any help is much appreciated!

 


Re: Solr index not all dataset

Cloudera Employee

I see your schema is using "id" as the unique key.  What values do you populate that field with?
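
Some context on why the id values matter: Solr treats a newly added document whose uniqueKey value matches an existing document as a replacement for it, so documents that share an id collapse into one. The following is only a minimal sketch of that behavior, with a plain Python dict standing in for the index; the function name and the sample id are hypothetical, and this is not Solr's API.

```python
def index_documents(docs, unique_key="id"):
    """Toy stand-in for an index: a later doc with the same key replaces the earlier one."""
    index = {}
    for doc in docs:
        index[doc[unique_key]] = doc
    return index


# Three rows from the Impala output, all given the same (hypothetical) id.
rows = [
    {"id": "same-id", "msisdn": "9851031305", "bytes": 8073},
    {"id": "same-id", "msisdn": "9851031305", "bytes": 445},
    {"id": "same-id", "msisdn": "9851031305", "bytes": 2065},
]

index = index_documents(rows)
print(len(index))  # number of documents that survive in the toy index
```

If the ids really are unique per row, every row should survive; if they repeat, only the last document written for each id remains searchable.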

Re: Solr index not all dataset

Rising Star

Hi!

 

There is no hidden meaning. I just don't have any unique id, so I use a random one for that...

Re: Solr index not all dataset

Cloudera Employee
Are you using Morphline for indexing? If yes, can you post your file on here?

Re: Solr index not all dataset

Contributor
In your morphline file (reviews.conf), change the order of the commands: move { generateUUID { field : id } } down so that it comes after readCSV.

morphlines : [
  {
    id : dpi1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : ","
          columns : [msisdn,sgsn,local_ip,local_port,external_ip,external_port,translated_ip,translated_port,site,url,session_time,event_time,bytes]
          quoteChar : "\""
          charset : UTF-8
        }
      }
      { generateUUID { field : id } }
      {
        if {
          conditions : [ { equals { id : [] } } ]
          then : [ { logDebug { format : "output record: {}", args : ["@{}"] } } ]
        }
      }
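
To illustrate why the order of these two commands matters, here is a rough Python simulation. It assumes, as I understand the morphline behavior, that readCSV emits one record per CSV line and that each emitted record inherits the fields already present on the input record. The functions below are hypothetical stand-ins rather than the Kite API, and the column list is shortened to three columns for brevity.

```python
import csv
import io
import uuid

COLUMNS = ["msisdn", "sgsn", "bytes"]  # shortened, illustrative column list
BODY = (
    "9851031305,NameOfSomeNode2,8073\n"
    "9851031305,NameOfSomeNode8,445\n"
    "9851031305,NameOfSomeNode2,2065\n"
)


def read_csv(record, columns):
    """Emit one record per CSV line; each inherits the input record's other fields."""
    template = {k: v for k, v in record.items() if k != "body"}
    for values in csv.reader(io.StringIO(record["body"])):
        out = dict(template)
        out.update(zip(columns, values))
        yield out


def generate_uuid(record, field="id"):
    """Return a copy of the record with a fresh random UUID in `field`."""
    out = dict(record)
    out[field] = str(uuid.uuid4())
    return out


def uuid_before_read():
    # One UUID is generated for the whole file, then copied onto every row.
    file_record = generate_uuid({"body": BODY})
    return list(read_csv(file_record, COLUMNS))


def uuid_after_read():
    # Each row is emitted first, then receives its own UUID.
    return [generate_uuid(r) for r in read_csv({"body": BODY}, COLUMNS)]


before = {r["id"] for r in uuid_before_read()}
after = {r["id"] for r in uuid_after_read()}
print(len(before), len(after))  # distinct ids: UUID before readCSV vs. after
```

Under these assumptions, generating the UUID first gives every row in a file the same id, and because id is the uniqueKey in schema.xml, only one document per file would remain in the index. Generating it after readCSV gives each row its own id.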

Re: Solr index not all dataset

Rising Star
Thanks, it works!