Created on 01-16-2015 03:58 AM - edited 09-16-2022 02:19 AM
Hi dear experts!
Recently I ran into some very strange Solr behavior: it does not index the whole dataset.
I am trying to index CSV files (around 0.5 TB in total).
But in the results I can see that not all rows are indexed (to be clear, only one copy of the data is indexed)...
Example:
Impala query:
select * from dpi2 where msisdn=9851031305
returns:
  | msisdn | sgsn | local_ip | local_port | external_ip | external_port | translated_ip | translated_port | site | uri | _c10 | _c11 | bytes |
0 | 9851031305 | NameOfSomeNode2 | 192.168.11.187 | 6455 | 10.62.63.170 | 143 | 213.10.126.195 | 4041 | bing.com | MXAXWVMLKRMM9NLYQGFF | 2014-11-30T03:11:18Z | 2014-11-30T03:17:17Z | 8073 |
1 | 9851031305 | NameOfSomeNode8 | 192.168.11.187 | 6455 | 10.62.63.170 | 143 | 213.10.126.195 | 4041 | bing.com | WOOHQAFFWTQUNO93XNDH | 2014-11-30T03:11:18Z | 2014-11-30T03:12:42Z | 445 |
2 | 9851031305 | NameOfSomeNode2 | 192.168.11.187 | 6455 | 10.62.63.170 | 143 | 213.10.126.195 | 4041 | bing.com | XKSLQUY2ROYYD1YMPVOI | 2014-11-30T03:11:18Z | 2014-11-30T03:17:21Z | 2065 |
...and thousands more rows.
But a Solr search (msisdn:9851031305) returns only one row...
{"showDetails":false,"session_time":"[u'2014-11-30T03:11:18Z']","event_time":"[u'2014-11-30T03:15:58Z']","local_ip":"[u'192.168.11.187']","msisdn":["9851031305"],"sgsn":"[u'NameOfSomeNode3']","translated_ip":"[u'213.10.126.195']","bytes":"[4866]","site":"[u'bing.com']","url":"[u'RHHIAU0AIZCSXDDVLUU0']","translated_port":"[u'4041']","external_ip":"[u'10.62.63.170']","local_port":"[u'6455']","_version_":1490445936214671400,"external_port":"[u'143']","id":"aec90d3e-2ecd-4be9-8f4a-572819c1a127","details":[]}
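One way to check whether Solr really holds a single matching document, rather than the client showing only part of the result, is to ask the select handler for the match count. The sketch below only builds the request URL and reads `numFound` from a response body. The host `localhost:8983` and the collection name `dpi2` are assumptions for illustration and are not taken from this thread.

```python
import json
from urllib.parse import urlencode


def build_count_url(base_url, collection, query):
    """Build a Solr select URL that returns only the match count (rows=0)."""
    params = {"q": query, "rows": 0, "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def num_found(response_text):
    """Extract the total number of matching documents from a Solr JSON response."""
    return json.loads(response_text)["response"]["numFound"]


# Hypothetical host and collection name; replace with the real ones.
url = build_count_url("http://localhost:8983/solr", "dpi2", "msisdn:9851031305")
print(url)
```

Comparing that count with `select count(*) from dpi2 where msisdn=9851031305` in Impala would show how many rows are actually missing from the index.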
schema.xml file:
<fields>
  <field name="msisdn" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="sgsn" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="local_ip" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="local_port" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="external_ip" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="external_port" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="translated_ip" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="translated_port" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="site" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="url" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="session_time" type="date" indexed="true" stored="true" multiValued="true" />
  <field name="event_time" type="date" indexed="true" stored="true" multiValued="true" />
  <field name="bytes" type="tint" indexed="true" stored="true" multiValued="true" />
  <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <dynamicField name="ignored_*" type="ignored"/>
</fields>
<uniqueKey>id</uniqueKey>
Any support is very much appreciated!
Created 01-22-2015 08:36 AM
I see your schema is using "id" as the unique key. What values do you populate that field with?
Created 01-22-2015 09:01 AM
Hi!
There is no hidden meaning. I just don't have any unique ID, so I use a random value for it...
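For context on why the unique key matters here: Solr replaces an existing document when a new one arrives with the same uniqueKey value. The sketch below is a plain-Python simulation of that behavior, not Solr code. The `index` helper is hypothetical, and the three rows are abbreviated from the Impala output earlier in the thread.

```python
import uuid


def index(docs, key_fn):
    """Simulate an index where a later doc with the same unique key replaces the earlier one."""
    store = {}
    for doc in docs:
        store[key_fn(doc)] = doc
    return store


# Three rows for the same msisdn, abbreviated from the Impala result.
rows = [
    {"msisdn": "9851031305", "sgsn": "NameOfSomeNode2", "bytes": 8073},
    {"msisdn": "9851031305", "sgsn": "NameOfSomeNode8", "bytes": 445},
    {"msisdn": "9851031305", "sgsn": "NameOfSomeNode2", "bytes": 2065},
]

by_random_id = index(rows, lambda doc: str(uuid.uuid4()))  # a fresh key per row
by_msisdn = index(rows, lambda doc: doc["msisdn"])         # the same key for every row

print(len(by_random_id), len(by_msisdn))  # 3 1
```

If the random ID is truly generated once per row, every row should survive, as in the first case. If the same value ends up being reused across rows, for example because it is generated once per file or per batch, the result would look like the second case, which matches the single-document symptom described above.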
Created 01-28-2015 07:00 AM
Created 02-05-2015 06:24 PM
Created 02-06-2015 12:34 AM