Created 09-12-2016 12:44 PM
I am trying to store Pig output into an Elasticsearch index,
but I get a "String index out of range: -1" exception.
Expected-output:-
(google_1473682742_265278445560,{(Thu Apr 12 17:38:47 +0000 2012,190494185374220289,190494185374220289,google اااااح الاجواء بتاعت سكس حااااارر منو الفحل اللي يبي اسوي له فولو يسوي رتويت,<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry®</a>,[hashtags#[],user_mentions#[],urls#[]],false,,0,false,),(Thu Apr 12 17:38:47 +0000 2012,190494185382608899,190494185382608899,kpit 味も素っ気もない人間とは…。,<a href="http://tapbots.com/tweetbot" rel="nofollow">Tweetbot for iOS</a>,[hashtags#[],user_mentions#[],urls#[]],false,,0,false,)})
describe output:-
output: {pattern: chararray,tweets: {(lowertweets::created_at: chararray,lowertweets::id: chararray,lowertweets::id_str: chararray,lowertweets::text: chararray,lowertweets::source: chararray,lowertweets::entities: map[chararray],lowertweets::favorited: boolean,lowertweets::favorite_count: long,lowertweets::retweet_count: long,lowertweets::retweeted: boolean,lowertweets::place: map[chararray])}}
script:-
STORE A INTO 'google_1473673952_265276863360/tweets' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes = ip:9200');
Curl Script:-
curl -XPUT 'http://hostname:9200/google_1473673952_265276863360/_mapping/tweets' -d ' { "tweets" :{ "properties" : { "pattern" : {"type" : "string", "store" : true}, "created_at" : {"type" : "string", "store" : true },"id" : {"type" : "string", "store" : true }, "id_str" : {"type" : "string", "store" : true },"text" : {"type" : "string", "store" : true },"source" : {"type" : "string", "store" : true },"entities" : {"type" : "string", "store" : true },"favorited" : {"type" : "boolean", "store" : true },"favorite_count" : {"type" : "string", "store" : true },"retweet_count" : {"type" : "string", "store" : true },"retweeted" : {"type" : "boolean", "store" : true },"place" : {"type" : "string", "store" : true } }}}'
Error:-
java.lang.Exception: java.io.IOException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.IOException: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:479)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:442)
I tried changing the data types in the curl mapping, but that did not work for me.
Any help would be appreciated.
Created 09-13-2016 07:36 PM
I believe the problem may be that your mapping declares some of the data elements as strings, whereas they are nested element types in Pig. For example, look at entities:
lowertweets::entities: map[chararray]
In your template you have this:
"entities" : {"type" : "string", "store" : true }
So Elasticsearch expects entities to be a string field, not a nested object field. The same is true for place:
lowertweets::place: map[chararray]
"place" : {"type" : "string", "store" : true }
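To make the mismatch concrete, here is a minimal sketch in Python (chosen only so it can be run locally without a cluster). It copies the Pig field types from the describe output and the Elasticsearch types from the curl mapping in the question, then lists the fields that are complex in Pig but mapped as plain strings. The helper name find_mismatches is hypothetical, and this is a way of spotting the discrepancy, not a confirmed fix for the exception.

```python
# Pig field types copied from the `describe` output in the question
PIG_SCHEMA = {
    "created_at": "chararray",
    "id": "chararray",
    "id_str": "chararray",
    "text": "chararray",
    "source": "chararray",
    "entities": "map[chararray]",
    "favorited": "boolean",
    "favorite_count": "long",
    "retweet_count": "long",
    "retweeted": "boolean",
    "place": "map[chararray]",
}

# Elasticsearch field types copied from the curl mapping in the question
ES_MAPPING = {
    "created_at": "string",
    "id": "string",
    "id_str": "string",
    "text": "string",
    "source": "string",
    "entities": "string",
    "favorited": "boolean",
    "favorite_count": "string",
    "retweet_count": "string",
    "retweeted": "boolean",
    "place": "string",
}


def find_mismatches(pig_schema, es_mapping):
    """Return fields that are complex in Pig but mapped as plain strings."""
    complex_prefixes = ("map", "bag", "tuple")
    return [
        name
        for name, pig_type in pig_schema.items()
        if pig_type.startswith(complex_prefixes) and es_mapping.get(name) == "string"
    ]


mismatches = find_mismatches(PIG_SCHEMA, ES_MAPPING)
print(mismatches)  # ['entities', 'place']
```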
You may want to look at: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/pig.html
es.mapping.pig.tuple.use.field.names = true
And this bug may be relevant:
Created 09-14-2016 05:18 AM
Thanks for your reply, Michael Young.
I tried what you suggested,
but I am still getting the same issue.
Here is what I tried:
STORE A INTO 'google_1473673952_265276863360/tweets' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes = ip:9200','es.mapping.pig.tuple.use.field.names = true');
Error:-
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(String.java:1967)
Do I need to change the curl script? If so, how can I declare an array data type for the map fields in the Elasticsearch mapping?
Please advise, as I am completely new to this.
Created 09-14-2016 01:29 PM
Yes, you need to change the template you are passing to Elasticsearch with the curl command. Here is the documentation for nested objects: https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-mapping.html
Here is an example of a nested definition:
{ "mappings": { "blogpost": { "properties": { "comments": { "type": "nested", "properties": { "name": { "type": "string" }, "comment": { "type": "string" }, "age": { "type": "short" }, "stars": { "type": "short" }, "date": { "type": "date" } } } } } } }
For the nested objects, entities and place in your case, you need to specify type "nested" instead of "string". Then add a "properties" node describing the content expected inside it. With the example mapping above, you might see data that looks like:
{ "blogpost" : { "comments" : { "name" : "Bob", "comment" : "This is my comment", "age": 43, "stars" : 5, "date" : "20160914" } } }
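Applied to the schema in the question, the same pattern might look like the sketch below. It is written in Python only so the JSON body can be generated and checked locally before sending it with curl. It assumes the Pig bag is indexed under the field name tweets and that the tuple fields keep their short names (created_at, text, and so on); neither assumption is confirmed in this thread. The function name nested_tweets_mapping is hypothetical, and the sub-fields of entities and place are left to dynamic mapping because their contents are not fully known.

```python
import json


def nested_tweets_mapping():
    """Build a mapping body in which the Pig bag `tweets` is a nested field."""
    text = {"type": "string"}  # pre-5.x string type, as used elsewhere in this thread
    tweet_properties = {
        "created_at": text,
        "id": text,
        "id_str": text,
        "text": text,
        "source": text,
        "entities": {"type": "object"},  # Pig map; sub-fields left to dynamic mapping
        "favorited": {"type": "boolean"},
        "favorite_count": {"type": "long"},
        "retweet_count": {"type": "long"},
        "retweeted": {"type": "boolean"},
        "place": {"type": "object"},  # Pig map; sub-fields left to dynamic mapping
    }
    return {
        # Outer "tweets" is the mapping type name from the curl URL
        "tweets": {
            "properties": {
                "pattern": text,
                # Inner "tweets" is the assumed field name for the Pig bag
                "tweets": {"type": "nested", "properties": tweet_properties},
            }
        }
    }


body = json.dumps(nested_tweets_mapping(), indent=2)
print(body)
```

The printed body could then be passed to the same curl -XPUT call used in the question in place of the hand-written JSON.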
Created 09-16-2016 07:34 AM
I tried what you suggested, Michael Young.
Curl Script:-
curl -XPUT 'http://hostname:9200/google_1473673952_265276863360/_mapping/tweets' -d '{ "tweets" : { "properties" : { "comments": { "type": "nested", "properties": { "pattern" : {"type" : "string", "store" : true}, "created_at" : {"type" : "string", "store" : true }, "id" : {"type" : "string", "store" : true }, "id_str" : {"type" : "string", "store" : true }, "text" : {"type" : "string", "store" : true }, "source" : {"type" : "string", "store" : true }, "entities" : {"type" : "string", "store" : true }, "favorited" : {"type" : "boolean", "store" : true }, "favorite_count" : {"type" : "long", "store" : true }, "retweet_count" : {"type" : "long", "store" : true }, "retweeted" : {"type" : "boolean", "store" : true }, "place" : {"type" : "string", "store" : true } } } } }}'
When I tried this, I got the same error, i.e. the StringIndexOutOfBoundsException.
I then changed the curl script and tried this:
curl -XPUT 'http://hostname:9200/google_1473673952_265276863360/_mapping/tweets' -d ' { "tweets" : { "properties" : { "comments": { "type": "nested", "properties": { "pattern" : {"type" : "string", "store" : true}, "created_at" : {"type" : "string", "store" : true }, "id" : {"type" : "string", "store" : true }, "id_str" : {"type" : "string", "store" : true }, "text" : {"type" : "string", "store" : true }, "source" : {"type" : "string", "store" : true }, "entities" :{ "properties" : { "type": "nested", "properties": { "urls": {"type": "string"}, "hashtags": {"type": "string"}, "user_mentions": {"type": "string"}, "symbols": {"type": "string"} } } }, "favorited" : {"type" : "boolean", "store" : true }, "favorite_count" : {"type" : "long", "store" : true }, "retweet_count" : {"type" : "long", "store" : true }, "retweeted" : {"type" : "boolean", "store" : true }, "place" :"properties":{ "comments":{ "type": "nested", "properties": { } } } } } } }}'
But I am unable to apply the mapping. I get this error: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [comments.entities] of different type, current_type [string], merged_type [ObjectMapper]"}],"type":"illegal_argument_exception","reason":"mapper [comments.entities] of different type, current_type [string], merged_type [ObjectMapper]"},"status":400}
Please tell me what I am missing.
Created 09-16-2016 04:45 PM
I believe you have an issue with your template. You have an extra "properties" level under entities:
"entities" :{ "properties" : { "type": "nested", "properties": {
You should instead have:
"entities" :{ "type": "nested", "properties": {
You can also try it without "type" : "nested" like this:
"entities" :{ "properties": {
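A quick way to catch this kind of slip before sending the mapping is to walk the JSON and flag any entry under "properties" whose value is not an object, since every child of "properties" should be a field definition. The following is a rough sketch in Python; the function name find_bad_properties and the two sample dictionaries are illustrative only and cover just the entities part of the template.

```python
def find_bad_properties(node, path=""):
    """Return dotted paths of `properties` children that are not field definitions."""
    problems = []
    if not isinstance(node, dict):
        return problems
    for key, value in node.items():
        here = f"{path}.{key}" if path else key
        if key == "properties" and isinstance(value, dict):
            # Every child of "properties" should itself be a JSON object
            for field, definition in value.items():
                if not isinstance(definition, dict):
                    problems.append(f"{here}.{field}")
        problems.extend(find_bad_properties(value, here))
    return problems


# Shape from the failing template: "type" sits inside an extra "properties" level
broken = {
    "entities": {
        "properties": {
            "type": "nested",
            "properties": {"urls": {"type": "string"}},
        }
    }
}

# Shape suggested above: "type" and "properties" are siblings
fixed = {
    "entities": {
        "type": "nested",
        "properties": {"urls": {"type": "string"}},
    }
}

print(find_bad_properties(broken))  # ['entities.properties.type']
print(find_bad_properties(fixed))   # []
```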
Created 09-19-2016 07:01 AM
I am so thankful for your reply, Michael Young.
I tried everything you suggested, but no luck.
I also tried mapping a single attribute, pattern, from the output below:
describe a;
a: {pattern: chararray,tweets: {(lowertweets::created_at: chararray,lowertweets::id: chararray,lowertweets::id_str: chararray,lowertweets::text: chararray,lowertweets::source: chararray,lowertweets::entities: map[chararray],lowertweets::favorited: boolean,lowertweets::favorite_count: long,lowertweets::retweet_count: long,lowertweets::retweeted: boolean,lowertweets::place: map[chararray])}}
The index and mapping were created without any issues, but when I try to store the data using
pig script:-
STORE A INTO 'google_1473673952_265276863360/tweets' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes = hostname:9200','es.mapping.pig.tuple.use.field.names = true');
I get the same issue again:
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(String.java:1967)
Please help me.
Created 09-19-2016 05:38 PM
The error message you are seeing is coming from Pig, correct? What do the Elasticsearch logs show is happening on that side?
Looking at your initial tweet example, I wonder whether the problem is related to mixed left-to-right and right-to-left text. I haven't seen it with your particular example before, but mixed-direction text can cause issues.
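One way to test that hypothesis is to separate the records that contain right-to-left characters and try storing the two groups independently. The sketch below, in Python, flags text containing characters whose Unicode bidirectional class is R or AL. The sample strings are made up for illustration (one Arabic greeting, one Japanese sentence taken from the question), and contains_rtl is a hypothetical helper; this only helps isolate suspect records and does not confirm that text direction is the cause.

```python
import unicodedata

# Unicode bidirectional classes for right-to-left letters (Hebrew, Arabic, etc.)
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text):
    """Return True if any character in `text` is a right-to-left letter."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


samples = [
    "google مرحبا",  # illustrative Arabic text, not from the original data
    "kpit 味も素っ気もない人間とは…。",  # Japanese text from the sample output
]
flags = [contains_rtl(s) for s in samples]
print(flags)  # [True, False]
```

If only the flagged records fail to store, that would point toward the text-direction theory; if both groups fail, the cause is more likely elsewhere.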