Support Questions
Find answers, ask questions, and share your expertise

Help on nested arrays coming out of MongoDB into Hive

New Contributor

Hi, I am really hoping there is someone in here that can help me, I've been stuck on this for 4 days now.

I have a MongoDB collection, sample of a document is:

{

"_id" : "58c879787f4fd6a4982f3039",

"Actions" : [

{

"What" :

{

"_t" :

[ "TargetBase", "UserTarget", "CreatedTarget" ],

"Who" :

{

"TontoUserId" : 12345,

"EmailAddress" : "someuser@test.com.au",

"Name" : "some-user"

}

},

"When" : ISODate("2017-03-14T23:49:06.585Z"),

"Reason" : "some text",

"Wrapup" : null

}

]

}

I have created an external table in hive:

create external table if not exists x (my_serial string, actions array<struct<what:struct<`_t`:array<string>, who:struct<tontouserid:string, emailaddress:string, name:string>>, `when`:string, reason:string, wrapup:string>>) STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"my_serial":"_id", "actions":"Actions"}')
TBLPROPERTIES('mongo.uri'='mongodb://testuser:testpwd@test-mongo.test.com.au:27017/somedb.somecollection?authSource=admin'); 

I don't get any errors when running the above, but when I query the table the only data that I get is the my_serial column:

select * from x limit 1; 
58c879787f4fd6a4982f3039        [{"what":null,"when":null,"reason":null,"wrapup":null}] 

Similar story when I try to explode the actions array:

select my_serial, actionstuff.what.`_t` from x
LATERAL VIEW explode(actions) actionstable as actionstuff
limit 1; 

If I try a different tack on creating the external table:

create external table if not exists x (my_serial string, actions array<string>) STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"my_serial":"_id", "actions":"Actions"}')
TBLPROPERTIES('mongo.uri'='mongodb://testuser:testpwd@test-mongo.test.com.au:27017/somedb.somecollection?authSource=admin'); 

I get data, but I don't know how to do anything with it:

58c879787f4fd6a4982f3039        ["{ \"What\" : { \"_t\" : [ \"TargetBase\" , \"UserTarget\" , \"CreatedTarget\"] , \"Who\" : { \"TontoUserId\" : 12345 , \"EmailAddress\" : \"someuser@test.com.au\" , \"Name\" : \"some-user\"}} , \"When\" : { \"$date\" : \"2017-03-14T23:15:04.047Z\"} , \"Reason\" : \"some text\" , \"Wrapup\" :  null }"] 

Anyone able to tell me what I'm doing wrong?

Much appreciated.

1 REPLY 1

New Contributor

Hi again,

I didn't get a reply on any of the sites that I posted my question on, but I have worked out a clunky solution that I wish to share in case anyone else is looking to do something similar.

-- create the external table to the mongo store 
create external table if not exists x (my_serial string, actions array<string>)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"my_serial":"_id", "actions":"Actions"}')
TBLPROPERTIES('mongo.uri'='mongodb://testuser:testpwd@test-mongo.test.com.au:27017/somedb.somecollection?authSource=admin'); 
-- create an intermediary table 
create table job_actions_exploded (my_serial string, t string, reason string, when_date timestamp, who_tonto_user_id string);
-- insert into the intermediary table
insert into job_actions_exploded
(my_serial, t, reason, when_date, who_tonto_user_id)
select my_serial
, reverse(split(reverse(regexp_replace(regexp_replace(regexp_replace(get_json_object(single_json_table.single_json, '$.What._t'), '\\["', ''), '","', ','), '"\\]', '')), ',')[0]) as t 
, get_json_object(single_json_table.single_json, '$.Reason') as reason
, from_utc_timestamp(regexp_replace(regexp_replace(regexp_replace(get_json_object(single_json_table.single_json, '$.When'), '\\{"\\$date"\\:"', ''), 'Z"\\}', ''), 'T', ' '), 'Australia/Brisbane') as when_date
, get_json_object(single_json_table.single_json, '$.What.Who.TontoUserId') as who_tonto_id
from x
lateral view explode(actions) single_json_table as single_json; 

The key to this is the lateral view explode to create single json strings which can then be inspected using the get_json_object function. The get_json_object function is case sensitive when supplying the '$.Column' name.

It's worth noting that I only needed the last value out of the 'What._t' array, hence the reverse,split,reverse. This could also have been achieved using the regexp_extract function.

I also needed the UTC date/time that is stored in mongo converted to Brisbane time (Australian Eastern Standard Time, AEST).

I hope this is useful for someone else out there.