Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Getting multiple records while loading nested XML data into HIVE

avatar
Explorer

Hi All,

I am trying to load nested XML file into HIVE table, but i see multiple lines getting generated for the values in Array.

XML File :

<?xml version="1.0" encoding="UTF-8"?> <FormServerResponse> <Status> <StatusCode>0</StatusCode> <StatusDesc> Success</StatusDesc> <ConfidenceIndex> D</ConfidenceIndex> <Messages> <Message severity="Success"> <StatusCode> 0</StatusCode> <PageNumber> 1</PageNumber> <FileName> Filxxxxe.pdf</FileName> <LayoutFileId> xxxx</LayoutFileId> <LayoutPageId> xxxxx</LayoutPageId> <LayoutIdentiferText/> <Text> xx xxxxxx xx</Text> </Message> </Messages> </Status> </FormServerResponse>

Command I used to load data into HIVE:

create table ViewTable( StatusCode string,StatusDesc string,ConfidenceIndex string,Messages struct<Message: array<struct< severity:string,statuscode:string,pagenumber:string,filename:string,layoutfileid:string, layoutpageid:string, layoutidentifertext:string,text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()", "column.xpath.StatusDesc" = "/Status/StatusDesc/text()","column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()", "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" );

Result :

select * from viewtable; OK 0 Success D {"message":[{"severity":null,"statuscode":"0","pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":"1","filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":"Filxxxxe.pdf","layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":"xxxx","layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":"xxxxx","layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":"xx xxxxxx xx"}]} Time taken: 0.423 seconds, Fetched: 1 row(s)

Result expected:

0 Success D {"message":[{"severity":success,"statuscode":"0","pagenumber":1,"filename":Filxxxe.pdf,"layoutfileid":xxxx,"layoutpageid":xxxxx,"layoutidentifertext":null,"text":xx xxxxx xx}]}

Could someone help me where I am making mistake?????

1 ACCEPTED SOLUTION

avatar

Hey @Anji Raju!
I made a test here, and it seems that you need to change your
Messages struct<Message: array<struct<
for
Messages array<struct<Message:struct<

Here's my test:

#My sample.xml file
<Status> <StatusCode>0</StatusCode> <StatusDesc>Success</StatusDesc> <ConfidenceIndex>D</ConfidenceIndex> <Messages> <Message> <severity></severity> <statuscode>0</statuscode> <pagenumber>1</pagenumber> <filename>Filxxxxe.pdf</filename> <layoutfileid>xxxx</layoutfileid> <layoutpageid>xxxxx</layoutpageid> <layoutidentifertext></layoutidentifertext> <text>xx xxxxxx xx</text> </Message> </Messages> </Status>
#Downloading and adding the jar from ibm.spss wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hiv... mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar #Creating the HDFS location for the table and putting the sample.xml above
hdfs dfs -mkdir /user/hive/warehouse/hive-xml-test hdfs dfs -put sample.xml /user/hive/warehouse/hive-xml-test #Starting the test in Hive hive> add jar /tmp/hivexmlserde-1.0.5.3.jar; create external table xmltest( StatusCode string, StatusDesc string, ConfidenceIndex string, Messages array<struct<Message:struct< severity:string, statuscode:string, pagenumber:string, filename:string, layoutfileid:string, layoutpageid:string, layoutidentifertext:string, text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()" ,"column.xpath.StatusDesc" = "/Status/StatusDesc/text()" ,"column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()" , "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' location '/user/hive/warehouse/hive-xml-test' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" ); hive> select * from xmltest; OK 0 Success D [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"Filxxxxe.pdf","layoutfileid":"xxxx","layoutpageid":"xxxxx","layoutidentifertext":null,"text":"xx xxxxxx xx"}}] Time taken: 0.104 seconds, Fetched: 1 row(s)

Hope this helps!

View solution in original post

4 REPLIES 4

avatar

Hey @Anji Raju!
I made a test here, and it seems that you need to change your
Messages struct<Message: array<struct<
for
Messages array<struct<Message:struct<

Here's my test:

#My sample.xml file
<Status> <StatusCode>0</StatusCode> <StatusDesc>Success</StatusDesc> <ConfidenceIndex>D</ConfidenceIndex> <Messages> <Message> <severity></severity> <statuscode>0</statuscode> <pagenumber>1</pagenumber> <filename>Filxxxxe.pdf</filename> <layoutfileid>xxxx</layoutfileid> <layoutpageid>xxxxx</layoutpageid> <layoutidentifertext></layoutidentifertext> <text>xx xxxxxx xx</text> </Message> </Messages> </Status>
#Downloading and adding the jar from ibm.spss wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hiv... mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar #Creating the HDFS location for the table and putting the sample.xml above
hdfs dfs -mkdir /user/hive/warehouse/hive-xml-test hdfs dfs -put sample.xml /user/hive/warehouse/hive-xml-test #Starting the test in Hive hive> add jar /tmp/hivexmlserde-1.0.5.3.jar; create external table xmltest( StatusCode string, StatusDesc string, ConfidenceIndex string, Messages array<struct<Message:struct< severity:string, statuscode:string, pagenumber:string, filename:string, layoutfileid:string, layoutpageid:string, layoutidentifertext:string, text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()" ,"column.xpath.StatusDesc" = "/Status/StatusDesc/text()" ,"column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()" , "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' location '/user/hive/warehouse/hive-xml-test' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" ); hive> select * from xmltest; OK 0 Success D [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"Filxxxxe.pdf","layoutfileid":"xxxx","layoutpageid":"xxxxx","layoutidentifertext":null,"text":"xx xxxxxx xx"}}] Time taken: 0.104 seconds, Fetched: 1 row(s)

Hope this helps!

avatar
Explorer

Thanks @Vinicius Higa Murakami. I was able to make correction and was able to achieve it.

There is one more help I need . I am trying to load the complex structure of XML> i am able to load all the fields but there is one section which I am trying in many way to load the data but upon the loading of data, i still see the values to be NULL>

Attached the files.

a) XPATH what i am writing b) the XML file I am loading

xpath.txt

nestedxml.xml

Output:

0 Success A [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page1_COMMERCIAL_INSURANCE_APPLICATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"2","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page2_CONTACT_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"3","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page3_GENERAL_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"4","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page4_PRIOR_CARRIER_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}}] {"transactionid":"80c32800-65cb-11e8-8f5a-005056a3433861521442784784","productversion":"2.06.3.00064","requestdate":"2018-06-01","processingtime":"4 sec","filesize":"464888 bytes","extractiontechnique":"Text"} {"lobcd":"CPKGE","filesreceivedcnt":"1","pagesreceivedcnt":"4","pagesprocessedcnt":"4","formsprocessedcnt":"1","forms":[{"form":{"formname":"ACORD_125_2016_03","totalfields":"533","fieldswithdata":"82","ratebearingfields":"12","ratebearingfieldswithdata":"9"}}]} NULL

Could you please help me with this

Thanks,

Harish

avatar

Hi @Anji Raju!
Hmm, guess there's a little almost invisible mistake in your xpath 🙂
Try to change your ReturnPayLoad

 "column.xpath.ReturnPayLoad" = "/FormServerResponse/ReturnPaylLoad/ACORD/SignonRq"

to ReturnPayload

"column.xpath.ReturnPayLoad" = "/FormServerResponse/ReturnPayload/ACORD/SignonRq"

Hope this helps!

avatar
Explorer

Thanks @Vinicius Higa Murakami , I was able to figure out the Typo Mistake. Thanks for the help