Support Questions

Find answers, ask questions, and share your expertise

Getting multiple records while loading nested XML data into HIVE

avatar
Explorer

Hi All,

I am trying to load nested XML file into HIVE table, but i see multiple lines getting generated for the values in Array.

XML File :

<?xml version="1.0" encoding="UTF-8"?> <FormServerResponse> <Status> <StatusCode>0</StatusCode> <StatusDesc> Success</StatusDesc> <ConfidenceIndex> D</ConfidenceIndex> <Messages> <Message severity="Success"> <StatusCode> 0</StatusCode> <PageNumber> 1</PageNumber> <FileName> Filxxxxe.pdf</FileName> <LayoutFileId> xxxx</LayoutFileId> <LayoutPageId> xxxxx</LayoutPageId> <LayoutIdentiferText/> <Text> xx xxxxxx xx</Text> </Message> </Messages> </Status> </FormServerResponse>

Command I used to load data into HIVE:

create table ViewTable( StatusCode string,StatusDesc string,ConfidenceIndex string,Messages struct<Message: array<struct< severity:string,statuscode:string,pagenumber:string,filename:string,layoutfileid:string, layoutpageid:string, layoutidentifertext:string,text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()", "column.xpath.StatusDesc" = "/Status/StatusDesc/text()","column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()", "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" );

Result :

select * from viewtable; OK 0 Success D {"message":[{"severity":null,"statuscode":"0","pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":"1","filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":"Filxxxxe.pdf","layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":"xxxx","layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":"xxxxx","layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":"xx xxxxxx xx"}]} Time taken: 0.423 seconds, Fetched: 1 row(s)

Result expected:

0 Success D {"message":[{"severity":success,"statuscode":"0","pagenumber":1,"filename":Filxxxe.pdf,"layoutfileid":xxxx,"layoutpageid":xxxxx,"layoutidentifertext":null,"text":xx xxxxx xx}]}

Could someone help me where I am making mistake?????

1 ACCEPTED SOLUTION

avatar

Hey @Anji Raju!
I made a test here, and it seems that you need to change your
Messages struct<Message: array<struct<
for
Messages array<struct<Message:struct<

Here's my test:

#My sample.xml file
<Status> <StatusCode>0</StatusCode> <StatusDesc>Success</StatusDesc> <ConfidenceIndex>D</ConfidenceIndex> <Messages> <Message> <severity></severity> <statuscode>0</statuscode> <pagenumber>1</pagenumber> <filename>Filxxxxe.pdf</filename> <layoutfileid>xxxx</layoutfileid> <layoutpageid>xxxxx</layoutpageid> <layoutidentifertext></layoutidentifertext> <text>xx xxxxxx xx</text> </Message> </Messages> </Status>
#Downloading and adding the jar from ibm.spss wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hiv... mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar #Creating the HDFS location for the table and putting the sample.xml above
hdfs dfs -mkdir /user/hive/warehouse/hive-xml-test hdfs dfs -put sample.xml /user/hive/warehouse/hive-xml-test #Starting the test in Hive hive> add jar /tmp/hivexmlserde-1.0.5.3.jar; create external table xmltest( StatusCode string, StatusDesc string, ConfidenceIndex string, Messages array<struct<Message:struct< severity:string, statuscode:string, pagenumber:string, filename:string, layoutfileid:string, layoutpageid:string, layoutidentifertext:string, text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()" ,"column.xpath.StatusDesc" = "/Status/StatusDesc/text()" ,"column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()" , "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' location '/user/hive/warehouse/hive-xml-test' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" ); hive> select * from xmltest; OK 0 Success D [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"Filxxxxe.pdf","layoutfileid":"xxxx","layoutpageid":"xxxxx","layoutidentifertext":null,"text":"xx xxxxxx xx"}}] Time taken: 0.104 seconds, Fetched: 1 row(s)

Hope this helps!

View solution in original post

4 REPLIES 4

avatar

Hey @Anji Raju!
I made a test here, and it seems that you need to change your
Messages struct<Message: array<struct<
for
Messages array<struct<Message:struct<

Here's my test:

#My sample.xml file
<Status> <StatusCode>0</StatusCode> <StatusDesc>Success</StatusDesc> <ConfidenceIndex>D</ConfidenceIndex> <Messages> <Message> <severity></severity> <statuscode>0</statuscode> <pagenumber>1</pagenumber> <filename>Filxxxxe.pdf</filename> <layoutfileid>xxxx</layoutfileid> <layoutpageid>xxxxx</layoutpageid> <layoutidentifertext></layoutidentifertext> <text>xx xxxxxx xx</text> </Message> </Messages> </Status>
#Downloading and adding the jar from ibm.spss wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hiv... mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar #Creating the HDFS location for the table and putting the sample.xml above
hdfs dfs -mkdir /user/hive/warehouse/hive-xml-test hdfs dfs -put sample.xml /user/hive/warehouse/hive-xml-test #Starting the test in Hive hive> add jar /tmp/hivexmlserde-1.0.5.3.jar; create external table xmltest( StatusCode string, StatusDesc string, ConfidenceIndex string, Messages array<struct<Message:struct< severity:string, statuscode:string, pagenumber:string, filename:string, layoutfileid:string, layoutpageid:string, layoutidentifertext:string, text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()" ,"column.xpath.StatusDesc" = "/Status/StatusDesc/text()" ,"column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()" , "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' location '/user/hive/warehouse/hive-xml-test' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" ); hive> select * from xmltest; OK 0 Success D [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"Filxxxxe.pdf","layoutfileid":"xxxx","layoutpageid":"xxxxx","layoutidentifertext":null,"text":"xx xxxxxx xx"}}] Time taken: 0.104 seconds, Fetched: 1 row(s)

Hope this helps!

avatar
Explorer

Thanks @Vinicius Higa Murakami. I was able to make correction and was able to achieve it.

There is one more help I need . I am trying to load the complex structure of XML> i am able to load all the fields but there is one section which I am trying in many way to load the data but upon the loading of data, i still see the values to be NULL>

Attached the files.

a) XPATH what i am writing b) the XML file I am loading

xpath.txt

nestedxml.xml

Output:

0 Success A [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page1_COMMERCIAL_INSURANCE_APPLICATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"2","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page2_CONTACT_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"3","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page3_GENERAL_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"4","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page4_PRIOR_CARRIER_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}}] {"transactionid":"80c32800-65cb-11e8-8f5a-005056a3433861521442784784","productversion":"2.06.3.00064","requestdate":"2018-06-01","processingtime":"4 sec","filesize":"464888 bytes","extractiontechnique":"Text"} {"lobcd":"CPKGE","filesreceivedcnt":"1","pagesreceivedcnt":"4","pagesprocessedcnt":"4","formsprocessedcnt":"1","forms":[{"form":{"formname":"ACORD_125_2016_03","totalfields":"533","fieldswithdata":"82","ratebearingfields":"12","ratebearingfieldswithdata":"9"}}]} NULL

Could you please help me with this

Thanks,

Harish

avatar

Hi @Anji Raju!
Hmm, guess there's a little almost invisible mistake in your xpath 🙂
Try to change your ReturnPayLoad

 "column.xpath.ReturnPayLoad" = "/FormServerResponse/ReturnPaylLoad/ACORD/SignonRq"

to ReturnPayload

"column.xpath.ReturnPayLoad" = "/FormServerResponse/ReturnPayload/ACORD/SignonRq"

Hope this helps!

avatar
Explorer

Thanks @Vinicius Higa Murakami , I was able to figure out the Typo Mistake. Thanks for the help