Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Getting multiple records while loading nested XML data into HIVE

Solved Go to solution
Highlighted

Getting multiple records while loading nested XML data into HIVE

New Contributor

Hi All,

I am trying to load nested XML file into HIVE table, but i see multiple lines getting generated for the values in Array.

XML File :

<?xml version="1.0" encoding="UTF-8"?> <FormServerResponse> <Status> <StatusCode>0</StatusCode> <StatusDesc> Success</StatusDesc> <ConfidenceIndex> D</ConfidenceIndex> <Messages> <Message severity="Success"> <StatusCode> 0</StatusCode> <PageNumber> 1</PageNumber> <FileName> Filxxxxe.pdf</FileName> <LayoutFileId> xxxx</LayoutFileId> <LayoutPageId> xxxxx</LayoutPageId> <LayoutIdentiferText/> <Text> xx xxxxxx xx</Text> </Message> </Messages> </Status> </FormServerResponse>

Command I used to load data into HIVE:

create table ViewTable( StatusCode string,StatusDesc string,ConfidenceIndex string,Messages struct<Message: array<struct< severity:string,statuscode:string,pagenumber:string,filename:string,layoutfileid:string, layoutpageid:string, layoutidentifertext:string,text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()", "column.xpath.StatusDesc" = "/Status/StatusDesc/text()","column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()", "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" );

Result :

select * from viewtable; OK 0 Success D {"message":[{"severity":null,"statuscode":"0","pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":"1","filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":"Filxxxxe.pdf","layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":"xxxx","layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":"xxxxx","layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":"xx xxxxxx xx"}]} Time taken: 0.423 seconds, Fetched: 1 row(s)

Result expected:

0 Success D {"message":[{"severity":success,"statuscode":"0","pagenumber":1,"filename":Filxxxe.pdf,"layoutfileid":xxxx,"layoutpageid":xxxxx,"layoutidentifertext":null,"text":xx xxxxx xx}]}

Could someone help me where I am making mistake?????

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Getting multiple records while loading nested XML data into HIVE

Hey @Anji Raju!
I made a test here, and it seems that you need to change your
Messages struct<Message: array<struct<
for
Messages array<struct<Message:struct<

Here's my test:

#My sample.xml file
<Status> <StatusCode>0</StatusCode> <StatusDesc>Success</StatusDesc> <ConfidenceIndex>D</ConfidenceIndex> <Messages> <Message> <severity></severity> <statuscode>0</statuscode> <pagenumber>1</pagenumber> <filename>Filxxxxe.pdf</filename> <layoutfileid>xxxx</layoutfileid> <layoutpageid>xxxxx</layoutpageid> <layoutidentifertext></layoutidentifertext> <text>xx xxxxxx xx</text> </Message> </Messages> </Status>
#Downloading and adding the jar from ibm.spss wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hiv... mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar #Creating the HDFS location for the table and putting the sample.xml above
hdfs dfs -mkdir /user/hive/warehouse/hive-xml-test hdfs dfs -put sample.xml /user/hive/warehouse/hive-xml-test #Starting the test in Hive hive> add jar /tmp/hivexmlserde-1.0.5.3.jar; create external table xmltest( StatusCode string, StatusDesc string, ConfidenceIndex string, Messages array<struct<Message:struct< severity:string, statuscode:string, pagenumber:string, filename:string, layoutfileid:string, layoutpageid:string, layoutidentifertext:string, text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()" ,"column.xpath.StatusDesc" = "/Status/StatusDesc/text()" ,"column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()" , "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' location '/user/hive/warehouse/hive-xml-test' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" ); hive> select * from xmltest; OK 0 Success D [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"Filxxxxe.pdf","layoutfileid":"xxxx","layoutpageid":"xxxxx","layoutidentifertext":null,"text":"xx xxxxxx xx"}}] Time taken: 0.104 seconds, Fetched: 1 row(s)

Hope this helps!

4 REPLIES 4

Re: Getting multiple records while loading nested XML data into HIVE

Hey @Anji Raju!
I made a test here, and it seems that you need to change your
Messages struct<Message: array<struct<
for
Messages array<struct<Message:struct<

Here's my test:

#My sample.xml file
<Status> <StatusCode>0</StatusCode> <StatusDesc>Success</StatusDesc> <ConfidenceIndex>D</ConfidenceIndex> <Messages> <Message> <severity></severity> <statuscode>0</statuscode> <pagenumber>1</pagenumber> <filename>Filxxxxe.pdf</filename> <layoutfileid>xxxx</layoutfileid> <layoutpageid>xxxxx</layoutpageid> <layoutidentifertext></layoutidentifertext> <text>xx xxxxxx xx</text> </Message> </Messages> </Status>
#Downloading and adding the jar from ibm.spss wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hiv... mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar #Creating the HDFS location for the table and putting the sample.xml above
hdfs dfs -mkdir /user/hive/warehouse/hive-xml-test hdfs dfs -put sample.xml /user/hive/warehouse/hive-xml-test #Starting the test in Hive hive> add jar /tmp/hivexmlserde-1.0.5.3.jar; create external table xmltest( StatusCode string, StatusDesc string, ConfidenceIndex string, Messages array<struct<Message:struct< severity:string, statuscode:string, pagenumber:string, filename:string, layoutfileid:string, layoutpageid:string, layoutidentifertext:string, text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()" ,"column.xpath.StatusDesc" = "/Status/StatusDesc/text()" ,"column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()" , "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' location '/user/hive/warehouse/hive-xml-test' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" ); hive> select * from xmltest; OK 0 Success D [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"Filxxxxe.pdf","layoutfileid":"xxxx","layoutpageid":"xxxxx","layoutidentifertext":null,"text":"xx xxxxxx xx"}}] Time taken: 0.104 seconds, Fetched: 1 row(s)

Hope this helps!

Re: Getting multiple records while loading nested XML data into HIVE

New Contributor

Thanks @Vinicius Higa Murakami. I was able to make correction and was able to achieve it.

There is one more help I need . I am trying to load the complex structure of XML> i am able to load all the fields but there is one section which I am trying in many way to load the data but upon the loading of data, i still see the values to be NULL>

Attached the files.

a) XPATH what i am writing b) the XML file I am loading

xpath.txt

nestedxml.xml

Output:

0 Success A [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page1_COMMERCIAL_INSURANCE_APPLICATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"2","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page2_CONTACT_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"3","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page3_GENERAL_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"4","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page4_PRIOR_CARRIER_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}}] {"transactionid":"80c32800-65cb-11e8-8f5a-005056a3433861521442784784","productversion":"2.06.3.00064","requestdate":"2018-06-01","processingtime":"4 sec","filesize":"464888 bytes","extractiontechnique":"Text"} {"lobcd":"CPKGE","filesreceivedcnt":"1","pagesreceivedcnt":"4","pagesprocessedcnt":"4","formsprocessedcnt":"1","forms":[{"form":{"formname":"ACORD_125_2016_03","totalfields":"533","fieldswithdata":"82","ratebearingfields":"12","ratebearingfieldswithdata":"9"}}]} NULL

Could you please help me with this

Thanks,

Harish

Re: Getting multiple records while loading nested XML data into HIVE

Hi @Anji Raju!
Hmm, guess there's a little almost invisible mistake in your xpath :)
Try to change your ReturnPayLoad

 "column.xpath.ReturnPayLoad" = "/FormServerResponse/ReturnPaylLoad/ACORD/SignonRq"

to ReturnPayload

"column.xpath.ReturnPayLoad" = "/FormServerResponse/ReturnPayload/ACORD/SignonRq"

Hope this helps!

Re: Getting multiple records while loading nested XML data into HIVE

New Contributor

Thanks @Vinicius Higa Murakami , I was able to figure out the Typo Mistake. Thanks for the help

Don't have an account?
Coming from Hortonworks? Activate your account here