Created 06-14-2018 03:41 PM
Hi All,
I am trying to load nested XML file into HIVE table, but i see multiple lines getting generated for the values in Array.
XML File :
<?xml version="1.0" encoding="UTF-8"?>
<FormServerResponse>
<Status>
<StatusCode>0</StatusCode>
<StatusDesc> Success</StatusDesc>
<ConfidenceIndex> D</ConfidenceIndex>
<Messages>
<Message severity="Success">
<StatusCode> 0</StatusCode>
<PageNumber> 1</PageNumber>
<FileName> Filxxxxe.pdf</FileName>
<LayoutFileId> xxxx</LayoutFileId>
<LayoutPageId> xxxxx</LayoutPageId>
<LayoutIdentiferText/>
<Text> xx xxxxxx xx</Text>
</Message>
</Messages>
</Status>
</FormServerResponse>
Command I used to load data into HIVE:
create table ViewTable( StatusCode string,StatusDesc string,ConfidenceIndex string,Messages struct<Message: array<struct< severity:string,statuscode:string,pagenumber:string,filename:string,layoutfileid:string, layoutpageid:string, layoutidentifertext:string,text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()", "column.xpath.StatusDesc" = "/Status/StatusDesc/text()","column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()", "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" );
Result :
select * from viewtable; OK 0 Success D {"message":[{"severity":null,"statuscode":"0","pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":"1","filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":"Filxxxxe.pdf","layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":"xxxx","layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":"xxxxx","layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":null},{"severity":null,"statuscode":null,"pagenumber":null,"filename":null,"layoutfileid":null,"layoutpageid":null,"layoutidentifertext":null,"text":"xx xxxxxx xx"}]} Time taken: 0.423 seconds, Fetched: 1 row(s)
Result expected:
0 Success D {"message":[{"severity":success,"statuscode":"0","pagenumber":1,"filename":Filxxxe.pdf,"layoutfileid":xxxx,"layoutpageid":xxxxx,"layoutidentifertext":null,"text":xx xxxxx xx}]}
Could someone help me where I am making mistake?????
Created 06-15-2018 11:04 PM
Hey @Anji Raju!
I made a test here, and it seems that you need to change your
Messages struct<Message: array<struct<
for
Messages array<struct<Message:struct<
Here's my test:
#My sample.xml file
<Status> <StatusCode>0</StatusCode> <StatusDesc>Success</StatusDesc> <ConfidenceIndex>D</ConfidenceIndex> <Messages> <Message> <severity></severity> <statuscode>0</statuscode> <pagenumber>1</pagenumber> <filename>Filxxxxe.pdf</filename> <layoutfileid>xxxx</layoutfileid> <layoutpageid>xxxxx</layoutpageid> <layoutidentifertext></layoutidentifertext> <text>xx xxxxxx xx</text> </Message> </Messages> </Status>
#Downloading and adding the jar from ibm.spss wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hiv... mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar #Creating the HDFS location for the table and putting the sample.xml above
hdfs dfs -mkdir /user/hive/warehouse/hive-xml-test hdfs dfs -put sample.xml /user/hive/warehouse/hive-xml-test #Starting the test in Hive hive> add jar /tmp/hivexmlserde-1.0.5.3.jar; create external table xmltest( StatusCode string, StatusDesc string, ConfidenceIndex string, Messages array<struct<Message:struct< severity:string, statuscode:string, pagenumber:string, filename:string, layoutfileid:string, layoutpageid:string, layoutidentifertext:string, text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()" ,"column.xpath.StatusDesc" = "/Status/StatusDesc/text()" ,"column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()" , "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' location '/user/hive/warehouse/hive-xml-test' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" ); hive> select * from xmltest; OK 0 Success D [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"Filxxxxe.pdf","layoutfileid":"xxxx","layoutpageid":"xxxxx","layoutidentifertext":null,"text":"xx xxxxxx xx"}}] Time taken: 0.104 seconds, Fetched: 1 row(s)
Hope this helps!
Created 06-15-2018 11:04 PM
Hey @Anji Raju!
I made a test here, and it seems that you need to change your
Messages struct<Message: array<struct<
for
Messages array<struct<Message:struct<
Here's my test:
#My sample.xml file
<Status> <StatusCode>0</StatusCode> <StatusDesc>Success</StatusDesc> <ConfidenceIndex>D</ConfidenceIndex> <Messages> <Message> <severity></severity> <statuscode>0</statuscode> <pagenumber>1</pagenumber> <filename>Filxxxxe.pdf</filename> <layoutfileid>xxxx</layoutfileid> <layoutpageid>xxxxx</layoutpageid> <layoutidentifertext></layoutidentifertext> <text>xx xxxxxx xx</text> </Message> </Messages> </Status>
#Downloading and adding the jar from ibm.spss wget http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hiv... mv remotecontent?filepath=com%2Fibm%2Fspss%2Fhive%2Fserde2%2Fxml%2Fhivexmlserde%2F1.0.5.3%2Fhivexmlserde-1.0.5.3.jar hivexmlserde-1.0.5.3.jar #Creating the HDFS location for the table and putting the sample.xml above
hdfs dfs -mkdir /user/hive/warehouse/hive-xml-test hdfs dfs -put sample.xml /user/hive/warehouse/hive-xml-test #Starting the test in Hive hive> add jar /tmp/hivexmlserde-1.0.5.3.jar; create external table xmltest( StatusCode string, StatusDesc string, ConfidenceIndex string, Messages array<struct<Message:struct< severity:string, statuscode:string, pagenumber:string, filename:string, layoutfileid:string, layoutpageid:string, layoutidentifertext:string, text:string>>> ) row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe' with serdeproperties ( "column.xpath.StatusCode" = "/Status/StatusCode/text()" ,"column.xpath.StatusDesc" = "/Status/StatusDesc/text()" ,"column.xpath.ConfidenceIndex" = "/Status/ConfidenceIndex/text()" , "column.xpath.Messages" = "/Status/Messages/Message" ) stored as inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' location '/user/hive/warehouse/hive-xml-test' tblproperties ( "xmlinput.start" = "<Status>" ,"xmlinput.end" = "</Status>" ); hive> select * from xmltest; OK 0 Success D [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"Filxxxxe.pdf","layoutfileid":"xxxx","layoutpageid":"xxxxx","layoutidentifertext":null,"text":"xx xxxxxx xx"}}] Time taken: 0.104 seconds, Fetched: 1 row(s)
Hope this helps!
Created 06-18-2018 08:14 PM
Thanks @Vinicius Higa Murakami. I was able to make correction and was able to achieve it.
There is one more help I need . I am trying to load the complex structure of XML> i am able to load all the fields but there is one section which I am trying in many way to load the data but upon the loading of data, i still see the values to be NULL>
Attached the files.
a) XPATH what i am writing b) the XML file I am loading
Output:
0 Success A [{"message":{"severity":null,"statuscode":"0","pagenumber":"1","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page1_COMMERCIAL_INSURANCE_APPLICATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"2","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page2_CONTACT_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"3","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page3_GENERAL_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}},{"message":{"severity":null,"statuscode":"0","pagenumber":"4","filename":"File.pdf","layoutfileid":"ACORD_125_2016_03","layoutpageid":"page4_PRIOR_CARRIER_INFORMATION","layoutidentifertext":null,"text":"Page Successfully Processed"}}] {"transactionid":"80c32800-65cb-11e8-8f5a-005056a3433861521442784784","productversion":"2.06.3.00064","requestdate":"2018-06-01","processingtime":"4 sec","filesize":"464888 bytes","extractiontechnique":"Text"} {"lobcd":"CPKGE","filesreceivedcnt":"1","pagesreceivedcnt":"4","pagesprocessedcnt":"4","formsprocessedcnt":"1","forms":[{"form":{"formname":"ACORD_125_2016_03","totalfields":"533","fieldswithdata":"82","ratebearingfields":"12","ratebearingfieldswithdata":"9"}}]} NULL
Could you please help me with this
Thanks,
Harish
Created 06-20-2018 03:00 AM
Hi @Anji Raju!
Hmm, guess there's a little almost invisible mistake in your xpath 🙂
Try to change your ReturnPayLoad
"column.xpath.ReturnPayLoad" = "/FormServerResponse/ReturnPaylLoad/ACORD/SignonRq"
to ReturnPayload
"column.xpath.ReturnPayLoad" = "/FormServerResponse/ReturnPayload/ACORD/SignonRq"
Hope this helps!
Created 06-20-2018 01:59 PM
Thanks @Vinicius Higa Murakami , I was able to figure out the Typo Mistake. Thanks for the help