Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive XML Parising - Null value returned

avatar
Contributor

Hi All,

Tried out a sample xml parsing using the Serde. but it returns a null value.

hive> DROP TABLE BOOKDATA;
OK
Time taken: 0.486 seconds
hive>
  > CREATE EXTERNAL TABLE BOOKDATA(
  > TITLE VARCHAR(40),
  > PRICE  INT
  > )ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
  > WITH SERDEPROPERTIES (
  > "column.xpath.TITLE"="/CATALOG/BOOK/TITLE/",
  > "column.xpath.PRICE"="/CATALOG/BOOK/PRICE/")
  > STORED AS
  > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  > LOCATION '/sourcedata'
  > TBLPROPERTIES (
  > "xmlinput.start"="<CATALOG",
  > "xmlinput.end"= "</CATALOG>"
  > );
OK
Time taken: 0.285 seconds
hive> select * from BOOKDATA;
OK
NULL  NULL
Time taken: 0.184 seconds, Fetched: 1 row(s)
hive>
~]$ hadoop fs -cat /sourcedata/bookdata.xml
<CATALOG>
<BOOK>
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>
1 ACCEPTED SOLUTION

avatar
Master Guru

Hi @elan chelian. It works with following changes (details here😞

  • TITLE must be STRING, it seems XmlSerDe doesn't support VARCHAR yet
  • PRICE must be declared as FLOAT or DOUBLE, not INT (e.g., 24.90)
  • Your unit record of data is BOOK, not CATALOG
  • You are missing text() to capture specific values

Declarations:

DROP TABLE IF EXISTS BOOKDATA;
CREATE EXTERNAL TABLE BOOKDATA (TITLE STRING, PRICE FLOAT)
 ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
 WITH SERDEPROPERTIES (
 "column.xpath.TITLE"="/BOOK/TITLE/text()",
 "column.xpath.PRICE"="/BOOK/PRICE/text()")
 STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
 LOCATION '/user/it1/hive/xml'
 TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");

Test:

hive> select * from BOOKDATA;
OK
Hadoop Defnitive Guide	24.9
Programming Pig	30.9
Time taken: 0.081 seconds, Fetched: 2 row(s)

View solution in original post

3 REPLIES 3

avatar
Master Guru

Hi @elan chelian. It works with following changes (details here😞

  • TITLE must be STRING, it seems XmlSerDe doesn't support VARCHAR yet
  • PRICE must be declared as FLOAT or DOUBLE, not INT (e.g., 24.90)
  • Your unit record of data is BOOK, not CATALOG
  • You are missing text() to capture specific values

Declarations:

DROP TABLE IF EXISTS BOOKDATA;
CREATE EXTERNAL TABLE BOOKDATA (TITLE STRING, PRICE FLOAT)
 ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
 WITH SERDEPROPERTIES (
 "column.xpath.TITLE"="/BOOK/TITLE/text()",
 "column.xpath.PRICE"="/BOOK/PRICE/text()")
 STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
 LOCATION '/user/it1/hive/xml'
 TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");

Test:

hive> select * from BOOKDATA;
OK
Hadoop Defnitive Guide	24.9
Programming Pig	30.9
Time taken: 0.081 seconds, Fetched: 2 row(s)

avatar
New Contributor

Hello Sir

I got the output as below but I am not getting any data do you know why?

 

hive> select * from BOOKDATA;
OK
Hadoop Defnitive Guide	24.9
Programming Pig	30.9
Time taken: 0.081 seconds, Fetched: 2 row(s)

avatar
Contributor

@Predrag Minovic Thanks for the answer. impressive. I was assuming the root node (catalog) needs to be mentioned in the xmlinput.start and xmlinput.end, so that all the nodes in between the root nodes can be queried using the xpath. Thanks for the clarification.