Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive XML Parising - Null value returned

Solved Go to solution
Highlighted

Hive XML Parising - Null value returned

Hi All,

Tried out a sample xml parsing using the Serde. but it returns a null value.

hive> DROP TABLE BOOKDATA;
OK
Time taken: 0.486 seconds
hive>
  > CREATE EXTERNAL TABLE BOOKDATA(
  > TITLE VARCHAR(40),
  > PRICE  INT
  > )ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
  > WITH SERDEPROPERTIES (
  > "column.xpath.TITLE"="/CATALOG/BOOK/TITLE/",
  > "column.xpath.PRICE"="/CATALOG/BOOK/PRICE/")
  > STORED AS
  > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  > LOCATION '/sourcedata'
  > TBLPROPERTIES (
  > "xmlinput.start"="<CATALOG",
  > "xmlinput.end"= "</CATALOG>"
  > );
OK
Time taken: 0.285 seconds
hive> select * from BOOKDATA;
OK
NULL  NULL
Time taken: 0.184 seconds, Fetched: 1 row(s)
hive>
~]$ hadoop fs -cat /sourcedata/bookdata.xml
<CATALOG>
<BOOK>
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>
1 ACCEPTED SOLUTION

Accepted Solutions

Re: Hive XML Parising - Null value returned

Hi @elan chelian. It works with following changes (details here):

  • TITLE must be STRING, it seems XmlSerDe doesn't support VARCHAR yet
  • PRICE must be declared as FLOAT or DOUBLE, not INT (e.g., 24.90)
  • Your unit record of data is BOOK, not CATALOG
  • You are missing text() to capture specific values

Declarations:

DROP TABLE IF EXISTS BOOKDATA;
CREATE EXTERNAL TABLE BOOKDATA (TITLE STRING, PRICE FLOAT)
 ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
 WITH SERDEPROPERTIES (
 "column.xpath.TITLE"="/BOOK/TITLE/text()",
 "column.xpath.PRICE"="/BOOK/PRICE/text()")
 STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
 LOCATION '/user/it1/hive/xml'
 TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");

Test:

hive> select * from BOOKDATA;
OK
Hadoop Defnitive Guide	24.9
Programming Pig	30.9
Time taken: 0.081 seconds, Fetched: 2 row(s)
2 REPLIES 2

Re: Hive XML Parising - Null value returned

Hi @elan chelian. It works with following changes (details here):

  • TITLE must be STRING, it seems XmlSerDe doesn't support VARCHAR yet
  • PRICE must be declared as FLOAT or DOUBLE, not INT (e.g., 24.90)
  • Your unit record of data is BOOK, not CATALOG
  • You are missing text() to capture specific values

Declarations:

DROP TABLE IF EXISTS BOOKDATA;
CREATE EXTERNAL TABLE BOOKDATA (TITLE STRING, PRICE FLOAT)
 ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
 WITH SERDEPROPERTIES (
 "column.xpath.TITLE"="/BOOK/TITLE/text()",
 "column.xpath.PRICE"="/BOOK/PRICE/text()")
 STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
 OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
 LOCATION '/user/it1/hive/xml'
 TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");

Test:

hive> select * from BOOKDATA;
OK
Hadoop Defnitive Guide	24.9
Programming Pig	30.9
Time taken: 0.081 seconds, Fetched: 2 row(s)

Re: Hive XML Parising - Null value returned

@Predrag Minovic Thanks for the answer. impressive. I was assuming the root node (catalog) needs to be mentioned in the xmlinput.start and xmlinput.end, so that all the nodes in between the root nodes can be queried using the xpath. Thanks for the clarification.

Don't have an account?
Coming from Hortonworks? Activate your account here