- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Hive XML Parising - Null value returned
- Labels:
-
Apache Hive
Created 06-21-2016 03:35 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
Tried out a sample xml parsing using the Serde. but it returns a null value.
hive> DROP TABLE BOOKDATA; OK Time taken: 0.486 seconds hive> > CREATE EXTERNAL TABLE BOOKDATA( > TITLE VARCHAR(40), > PRICE INT > )ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' > WITH SERDEPROPERTIES ( > "column.xpath.TITLE"="/CATALOG/BOOK/TITLE/", > "column.xpath.PRICE"="/CATALOG/BOOK/PRICE/") > STORED AS > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' > LOCATION '/sourcedata' > TBLPROPERTIES ( > "xmlinput.start"="<CATALOG", > "xmlinput.end"= "</CATALOG>" > ); OK Time taken: 0.285 seconds
hive> select * from BOOKDATA; OK NULL NULL Time taken: 0.184 seconds, Fetched: 1 row(s) hive>
~]$ hadoop fs -cat /sourcedata/bookdata.xml <CATALOG> <BOOK> <TITLE>Hadoop Defnitive Guide</TITLE> <AUTHOR>Tom White</AUTHOR> <COUNTRY>US</COUNTRY> <COMPANY>CLOUDERA</COMPANY> <PRICE>24.90</PRICE> <YEAR>2012</YEAR> </BOOK> <BOOK> <TITLE>Programming Pig</TITLE> <AUTHOR>Alan Gates</AUTHOR> <COUNTRY>USA</COUNTRY> <COMPANY>Horton Works</COMPANY> <PRICE>30.90</PRICE> <YEAR>2013</YEAR> </BOOK> </CATALOG>
Created 06-26-2016 04:31 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @elan chelian. It works with following changes (details here😞
- TITLE must be STRING, it seems XmlSerDe doesn't support VARCHAR yet
- PRICE must be declared as FLOAT or DOUBLE, not INT (e.g., 24.90)
- Your unit record of data is BOOK, not CATALOG
- You are missing text() to capture specific values
Declarations:
DROP TABLE IF EXISTS BOOKDATA; CREATE EXTERNAL TABLE BOOKDATA (TITLE STRING, PRICE FLOAT) ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' WITH SERDEPROPERTIES ( "column.xpath.TITLE"="/BOOK/TITLE/text()", "column.xpath.PRICE"="/BOOK/PRICE/text()") STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION '/user/it1/hive/xml' TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");
Test:
hive> select * from BOOKDATA; OK Hadoop Defnitive Guide 24.9 Programming Pig 30.9 Time taken: 0.081 seconds, Fetched: 2 row(s)
Created 06-26-2016 04:31 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @elan chelian. It works with following changes (details here😞
- TITLE must be STRING, it seems XmlSerDe doesn't support VARCHAR yet
- PRICE must be declared as FLOAT or DOUBLE, not INT (e.g., 24.90)
- Your unit record of data is BOOK, not CATALOG
- You are missing text() to capture specific values
Declarations:
DROP TABLE IF EXISTS BOOKDATA; CREATE EXTERNAL TABLE BOOKDATA (TITLE STRING, PRICE FLOAT) ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' WITH SERDEPROPERTIES ( "column.xpath.TITLE"="/BOOK/TITLE/text()", "column.xpath.PRICE"="/BOOK/PRICE/text()") STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' LOCATION '/user/it1/hive/xml' TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");
Test:
hive> select * from BOOKDATA; OK Hadoop Defnitive Guide 24.9 Programming Pig 30.9 Time taken: 0.081 seconds, Fetched: 2 row(s)
Created 04-19-2020 04:41 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello Sir
I got the output as below but I am not getting any data do you know why?
hive> select * from BOOKDATA; OK Hadoop Defnitive Guide 24.9 Programming Pig 30.9 Time taken: 0.081 seconds, Fetched: 2 row(s)
Created 06-28-2016 02:55 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Predrag Minovic Thanks for the answer. impressive. I was assuming the root node (catalog) needs to be mentioned in the xmlinput.start and xmlinput.end, so that all the nodes in between the root nodes can be queried using the xpath. Thanks for the clarification.
