Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Dynamically Parsing XML data in Hive Column and storing into Map

Highlighted

Dynamically Parsing XML data in Hive Column and storing into Map

New Contributor

Hey all,

I have a column in hive that xml data, however the xml format is not static, it changes based on a category column and there are alot. I wanted to know if anybody knows if there is a way to programmaitcally take out the tag and the value in the xml column and store it in map column. Best thing I found is xpath in hive however I need to declare what I am taking out. I can do it with a script via java/scala/python however I would just be connecting to Hive via Jdbc and pulling the data out and doing the work client side. And Unfortunately we do not have spark which I think would be able to do it in.

Any ideas?

6 REPLIES 6

Re: Dynamically Parsing XML data in Hive Column and storing into Map

@Neil Patel

Most flexible way is to use XML Serde (https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources) where you can read XML using map and also nested maps/arrays.

The problem is the XML Serde does not work when you already have a table / column with xml, you would need to use their internal functions and write your own UDFs.

The workaround would be to create one table only with your XML column, save as text format, use XML Serde to read textfile with XML content translating them into maps/arrays and finally join result back with your original table.

Also more examples here:

https://community.hortonworks.com/questions/40979/hive-xml-parising-null-value-returned.html

https://community.hortonworks.com/content/kbentry/972/hive-and-xml-pasring.html

Re: Dynamically Parsing XML data in Hive Column and storing into Map

New Contributor

Im looking at the examples and they don't seem to address my issue of dynamically getting the tag and value, in the examples they declaring which tags to parse. I can do that already with xpath.

Re: Dynamically Parsing XML data in Hive Column and storing into Map

Look at this example:

<records>
    <record customer_id="0000-JTALA">
        <income>200000</income>     
        <demographics>
            <gender>F</gender>
            <agecat>1</agecat>
            <edcat>1</edcat>
            <jobcat>2</jobcat>
            <empcat>2</empcat>
            <retire>0</retire>
            <jobsat>1</jobsat>
            <marital>1</marital>
            <spousedcat>1</spousedcat>
            <residecat>4</residecat>
            <homeown>0</homeown>
            <hometype>2</hometype>
            <addresscat>2</addresscat>
        </demographics>
        <financial>
            <income>18</income>
            <creddebt>1.003392</creddebt>
            <othdebt>2.740608</othdebt>
            <default>0</default>
        </financial>
    </record>
    <record>
    .....
    </record>
</records>
CREATE TABLE xml_bank(customer_id STRING, income BIGINT, demographics map<string,string>, financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/@customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);


select customer, income demographics, financial from xml_bank limit 1; 
"0000-JTALA"
"200000"
{"gender":"F", "agecat":"1", "edcat":"1", "jobcat":"2", "empcat":"2", "retire":"0", "jobsat":"1", "marital":"1", "spousedcat":"1", "residecat":"4", "homeown":"0", "hometype":"2", "addresscat":"2"}
{"income": "18", "creddebt": "1.003392", "othdebt": "2.740608", "default": "0"}

Re: Dynamically Parsing XML data in Hive Column and storing into Map

New Contributor

Ahh okay thank you makes much more sense now.

Unfortunately we decided to just a create custom python udf to parse the xml column, this will be easier for us since we wouldn't need a seperate process just to extract the xml into another file.

Re: Dynamically Parsing XML data in Hive Column and storing into Map

New Contributor

Re: Dynamically Parsing XML data in Hive Column and storing into Map

New Contributor
Don't have an account?
Coming from Hortonworks? Activate your account here