Support Questions

Find answers, ask questions, and share your expertise

Hive and XML parsing and saving table as ORC or textFile

avatar
New Contributor

For below sample xml file:

<Company>
<Employee>
<Id>458790</Id>
<Name>Sameer</Name>
<Email>sameer@email.com</Email>
<Address>
<HouseNo>105</HouseNo>
<Street>Grand Road</Street>
<City>Bangalore</City>
<State>Karnataka</State>
<Pincode>560068</Pincode>
<Country>India</Country>
<Passport>Available</Passport>
<Visa/>
<Contact>
<Mobile>9909999999</Mobile>
<Phone>8044552266</Phone>
</Contact>
</Address>
</Employee>
<Employee>
<Id>458791</Id>
<Name>Gohar</Name>
<Email>Gohar@email.com</Email>
<Address>
<HouseNo>485</HouseNo>
<Street>Camac Street Road</Street>
<City>Mumbai</City>
<State>Maharastra</State>
<Pincode>400001</Pincode>
<Country>India</Country>
<Passport>Available</Passport>
<Visa/>
<Contact>
<Mobile>9908888888</Mobile>
<Phone/>
</Contact>
</Address>
</Employee>
</Company>


Executed this query for a table containing three columns

id name email

Getting all employees information from xml into different rows like:

458790 Sameer sameer@email.com
458791 Gohar Gohar@email.com

 

by executing below query
CREATE EXTERNAL TABLE Company (Id STRING, Name String,Email String)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Id"="Employee/Id/text()",
"column.xpath.Name"="Employee/Name/text()",
"column.xpath.Email"="Employee/Email/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/add/hive/xml'
TBLPROPERTIES ("xmlinput.start"="<Company","xmlinput.end"= "</Company>")

but when I am storing it as irc or textfile it is giving null values

 

like stored as textfile then

NULL NULL NULL

how  to resolve it, Thanks

1 ACCEPTED SOLUTION

avatar
Super Guru

@saaga119 The solution here is to create the external table in the format you need to query the raw xml data.  Seems like you have this done already.   Next create a native hive table.  It could still be external too. The new table should be the schema for the fields you want with required ORC lines.  You can also choose some other text format; for example: csv.   Last INSERT INTO new_table SELECT * FROM external_table.

 

The main idea here is to use external and staging tables, then INSERT INTO SELECT to fill the final table you want.  Final table could then be optimized for performance (ORC).   This kind of idea also allows multiple external/raw/staging tables able to be combined (select w/ join) into a single final optimized table (orc, compression, partitions, buckets, etc).

View solution in original post

1 REPLY 1

avatar
Super Guru

@saaga119 The solution here is to create the external table in the format you need to query the raw xml data.  Seems like you have this done already.   Next create a native hive table.  It could still be external too. The new table should be the schema for the fields you want with required ORC lines.  You can also choose some other text format; for example: csv.   Last INSERT INTO new_table SELECT * FROM external_table.

 

The main idea here is to use external and staging tables, then INSERT INTO SELECT to fill the final table you want.  Final table could then be optimized for performance (ORC).   This kind of idea also allows multiple external/raw/staging tables able to be combined (select w/ join) into a single final optimized table (orc, compression, partitions, buckets, etc).