Hive and XML parsing and saving table as ORC or textFile


For the sample XML file below:

<Company>
  <Employee>
    <Id>458790</Id>
    <Name>Sameer</Name>
    <Email>sameer@email.com</Email>
    <Address>
      <HouseNo>105</HouseNo>
      <Street>Grand Road</Street>
      <City>Bangalore</City>
      <State>Karnataka</State>
      <Pincode>560068</Pincode>
      <Country>India</Country>
      <Passport>Available</Passport>
      <Visa/>
      <Contact>
        <Mobile>9909999999</Mobile>
        <Phone>8044552266</Phone>
      </Contact>
    </Address>
  </Employee>
  <Employee>
    <Id>458791</Id>
    <Name>Gohar</Name>
    <Email>Gohar@email.com</Email>
    <Address>
      <HouseNo>485</HouseNo>
      <Street>Camac Street Road</Street>
      <City>Mumbai</City>
      <State>Maharastra</State>
      <Pincode>400001</Pincode>
      <Country>India</Country>
      <Passport>Available</Passport>
      <Visa/>
      <Contact>
        <Mobile>9908888888</Mobile>
        <Phone/>
      </Contact>
    </Address>
  </Employee>
</Company>


I executed the query below to create a table containing three columns:

id name email

It gives me all the employees' information from the XML in different rows, like:

458790 Sameer sameer@email.com
458791 Gohar Gohar@email.com

 

The query:
CREATE EXTERNAL TABLE Company (Id STRING, Name STRING, Email STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.Id"="Employee/Id/text()",
  "column.xpath.Name"="Employee/Name/text()",
  "column.xpath.Email"="Employee/Email/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/add/hive/xml'
TBLPROPERTIES ("xmlinput.start"="<Company", "xmlinput.end"="</Company>");

But when I store it as ORC or textfile, it gives null values.

 

For example, when stored as textfile I get:

NULL NULL NULL

How can I resolve this? Thanks.

1 ACCEPTED SOLUTION

@saaga119 The solution here is to create the external table in the format you need to query the raw XML data. It seems you have this done already. Next, create a native Hive table; it could still be external too. The new table should have the schema for the fields you want, with the required ORC clauses. You could also choose some other text format, for example CSV. Last, INSERT INTO new_table SELECT * FROM external_table.
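
In outline, that could look like the following (company_orc is just a placeholder name for the new table):

-- Native table in the final format you want (ORC here).
CREATE TABLE company_orc (
  id    STRING,
  name  STRING,
  email STRING)
STORED AS ORC;

-- Fill the ORC table from the XML-backed external table.
INSERT INTO TABLE company_orc
SELECT id, name, email
FROM Company;

-- A delimited text copy would work the same way, e.g.:
-- CREATE TABLE company_csv (id STRING, name STRING, email STRING)
--   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
--   STORED AS TEXTFILE;
-- INSERT INTO TABLE company_csv SELECT id, name, email FROM Company;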

 

The main idea here is to use external and staging tables, then INSERT INTO ... SELECT to fill the final table you want. The final table can then be optimized for performance (ORC). This approach also allows multiple external/raw/staging tables to be combined (SELECT with JOIN) into a single final optimized table (ORC, compression, partitions, buckets, etc.).
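
As a rough sketch of that broader pattern (the table and column names below are hypothetical, not from the question), a partitioned, compressed ORC table could be filled from two staging tables with a join:

-- Dynamic partition inserts usually need these session settings:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE employee_final (
  id    STRING,
  name  STRING,
  email STRING,
  city  STRING)
PARTITIONED BY (country STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

-- employee_staging and address_staging stand in for the raw/staging tables.
INSERT INTO TABLE employee_final PARTITION (country)
SELECT e.id, e.name, e.email, a.city, a.country
FROM employee_staging e
JOIN address_staging a ON e.id = a.employee_id;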
