Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Parse XML in hive linking parent and child elements

Parse XML in hive linking parent and child elements

New Contributor
I have an xml file which am working on to parse in hive. I was able to use xmlserde to extract elements, however, I am unable to link the parent and child node attribute values. consider the below xml
<?xml version="1.0" encoding="utf-8"?>
<Tests>
	<subtests id="1">
		<subtests_data k="1" v="2" />
		<subtests_data k="2" v="2.90" />
		<subtests_data k="3" v="3.80" />
	</subtests>
		<subtests id="2">
		<subtests_data k="1" v="4" />
		<subtests_data k="2" v="5.90" />
		<subtests_data k="3" v="9.80" />
	</subtests>
</Tests>

Here's the hive table to parse the above xml

CREATE EXTERNAL TABLE samplexml(
subtests_id ARRAY<STRING>,
subtests_data_k ARRAY<STRING>,
subtests_data_v ARRAY<STRING>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.subtests_id"="/Tests/Subtests/@id",
"column.xpath.subtests_data_k"="/Tests/subtests_data/@k",
"column.xpath.subtests_data_v"="/Tests/subtests_data/@v",
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/xmls/'
TBLPROPERTIES (
"xmlinput.start"="<Tests>",
"xmlinput.end"="</Tests>"
);

When I query the tables using the below query, I get all values of subtests_data for a subtests. For example, I'll get all subtests_data for subtest Id =1 and 2. So am getting a cross join.

SELECT 
     main_data_enum
    ,main_data_value
    ,device_itar 
    ,device_serial_number
    ,test_itar
    ,test_value
    ,array_index(subtests_id,n1.n) as subtests_id
    ,array_index(subtests_data_k,n1.n) as subtests_data_k
    ,array_index(subtests_data_v,n2.n) as subtests_data_v
FROM  samplexml
LATERAL VIEW numeric_range( size(subtests_id)) n1 as n
LATERAL VIEW numeric_range( size(subtests_data_k)) n2 as n;

How can I get subtests_data values as 2,2.9 & 3.8 for subtests id=1 and 4,5.9 & 9.8 for subtests id=2

Don't have an account?
Coming from Hortonworks? Activate your account here