Created on 03-29-2020 03:13 AM - edited 03-29-2020 05:25 AM
For the below sample XML file:
<Company>
  <Employee>
    <Id>458790</Id>
    <Name>Sameer</Name>
    <Email>sameer@email.com</Email>
    <Address>
      <HouseNo>105</HouseNo>
      <Street>Grand Road</Street>
      <City>Bangalore</City>
      <State>Karnataka</State>
      <Pincode>560068</Pincode>
      <Country>India</Country>
      <Passport>Available</Passport>
      <Visa/>
      <Contact>
        <Mobile>9909999999</Mobile>
        <Phone>8044552266</Phone>
      </Contact>
    </Address>
  </Employee>
  <Employee>
    <Id>458791</Id>
    <Name>Gohar</Name>
    <Email>Gohar@email.com</Email>
    <Address>
      <HouseNo>485</HouseNo>
      <Street>Camac Street Road</Street>
      <City>Mumbai</City>
      <State>Maharastra</State>
      <Pincode>400001</Pincode>
      <Country>India</Country>
      <Passport>Available</Passport>
      <Visa/>
      <Contact>
        <Mobile>9908888888</Mobile>
        <Phone/>
      </Contact>
    </Address>
  </Employee>
</Company>
I created a table containing three columns (id, name, email) and I am able to get each employee's information from the XML as a separate row, like:
458790 Sameer sameer@email.com
458791 Gohar Gohar@email.com
by executing the query below:
CREATE EXTERNAL TABLE Company (Id STRING, Name STRING, Email STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.Id"="Employee/Id/text()",
  "column.xpath.Name"="Employee/Name/text()",
  "column.xpath.Email"="Employee/Email/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/add/hive/xml'
TBLPROPERTIES ("xmlinput.start"="<Company", "xmlinput.end"="</Company>")
But when I store the table as ORC or as a textfile, it returns NULL values. For example, when stored as textfile I get:
NULL NULL NULL
How do I resolve this? Thanks.
Created 03-29-2020 05:58 AM
@saaga119 The solution here is to create the external table in the format you need to query the raw XML data, which it seems you have already done. Next, create a native Hive table (it could still be external too). The new table should have the schema for the fields you want, with the required ORC clauses. You could also choose some other text format, for example CSV. Last, run INSERT INTO new_table SELECT * FROM external_table.
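For example, here is a minimal sketch of that flow, assuming the XML-backed external table is the Company table created above and using a hypothetical final table name company_orc:

-- Final table: same columns as the XML staging table, stored as ORC.
CREATE TABLE company_orc (
  Id STRING,
  Name STRING,
  Email STRING)
STORED AS ORC;

-- Fill the ORC table from the XML-backed external table.
INSERT INTO TABLE company_orc
SELECT Id, Name, Email FROM Company;

Querying company_orc afterwards should return the parsed rows rather than NULLs, because the ORC table holds materialized data instead of trying to read the raw XML with an ORC/textfile storage format.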
The main idea here is to use external and staging tables, then INSERT INTO ... SELECT to fill the final table you want. The final table can then be optimized for performance (ORC). This approach also allows multiple external/raw/staging tables to be combined (SELECT with JOIN) into a single final optimized table (ORC, compression, partitions, buckets, etc.).
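As a hedged illustration of that combining step, assuming two hypothetical staging tables employees_xml and contacts_xml that share an Id column, and a hypothetical partitioned, compressed ORC target:

-- Hypothetical final table, partitioned by country and compressed with ZLIB.
CREATE TABLE employee_final (
  Id STRING,
  Name STRING,
  Email STRING,
  Mobile STRING)
PARTITIONED BY (Country STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

-- Needed when the partition value comes from the SELECT rather than a static spec.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Join the staging tables and load into the partitioned ORC table.
INSERT INTO TABLE employee_final PARTITION (Country)
SELECT e.Id, e.Name, e.Email, c.Mobile, c.Country
FROM employees_xml e
JOIN contacts_xml c ON e.Id = c.Id;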