Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

hive query with xpath on xml with namespace

avatar
Expert Contributor
<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname">
  <tns:second>
    <tns:third>10379</tns:third>
    <tns:four>stats</tns:four>
    <tns:five>1</tns:five>
    <tns:six>
      <tns:DokumentFilIndhold>K</tns:DokumentFilIndhold>
    </tns:six>
    <tns:seven>2018-06-28T12:57:36</tns:seven>
    <tns:eight>2018-06-28T13:02:28</tns:eight>
  </tns:second>
</tns:root>

This is my test xml,

Testdb is an external table on hbase on hadoop closter

i have tried to select the value of element "four" from my hive table where column contents is the xml

use testdb;
select 
FROM_UNIXTIME(CEIL((CAST(SPLIT(row_key, '\\|')[0] AS BIGINT))/1000)) AS received_at
, SPLIT(row_key, '\\|')[1] AS session_id
 ,row_key
 ,xpath_string(contents,'//tns:root/tns:second/tns:third/tns:four' ) AS test
  where row_key = '1530262082747|08004d28-3cf6-4446-bae1-93d43c08c189';

But it returns empty string value, from my research i can se it probably has something to do with the namespace, but i haven't found anything useful out there yet.

does anyone know of any way to either work with the namespace or ignore it in the hive query

8 REPLIES 8

avatar

Hey @Simon Jespersen!
Try to change your //tns:root/tns:second/tns:third/tns:four for root/second/four
I made a simple test here:

hive> select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');
OK
stats

Hope this helps!

avatar
Expert Contributor

Hi

Thanks, but i my hive version it doesn't return anything

select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','/root/second/four');

Returns ""

avatar

Hey @Simon Jespersen!
I'm so sorry for the long delay 😞
So regarding your issue, try to take off the first "/" on the /root/second/four (2nd param for the xpath_string).
Instead of '/root/second/four')would be 'root/second/four')

Hope this helps!

avatar
Expert Contributor

Thank you very much @Vinicius Higa Murakami

When i run this in beeline

select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');

It returns following

Connected to: Apache Hive (version 1.2.1000.2.5.0.0-1245)
Driver: Hive JDBC (version 1.2.1000.2.5.0.0-1245)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://sktudv01hdp01.ccta.dk:2181,sk> use adm_sfo_sit;
No rows affected (3.891 seconds)
0: jdbc:hive2://sktudv01hdp01.ccta.dk:2181,sk> select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');
+------+--+
| _c0  |
+------+--+
|      |
+------+--+
1 row selected (0.076 seconds)

avatar

Hi @Simon Jespersen!
Hmmm that's quite strange 😞
I made the same test as you using beeline this time and still having the stat value:

[root@node3 ~]# beeline -u 'jdbc:hive2://node3:10000/default'  -n hive
Connecting to jdbc:hive2://node3:10000/default
Connected to: Apache Hive (version 1.2.1000.2.6.5.0-292)
Driver: Hive JDBC (version 1.2.1000.2.6.5.0-292)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.6.5.0-292 by Apache Hive
0: jdbc:hive2://node3:10000/default> select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');
+--------+--+
|  _c0   |
+--------+--+
| stats  |
+--------+--+
1 row selected (3.282 seconds)
0: jdbc:hive2://node3:10000/default> select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');
+--------+--+
|  _c0   |
+--------+--+
| stats  |
+--------+--+
1 row selected (0.54 seconds)

Do you mind to test it using hiveCLI? Would be something like this:

hive -e "select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');" 

The only thing that set us apart is the minor number on the hive version:
Mine version 1.2.1000.2.6.5.0-292
Your version 1.2.1000.2.5.0.0-1245

This shouldn't be a problem, anyway I'm gonna check if I find smtg useful between these two versions.

Hope this helps!

avatar
Expert Contributor

Hi @Vinicius Higa Murakami

When i run it on hive cli, it failed

hive -e " select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');"
2018-07-06 08:27:22,044 WARN  [main] conf.HiveConf: HiveConf of name hive.mapred.supports.subdirectories does not exist
Logging initialized using configuration in file:/etc/hive/2.5.0.0-1245/0/hive-log4j.properties
[Fatal Error] :1:21: Open quote is expected for attribute "xmlns:xsi" associated with an  element type  "tns:root".
FAILED: SemanticException [Error 10014]: Line 1:8 Wrong arguments ''root/second/four'': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.xml.UDFXPathString.evaluate(java.lang.String,java.lang.String)  on object org.apache.hadoop.hive.ql.udf.xml.UDFXPathString@629984eb of class org.apache.hadoop.hive.ql.udf.xml.UDFXPathString with arguments {<tns:root xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance xmlns:tns=http://test.com xmlns=http://xmlns.oracle.com/pcbpel/adapter/noname> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>:java.lang.String, root/second/four:java.lang.String} of size 2

I have tried it on a newer installation as well, just a few builds older than yours

Beeline version 1.2.1000.2.6.4.0-91 by Apache Hive

Connected to: Apache Hive (version 1.2.1000.2.6.4.0-91)

Driver: Hive JDBC (version 1.2.1000.2.6.4.0-91)

But still dont return anything.

Fortunately im leaving for vacation today and won't be looking into this the next 2 weeks, i had hoped to solve this before,

Thanks for your help and suggestions.

avatar

Hi @Simon Jespersen!
Oh man, sorry I forgot to escape the double quotes inside the XML 😞

hive -e "select xpath_string('<tns:root xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:tns=\"http://test.com\" xmlns=\"http://xmlns.oracle.com/pcbpel/adapter/noname\"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four'); "

It should work, and now that you mentioned about the hive version. I'm wondering if isn't some set value for Hive.
What you can try is to check your hive client properties (you can use the example below).

hive -e "set;" > hive.properties

I am attaching mine for you to compare.

hive.txt

Enjoy your PTO!
Hope this helps when you get back 🙂

avatar
New Contributor

If you have to convert simple XML files then this approach works well. However, if you have a very large volume of XML files that are based on an industry data standard such as FpML, HL7 etc. then this manual approach becomes very time-consuming. I am also facing in custom nursing essay very must so what I did, I Use 3rd party tools that automate the whole XML conversion process on various big data frameworks such as Hive, Impala, Spark etc.