Created 06-29-2018 01:15 PM
<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second> <tns:third>10379</tns:third> <tns:four>stats</tns:four> <tns:five>1</tns:five> <tns:six> <tns:DokumentFilIndhold>K</tns:DokumentFilIndhold> </tns:six> <tns:seven>2018-06-28T12:57:36</tns:seven> <tns:eight>2018-06-28T13:02:28</tns:eight> </tns:second> </tns:root>
This is my test xml,
Testdb is an external table on hbase on hadoop closter
i have tried to select the value of element "four" from my hive table where column contents is the xml
use testdb; select FROM_UNIXTIME(CEIL((CAST(SPLIT(row_key, '\\|')[0] AS BIGINT))/1000)) AS received_at , SPLIT(row_key, '\\|')[1] AS session_id ,row_key ,xpath_string(contents,'//tns:root/tns:second/tns:third/tns:four' ) AS test where row_key = '1530262082747|08004d28-3cf6-4446-bae1-93d43c08c189';
But it returns empty string value, from my research i can se it probably has something to do with the namespace, but i haven't found anything useful out there yet.
does anyone know of any way to either work with the namespace or ignore it in the hive query
Created 06-29-2018 09:20 PM
Hey @Simon Jespersen!
Try to change your //tns:root/tns:second/tns:third/tns:four for root/second/four
I made a simple test here:
hive> select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four'); OK stats
Hope this helps!
Created 07-02-2018 07:06 AM
Hi
Thanks, but i my hive version it doesn't return anything
select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','/root/second/four');
Returns ""
Created 07-05-2018 06:05 AM
Hey @Simon Jespersen!
I'm so sorry for the long delay 😞
So regarding your issue, try to take off the first "/" on the /root/second/four (2nd param for the xpath_string).
Instead of '/root/second/four')would be 'root/second/four')
Hope this helps!
Created 07-05-2018 06:58 AM
Thank you very much @Vinicius Higa Murakami
When i run this in beeline
select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');
It returns following
Connected to: Apache Hive (version 1.2.1000.2.5.0.0-1245) Driver: Hive JDBC (version 1.2.1000.2.5.0.0-1245) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://sktudv01hdp01.ccta.dk:2181,sk> use adm_sfo_sit; No rows affected (3.891 seconds) 0: jdbc:hive2://sktudv01hdp01.ccta.dk:2181,sk> select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four'); +------+--+ | _c0 | +------+--+ | | +------+--+ 1 row selected (0.076 seconds)
Created 07-06-2018 05:41 AM
Hi @Simon Jespersen!
Hmmm that's quite strange 😞
I made the same test as you using beeline this time and still having the stat value:
[root@node3 ~]# beeline -u 'jdbc:hive2://node3:10000/default' -n hive Connecting to jdbc:hive2://node3:10000/default Connected to: Apache Hive (version 1.2.1000.2.6.5.0-292) Driver: Hive JDBC (version 1.2.1000.2.6.5.0-292) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1000.2.6.5.0-292 by Apache Hive 0: jdbc:hive2://node3:10000/default> select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four'); +--------+--+ | _c0 | +--------+--+ | stats | +--------+--+ 1 row selected (3.282 seconds) 0: jdbc:hive2://node3:10000/default> select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four'); +--------+--+ | _c0 | +--------+--+ | stats | +--------+--+ 1 row selected (0.54 seconds)
Do you mind to test it using hiveCLI? Would be something like this:
hive -e "select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');"
The only thing that set us apart is the minor number on the hive version:
Mine version 1.2.1000.2.6.5.0-292
Your version 1.2.1000.2.5.0.0-1245
This shouldn't be a problem, anyway I'm gonna check if I find smtg useful between these two versions.
Hope this helps!
Created 07-06-2018 06:39 AM
When i run it on hive cli, it failed
hive -e " select xpath_string('<tns:root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tns="http://test.com" xmlns="http://xmlns.oracle.com/pcbpel/adapter/noname"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four');" 2018-07-06 08:27:22,044 WARN [main] conf.HiveConf: HiveConf of name hive.mapred.supports.subdirectories does not exist Logging initialized using configuration in file:/etc/hive/2.5.0.0-1245/0/hive-log4j.properties [Fatal Error] :1:21: Open quote is expected for attribute "xmlns:xsi" associated with an element type "tns:root". FAILED: SemanticException [Error 10014]: Line 1:8 Wrong arguments ''root/second/four'': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.xml.UDFXPathString.evaluate(java.lang.String,java.lang.String) on object org.apache.hadoop.hive.ql.udf.xml.UDFXPathString@629984eb of class org.apache.hadoop.hive.ql.udf.xml.UDFXPathString with arguments {<tns:root xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance xmlns:tns=http://test.com xmlns=http://xmlns.oracle.com/pcbpel/adapter/noname> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>:java.lang.String, root/second/four:java.lang.String} of size 2
I have tried it on a newer installation as well, just a few builds older than yours
Beeline version 1.2.1000.2.6.4.0-91 by Apache Hive
Connected to: Apache Hive (version 1.2.1000.2.6.4.0-91)
Driver: Hive JDBC (version 1.2.1000.2.6.4.0-91)
But still dont return anything.
Fortunately im leaving for vacation today and won't be looking into this the next 2 weeks, i had hoped to solve this before,
Thanks for your help and suggestions.
Created 07-07-2018 08:21 AM
Hi @Simon Jespersen!
Oh man, sorry I forgot to escape the double quotes inside the XML 😞
hive -e "select xpath_string('<tns:root xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:tns=\"http://test.com\" xmlns=\"http://xmlns.oracle.com/pcbpel/adapter/noname\"> <tns:second><tns:third>10379</tns:third><tns:four>stats</tns:four><tns:five>1</tns:five><tns:six><tns:DokumentFilIndhold>K</tns:DokumentFilIndhold></tns:six><tns:seven>2018-06-28T12:57:36</tns:seven><tns:eight>2018-06-28T13:02:28</tns:eight></tns:second></tns:root>','root/second/four'); "
It should work, and now that you mentioned about the hive version. I'm wondering if isn't some set value for Hive.
What you can try is to check your hive client properties (you can use the example below).
hive -e "set;" > hive.properties
I am attaching mine for you to compare.
Enjoy your PTO!
Hope this helps when you get back 🙂
Created 07-20-2018 08:03 AM
If you have to convert simple XML files then this approach works well. However, if you have a very large volume of XML files that are based on an industry data standard such as FpML, HL7 etc. then this manual approach becomes very time-consuming. I am also facing in custom nursing essay very must so what I did, I Use 3rd party tools that automate the whole XML conversion process on various big data frameworks such as Hive, Impala, Spark etc.