Created 09-28-2017 01:44 PM
I have created an XML table with the XML SerDe in Hive (I'm using Hortonworks HDP 2.6.1):
CREATE EXTERNAL TABLE xml_factura (
  id STRING, version STRING, ambiente STRING, tipoEmision STRING,
  razonSocial STRING, nombreComercial STRING, ruc STRING, claveAcceso STRING,
  codDoc STRING, estab STRING, ptoEmi STRING, secuencial STRING,
  dirMatriz STRING, fechaEmision STRING, dirEstablecimiento STRING,
  contribuyenteEspecial STRING, obligadoContabilidad STRING,
  tipoIdentificacionComprador STRING, razonSocialComprador STRING,
  identificacionComprador STRING, totalSinImpuestos STRING,
  totalDescuento STRING, totalImpuesto ARRAY<MAP<STRING,STRING>>,
  propina STRING, importeTotal STRING, moneda STRING,
  detalle ARRAY<MAP<STRING,STRING>>,
  infoAdicional STRUCT<nombre:STRING,campoAdicional:STRING>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.id"="/factura/@id",
  "column.xpath.version"="/factura/@version",
  "column.xpath.ambiente"="/factura/infoTributaria/ambiente/text()",
  "column.xpath.tipoEmision"="/factura/infoTributaria/tipoEmision/text()",
  "column.xpath.razonSocial"="/factura/infoTributaria/razonSocial/text()",
  "column.xpath.nombreComercial"="/factura/infoTributaria/nombreComercial/text()",
  "column.xpath.ruc"="/factura/infoTributaria/ruc/text()",
  "column.xpath.claveAcceso"="/factura/infoTributaria/claveAcceso/text()",
  "column.xpath.codDoc"="/factura/infoTributaria/codDoc/text()",
  "column.xpath.estab"="/factura/infoTributaria/estab/text()",
  "column.xpath.ptoEmi"="/factura/infoTributaria/ptoEmi/text()",
  "column.xpath.secuencial"="/factura/infoTributaria/secuencial/text()",
  "column.xpath.dirMatriz"="/factura/infoTributaria/dirMatriz/text()",
  "column.xpath.fechaEmision"="/factura/infoFactura/fechaEmision/text()",
  "column.xpath.dirEstablecimiento"="/factura/infoFactura/dirEstablecimiento/text()",
  "column.xpath.contribuyenteEspecial"="/factura/infoFactura/contribuyenteEspecial/text()",
  "column.xpath.obligadoContabilidad"="/factura/infoFactura/obligadoContabilidad/text()",
  "column.xpath.tipoIdentificacionComprador"="/factura/infoFactura/tipoIdentificacionComprador/text()",
  "column.xpath.razonSocialComprador"="/factura/infoFactura/razonSocialComprador/text()",
  "column.xpath.identificacionComprador"="/factura/infoFactura/identificacionComprador/text()",
  "column.xpath.totalSinImpuestos"="/factura/infoFactura/totalSinImpuestos/text()",
  "column.xpath.totalDescuento"="/factura/infoFactura/totalDescuento/text()",
  "column.xpath.totalImpuesto"="/factura/infoFactura/totalConImpuestos/totalImpuesto",
  "column.xpath.propina"="/factura/infoFactura/propina/text()",
  "column.xpath.importeTotal"="/factura/infoFactura/importeTotal/text()",
  "column.xpath.moneda"="/factura/infoFactura/moneda/text()",
  "column.xpath.detalle"="/factura/detalles/detalle",
  "column.xpath.infoAdicional"="/factura/infoAdicional/campoAdicional")
STORED AS
  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/mamv141114/xmlfacturas'
TBLPROPERTIES ("xmlinput.start"="<factura id", "xmlinput.end"="</factura>");
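To sanity-check the nested columns, a query along these lines can be used. This is a sketch only: the map keys 'descripcion' and 'precioTotalSinImpuesto' are assumptions about the child-element names inside each <detalle>, not confirmed by the DDL above.

```sql
-- Sketch: one output row per invoice line via Hive's LATERAL VIEW explode.
-- The map keys below are assumed child-element names.
SELECT f.secuencial,
       d['descripcion']            AS descripcion,
       d['precioTotalSinImpuesto'] AS precio_total
FROM xml_factura f
LATERAL VIEW explode(f.detalle) dt AS d;
```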
I have uploaded three XML files and I'm able to query the table (SELECT *, SELECT count(*), etc.). Everything works fine.
Then I created the following ORC table:
CREATE TABLE xml_factura_orc (
  id STRING, version STRING, ambiente STRING, tipoEmision STRING,
  razonSocial STRING, nombreComercial STRING, ruc STRING, claveAcceso STRING,
  codDoc STRING, estab STRING, ptoEmi STRING, secuencial STRING,
  dirMatriz STRING, fechaEmision STRING, dirEstablecimiento STRING,
  contribuyenteEspecial STRING, obligadoContabilidad STRING,
  tipoIdentificacionComprador STRING, razonSocialComprador STRING,
  identificacionComprador STRING, totalSinImpuestos STRING,
  totalDescuento STRING, totalImpuesto ARRAY<MAP<STRING,STRING>>,
  propina STRING, importeTotal STRING, moneda STRING,
  detalle ARRAY<MAP<STRING,STRING>>,
  infoAdicional STRUCT<nombre:STRING,campoAdicional:STRING>)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
When I try to insert all the data from the XML table into this ORC table with the following statement:
INSERT OVERWRITE TABLE facturas.xml_factura_orc
SELECT id, version, ambiente, tipoEmision, razonSocial, nombreComercial, ruc,
       claveAcceso, codDoc, estab, ptoEmi, secuencial, dirMatriz, fechaEmision,
       dirEstablecimiento, contribuyenteEspecial, obligadoContabilidad,
       tipoIdentificacionComprador, razonSocialComprador, identificacionComprador,
       totalSinImpuestos, totalDescuento, totalImpuesto, propina, importeTotal,
       moneda, detalle, infoAdicional
FROM facturas.xml_factura;
I get the following error:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex failed, vertexName=Map 1, vertexId=vertex_1506550557761_0005_1_00, diagnostics=[Vertex vertex_1506550557761_0005_1_00 [Map 1] killed/failed due to:INIT_FAILURE, Fail to create InputInitializerManager,
org.apache.tez.dag.api.TezReflectionException: Unable to instantiate class with 1 arguments: org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator
    at org.apache.tez.common.ReflectionUtils.getNewInstance(ReflectionUtils.java:70)
    at org.apache.tez.common.ReflectionUtils.createClazzInstance(ReflectionUtils.java:89)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$1.run(RootInputInitializerManager.java:151)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$1.run(RootInputInitializerManager.java:148)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager.createInitializer(RootInputInitializerManager.java:148)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager.runInputInitializers(RootInputInitializerManager.java:121)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.setupInputInitializerManager(VertexImpl.java:4620)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.access$4400(VertexImpl.java:202)
    at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.handleInitEvent(VertexImpl.java:3436)
    at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.transition(VertexImpl.java:3385)
    at org.apache.tez.dag.app.dag.impl.VertexImpl$InitTransition.transition(VertexImpl.java:3366)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1938)
    at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:201)
    at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:2080)
    at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:2066)
    at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
    at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.tez.common.ReflectionUtils.getNewInstance(ReflectionUtils.java:68)
    ... 25 more
Caused by: java.lang.RuntimeException: Failed to load plan: hdfs://ndatos01.sri.ad:8020/tmp/hive/mamv141114/1762793d-62ac-45d8-ae09-2900aa4851f9/hive_2017-09-28_08-14-12_349_3611498623986481792-11/mamv141114/_tez_scratch_dir/7bed50d7-f0f0-424d-ab95-5d9797635eba/map.xml: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: com.ibm.spss.hive.serde2.xml.XmlInputFormat
Serialization trace:
inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
    at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:479)
    at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork(Utilities.java:318)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.<init>(HiveSplitGenerator.java:101)
    ... 30 more
Caused by: org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: com.ibm.spss.hive.serde2.xml.XmlInputFormat
Serialization trace:
inputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
    at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
    at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:238)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:226)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:745)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
    at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
    at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
    at org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(Utilities.java:1182)
    at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:1069)
    at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:1083)
    at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:439)
    ... 32 more
Caused by: java.lang.ClassNotFoundException: com.ibm.spss.hive.serde2.xml.XmlInputFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
    ... 50 more
]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)
Where do I have to put hivexmlserde.jar so that this INSERT works? Has anyone faced the same problem? I have copied the jar into several lib directories with no success, so I think there's somewhere else I need to copy it.
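For reference, the root cause in the trace is a ClassNotFoundException inside the Tez task: the SerDe jar is visible to HiveServer2 at plan time but not to the containers at run time. A pattern that often helps is to register the jar in the session so it is distributed with the job, or to list it in hive.aux.jars.path in the HiveServer2 configuration so it is always shipped. A sketch, where the jar path /tmp/hivexmlserde.jar is an assumption to be adjusted to your cluster:

```sql
-- Sketch only; adjust the jar path to wherever the jar lives on the node.
ADD JAR /tmp/hivexmlserde.jar;

INSERT OVERWRITE TABLE facturas.xml_factura_orc
SELECT * FROM facturas.xml_factura;
```

SELECT * is used here only because the two tables declare the same columns in the same order.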
Is there another way to read a batch of XML files and load them into an ORC table (or another Hive table)?
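One alternative that avoids the third-party SerDe entirely, assuming each XML document can be staged as a single line of text, is to read the raw XML into a one-column table and parse it with Hive's built-in xpath UDFs. A sketch; the staging location is hypothetical:

```sql
-- Sketch: each row of `doc` is assumed to hold one complete <factura> document.
CREATE EXTERNAL TABLE xml_raw (doc STRING)
STORED AS TEXTFILE
LOCATION '/user/mamv141114/xmlfacturas_flat';  -- hypothetical staging dir

-- Extract scalar fields with the built-in xpath_string UDF.
SELECT xpath_string(doc, '/factura/infoTributaria/ruc')       AS ruc,
       xpath_string(doc, '/factura/infoFactura/importeTotal') AS importeTotal
FROM xml_raw;
```

Since xpath_string ships with Hive, an INSERT built on such a SELECT runs without any extra jars on the Tez classpath.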
I would really appreciate it if anyone could help me.
Created 06-25-2018 05:17 PM
Were you able to figure it out? I am facing the same issue. @Miguel Marquez
Could you please let me know how you got it to work?
Created 06-25-2018 05:36 PM
Actually, I was waiting for an answer myself; my problem was never solved. I switched to another solution that doesn't involve XML files. Sorry @Anji Raju
Created 06-25-2018 07:28 PM
@Shu Any ideas?
Created 08-08-2018 08:08 PM
I am facing the same issue. Were you able to resolve this, @Anji Raju?
Created 06-03-2020 08:00 AM
Hi, I have created a Hive table with the XML SerDe.
How did you load xml data to your xml_factura table?
Thanks in advance.
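On the loading question: since xml_factura above is declared EXTERNAL with a LOCATION clause, no explicit load step is strictly required; any file placed under that HDFS directory is picked up on the next query. One way to move an already-uploaded HDFS file into the table's directory is LOAD DATA INPATH, sketched below with a hypothetical source path:

```sql
-- LOAD DATA INPATH moves (not copies) the HDFS file into the table's directory.
-- '/tmp/factura1.xml' is a hypothetical source path.
LOAD DATA INPATH '/tmp/factura1.xml' INTO TABLE xml_factura;
```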
Created 06-03-2020 08:12 AM
Hi, I found a way to do it. Thanks.