Member since: 08-14-2018
Posts: 3
Kudos Received: 0
Solutions: 0
02-06-2019 05:22 PM
I have a table with an array<String> column, and queries are extremely slow when accessing the complex type. The table is partitioned, and Impala is doing a broadcast cross join on the "array" component and scanning ALL partitions, whereas in the base part of the query it prunes partitions based on the partition key and scans only the partition I'm querying on.
Here is the query:
select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta, wdl_atomic.retail_transaction_attribute.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;
This is the explain plan:
Explain String
Max Per-Host Resource Reservation: Memory=0B
Per-Host Resource Estimates: Memory=44.61GB
PLAN-ROOT SINK
|
04:EXCHANGE [UNPARTITIONED]
|
02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]
|
|--03:EXCHANGE [BROADCAST]
| |
| 00:SCAN HDFS [retail_transaction_attribute rta]
| partitions=1/50 files=40 size=66.49GB
|
01:SCAN HDFS [retail_transaction_attribute.retail_offering_material_group_distinct_list bow]
partitions=50/50 files=2000 size=763.01GB
Is this expected behavior from Impala? It seems odd that it wouldn't prune to the same single partition in both parts of the query. I also thought Impala didn't need an additional step to unpack the array into scalar values, yet here it performs a CROSS JOIN for that, generating hundreds of billions of rows to be broadcast.
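For reference, here is the alias-correlated form of the nested-collection join from the Impala documentation on complex types. This is only a sketch of what I plan to try next (I haven't verified it changes the plan): the array is referenced through the table alias rta rather than the fully qualified table name, so it should be unnested per parent row instead of being treated as an independent table reference:

select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta, rta.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;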
Any insight is appreciated.
Thanks!
Labels:
- Apache Impala
- HDFS
09-25-2018 10:51 AM
I'm wondering if anyone can help with my issue. I followed the blog from Eric, and I can now submit Spark2-on-YARN jobs through Oozie. However, when I try to write to a Hive table through Spark, I get an error.

In my session I enable Hive support:

sparkBuilder.enableHiveSupport()

I then run an ALTER TABLE via spark.sql:

transactions.sparkSession.sql(s"ALTER TABLE transactions DROP IF EXISTS PARTITION(OHREQUID_PART = ${r.getInt(0)})")

If I run this through spark2-submit, it works fine, but if I run it through Oozie I get the following error:

User class threw exception: java.lang.NoSuchMethodError: org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Iface.get_all_functions()Lorg/apache/hadoop/hive/metastore/api/GetAllFunctionsResponse;
Here is what I've done so far:
- Copied all of the Spark jars from /opt/cloudera/parcels/SPARK2/lib/spark2/jars/ to /user/oozie/share/lib/lib_<slid>/spark2/
- Uploaded hive-site.xml to the same directory, and copied in the oozie-sharelib-spark.jar
- chown'd all of the files to oozie:oozie
- Set oozie.action.sharelib.for.spark=spark2 in my properties file
- Made sure the jars are showing up in the oozie shareliblist for spark2

It seems like a dependency collision to me, but I'm not sure which jar is causing the issue. Thanks for any insights.

These are the Hive-related jars I have loaded in the spark2 sharelib:
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-metastore-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-serde-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-shims-0.23-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-shims-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-shims-common-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-shims-scheduler-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-site.xml
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/spark-hive-exec_2.11-2.3.0.cloudera2.jar
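One thing I'm considering, to narrow down which jar actually wins at runtime, is logging from inside the Spark driver where the metastore interface class is loaded from and whether it exposes the method the error says is missing. A rough sketch, not yet verified in this environment (note that getCodeSource can return null for bootstrap-loaded classes):

// Sketch: report which jar ThriftHiveMetastore$Iface was loaded from,
// and whether it has get_all_functions (the method missing per the error above)
val iface = Class.forName("org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Iface")
println(iface.getProtectionDomain.getCodeSource.getLocation)
println(iface.getMethods.exists(_.getName == "get_all_functions"))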