Member since: 08-14-2018
Posts: 3
Kudos Received: 0
Solutions: 0
02-06-2019 05:22 PM
I have a table with an array<String> column, and queries are extremely slow when accessing the complex type. The table is partitioned, and Impala is doing a broadcast cross join on the "array" component and scanning ALL partitions, whereas in the base part of the query it prunes partitions based on the partition key and scans only the partition I'm querying on.
Here is the query:
select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta, wdl_atomic.retail_transaction_attribute.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;
This is the explain plan:
Explain String
Max Per-Host Resource Reservation: Memory=0B
Per-Host Resource Estimates: Memory=44.61GB
PLAN-ROOT SINK
|
04:EXCHANGE [UNPARTITIONED]
|
02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST]
|
|--03:EXCHANGE [BROADCAST]
| |
| 00:SCAN HDFS [retail_transaction_attribute rta]
| partitions=1/50 files=40 size=66.49GB
|
01:SCAN HDFS [retail_transaction_attribute.retail_offering_material_group_distinct_list bow]
partitions=50/50 files=2000 size=763.01GB
Is this expected behavior from Impala? It seems odd that it wouldn't prune to the same single partition in both parts of the query. I also thought Impala didn't need an additional step to unpack the array into scalar values, yet here it performs a CROSS JOIN for that, generating hundreds of billions of rows to be broadcast.
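For reference, here is the alias-correlated form of the nested-collection join from the Impala documentation on complex types. This is only a sketch of what I plan to try next (I haven't verified it changes the plan): the array is referenced through the table alias rta rather than the fully qualified table name, so it should be unnested per parent row instead of being treated as an independent table reference:

select rta.transaction_purchase_id, rta.cigarette_transaction_flag, rta.non_cig_merch_transaction_flag, bow.item
from wdl_atomic.retail_transaction_attribute rta, rta.retail_offering_material_group_distinct_list bow
where rta.fiscal_period_id = 2019001;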
Any insight is appreciated.
Thanks!
Labels:
- Apache Impala
- HDFS
09-25-2018 10:51 AM
I'm wondering if anyone can help with my issue. I followed the blog from Eric, and I can now submit Spark2-on-YARN jobs through Oozie. However, when I try to write to a Hive table through Spark, I get an error.

In my session I enable Hive support:

sparkBuilder.enableHiveSupport()

I then run an ALTER TABLE via spark.sql:

transactions.sparkSession.sql(s"ALTER TABLE transactions DROP IF EXISTS PARTITION(OHREQUID_PART = ${r.getInt(0)})")

If I run this through spark2-submit, it works fine, but if I run it through Oozie I get the following error:

User class threw exception: java.lang.NoSuchMethodError: org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Iface.get_all_functions()Lorg/apache/hadoop/hive/metastore/api/GetAllFunctionsResponse;
Here is what I've done so far:
- Copied all of the Spark jars from /opt/cloudera/parcels/SPARK2/lib/spark2/jars/ to /user/oozie/share/lib/lib_<slid>/spark2/
- Uploaded hive-site.xml to the same directory, and copied in the oozie-sharelib-spark.jar
- chown'd all of the files to oozie:oozie
- Set oozie.action.sharelib.for.spark=spark2 in my properties file
- Made sure the jars are showing up in the oozie shareliblist for spark2

It seems like a dependency collision to me, but I'm not sure which jar is causing the issue. Thanks for any insights.

These are the Hive-related jars I have loaded in the spark2 sharelib:
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-metastore-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-serde-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-shims-0.23-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-shims-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-shims-common-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-shims-scheduler-1.1.0-cdh5.13.3.jar
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/hive-site.xml
hdfs://nameservice1/user/oozie/share/lib/lib_20180613213413/spark2/spark-hive-exec_2.11-2.3.0.cloudera2.jar
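One thing I'm considering, to narrow down which jar actually wins at runtime, is logging from inside the Spark driver where the metastore interface class is loaded from and whether it exposes the method the error says is missing. A rough sketch, not yet verified in this environment (note that getCodeSource can return null for bootstrap-loaded classes):

// Sketch: report which jar ThriftHiveMetastore$Iface was loaded from,
// and whether it has get_all_functions (the method missing per the error above)
val iface = Class.forName("org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Iface")
println(iface.getProtectionDomain.getCodeSource.getLocation)
println(iface.getMethods.exists(_.getName == "get_all_functions"))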