Member since: 07-01-2015
Posts: 460
Kudos Received: 78
Solutions: 43
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1753 | 11-26-2019 11:47 PM |
| | 1634 | 11-25-2019 11:44 AM |
| | 10734 | 08-07-2019 12:48 AM |
| | 2640 | 04-17-2019 03:09 AM |
| | 4349 | 02-18-2019 12:23 AM |
10-04-2018
07:11 AM
1 Kudo
`sudo pip install --upgrade setuptools` resolved the issue.
10-03-2018
01:38 AM
Hi Cloudera Impala community,
We have many join queries between Impala (HDFS) and Kudu datasets where a large Kudu table is joined with a small HDFS table. In many cases the filter on the HDFS dataset returns 0 rows, yet the query still scans all ~600 million records in Kudu.
For example, this query runs for 10+ minutes even though it is clear that the right side of the join, the broadcast dataset, is empty (zero rows). The optimizer could simply skip the Kudu scan and return an empty result set.
Query plan
Max Per-Host Resource Reservation: Memory=17.00MB
Per-Host Resource Estimates: Memory=193.00MB
F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
| Per-Host Resources: mem-estimate=0B mem-reservation=0B
PLAN-ROOT SINK
| mem-estimate=0B mem-reservation=0B
|
04:EXCHANGE [UNPARTITIONED]
| mem-estimate=0B mem-reservation=0B
| tuple-ids=0,1 row-size=1.62KB cardinality=88114
|
F00:PLAN FRAGMENT [RANDOM] hosts=9 instances=9
Per-Host Resources: mem-estimate=17.00MB mem-reservation=17.00MB
02:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: tab.key1 = pk.key1, tab.key2 = pk.key2, tab.key3 = pk.key3, tab.key4 = pk.key4, tab.root_key3 = pk.root_key3
| fk/pk conjuncts: tab.key1 = pk.key1, tab.key2 = pk.key2, tab.key3 = pk.key3, tab.key4 = pk.key4, tab.root_key3 = pk.root_key3
| mem-estimate=17.00MB mem-reservation=17.00MB spill-buffer=1.00MB
| tuple-ids=0,1 row-size=1.62KB cardinality=88114
|
|--03:EXCHANGE [BROADCAST]
| | mem-estimate=0B mem-reservation=0B
| | tuple-ids=1 row-size=159B cardinality=88114
| |
| F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
| Per-Host Resources: mem-estimate=176.00MB mem-reservation=0B
| 01:SCAN HDFS [work.pk_table_pk pk, RANDOM]
| partitions=1/1 files=1 size=2.51MB
| predicates: pk.rn = 1, pk.tr_code IN ('D')
| stats-rows=249225 extrapolated-rows=disabled
| table stats: rows=249225 size=2.51MB
| column stats: all
| parquet statistics predicates: pk.rn = 1, pk.tr_code IN ('D')
| parquet dictionary predicates: pk.rn = 1, pk.tr_code IN ('D')
| mem-estimate=176.00MB mem-reservation=0B
| tuple-ids=1 row-size=159B cardinality=88114
|
00:SCAN KUDU [base.large_table tab]
mem-estimate=0B mem-reservation=0B
tuple-ids=0 row-size=1.46KB cardinality=674357713
The query:
select *
from base.large_table tab join
work.pk_table_pk pk ON
tab.`key1` = pk.`key1` and
tab.`key2` = pk.`key2` and
tab.`key3` = pk.`key3` and
tab.`key4` = pk.`key4` and
tab.`root_key3` = pk.`root_key3`
where pk.tr_code in ('D') and pk.rn =1
This is running on CDH 5.13. Could this be optimized in a future version of Impala?
Thanks
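The short-circuit being requested can be sketched with a toy model of a broadcast hash join in plain Python (this is an illustration of the idea, not Impala internals): once the build side is known to be empty, an inner join is provably empty, so the large probe-side scan could be skipped entirely.

```python
def broadcast_inner_join(probe_rows, build_rows, keys):
    """Toy model of a broadcast inner hash join (not Impala internals)."""
    # Build phase: hash the small (broadcast) side on the join keys.
    build = {}
    for row in build_rows:
        build.setdefault(tuple(row[k] for k in keys), []).append(row)
    # The short-circuit this post asks for: with an empty build side,
    # an inner join returns no rows, so the large scan can be skipped.
    if not build:
        return []
    # Probe phase: stream the large side and look up matches.
    out = []
    for row in probe_rows:
        for match in build.get(tuple(row[k] for k in keys), []):
            out.append({**row, **match})
    return out

# The filtered HDFS side produced zero rows, so no probe rows survive.
print(broadcast_inner_join([{"key1": 1, "v": "x"}], [], ["key1"]))  # []
```

In the toy model the empty-build check costs nothing; in a real engine the build side must finish before the probe scan is scheduled, which is essentially what runtime filtering does in later Impala versions.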
Labels:
- Apache Impala
- Apache Kudu
- HDFS
10-01-2018
11:54 PM
Side question: what is the difference between a coordinator and an executor?
10-01-2018
11:41 PM
Cloud VM ami-3548444c plus yum -y update
10-01-2018
12:41 AM
Hi Cloudera employees,
could somebody please have a look at this error and explain why this fresh installation fails?
OS: CentOS Linux release 7.5.1804 (Core)
Thanks
sudo pip install cloudera-director-python-client
copying build/lib/cloudera/director/v9/models/warning_info.py -> build/bdist.linux-x86_64/wheel/cloudera/director/v9/models
running install_egg_info
Copying cloudera_director_python_client.egg-info to build/bdist.linux-x86_64/wheel/cloudera_director_python_client-6.0.0-py2.7.egg-info
running install_scripts
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-rWjXLN/cloudera-director-python-client/setup.py", line 47, in <module>
"""
File "/usr/lib64/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/site-packages/wheel/bdist_wheel.py", line 237, in run
self.egg2dist(self.egginfo_dir, distinfo_dir)
File "/usr/lib/python2.7/site-packages/wheel/bdist_wheel.py", line 344, in egg2dist
pkg_info = pkginfo_to_metadata(egginfo_path, pkginfo_path)
File "/usr/lib/python2.7/site-packages/wheel/metadata.py", line 88, in pkginfo_to_metadata
for key, value in generate_requirements({extra: reqs}):
File "/usr/lib/python2.7/site-packages/wheel/metadata.py", line 67, in generate_requirements
for new_req in convert_requirements(depends):
File "/usr/lib/python2.7/site-packages/wheel/metadata.py", line 36, in convert_requirements
spec = requires_to_requires_dist(parsed_requirement)
File "/usr/lib/python2.7/site-packages/wheel/metadata.py", line 21, in requires_to_requires_dist
if requirement.url:
AttributeError: Requirement instance has no attribute 'url'
----------------------------------------
Failed building wheel for cloudera-director-python-client
Running setup.py clean for cloudera-director-python-client
Failed to build cloudera-director-python-client
Installing collected packages: urllib3, six, certifi, python-dateutil, cloudera-director-python-client
Found existing installation: urllib3 1.10.2
Uninstalling urllib3-1.10.2:
Successfully uninstalled urllib3-1.10.2
Found existing installation: six 1.9.0
Uninstalling six-1.9.0:
Successfully uninstalled six-1.9.0
Running setup.py install for cloudera-director-python-client ... done
Successfully installed certifi-2018.8.24 cloudera-director-python-client-6.0.0 python-dateutil-2.7.3 six-1.11.0 urllib3-1.23
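The traceback ends in wheel's `requires_to_requires_dist` doing `if requirement.url:`; the `Requirement` class in older setuptools releases has no `url` attribute, which is why upgrading setuptools resolves the failure. A hedged sketch of the incompatibility, using a hypothetical stand-in class (not the actual wheel or setuptools source):

```python
class OldRequirement:
    """Stand-in for a pre-PEP-508 setuptools Requirement: no `url` attribute."""
    def __init__(self, name):
        self.name = name

def requires_to_requires_dist(requirement):
    # wheel's code accessed `requirement.url` directly, which raises
    # AttributeError on old setuptools objects; getattr() tolerates both
    # old and new requirement objects.
    url = getattr(requirement, "url", None)
    if url:
        return " @ " + url
    return ""

# Old-style object no longer crashes the metadata conversion.
print(repr(requires_to_requires_dist(OldRequirement("urllib3"))))  # ''
```

Note that the install succeeded anyway via the `setup.py install` fallback; only the wheel build failed.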
09-26-2018
03:42 AM
The first thing that came to my mind: do you have a valid Kerberos ticket on all the Spark/Talend nodes? If not, that is probably the root cause.
09-20-2018
12:10 PM
Regarding the 1st point, are you using Sentry? I am quite sure that, with the correct configuration, HDFS is not browsable from Hue if the owner/group does not match. Or maybe you have ACLs enabled on HDFS and the directory in your screenshot carries extra permissions. If ACLs are enabled, check them with `hdfs dfs -getfacl <path>`.
09-20-2018
12:05 PM
Try to use https://github.com/rcongiu/Hive-JSON-Serde#jsonserde---a-readwrite-serde-for-json-data
09-20-2018
12:02 PM
Hi, I have tested correlated and uncorrelated subqueries in Hive CDH 5.13, and you are right.
CORRELATED: First the data and the test in Impala:
[10.197.0.0:21000] > create table work.tbla ( id int, s string );
[10.197.0.0:21000] > create table work.tblb ( id int, s string );
[10.197.0.0:21000] > insert into work.tbla select 100, '2008-01-01' union select 200, '2009-12-01';
Modified 2 row(s) in 3.69s
[10.197.0.0:21000] > insert into work.tblb select 100, '2008-01-01' union select 200, '2009-12-01';
Modified 2 row(s) in 3.63s
[10.197.0.0:21000] > select * from work.tbla a where exists ( select null from work.tblb b where a.id = b.id and b.s > '2009-10-01' );
+-----+------------+
| id  | s          |
+-----+------------+
| 200 | 2009-12-01 |
+-----+------------+
Now in Hive:
0: jdbc:hive2://xxx.com> explain select * from work.tbla a where exists ( select null from work.tblb b where a.id = b.id and b.s > '2009-10-01' );
..
INFO : OK
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| sq_1:b |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| sq_1:b |
| TableScan |
| alias: b |
| filterExpr: ((s > '2009-10-01') and id is not null) (type: boolean) |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: ((s > '2009-10-01') and id is not null) (type: boolean) |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: id (type: int) |
| outputColumnNames: _col1 |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col1 (type: int) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 id (type: int) |
| 1 _col0 (type: int) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 id (type: int) |
| 1 _col0 (type: int) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
69 rows selected (0.311 seconds)
UNCORRELATED - Hive fails:
0: jdbc:hive2://xxxx.com> explain select * from work.tbla a where exists ( select null from work.tblb b where b.s > '2009-10-01' );
Error: Error while compiling statement: FAILED: SemanticException Line 1:65 Invalid SubQuery expression ''2009-10-01'' in definition of SubQuery sq_1 [
exists ( select null from work.tblb b where b.s > '2009-10-01' )
] used as sq_1 at Line 1:39: For Exists/Not Exists operator SubQuery must be Correlated. (state=42000,code=40000)
Impala is OK:
[10.197.0.0:21000] > select * from work.tbla a where exists ( select null from work.tblb b where b.s > '2009-10-01' );
+-----+------------+
| id | s |
+-----+------------+
| 100 | 2008-01-01 |
| 200 | 2009-12-01 |
+-----+------------+
Fetched 2 row(s) in 0.19s
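The semantic difference can be sketched in plain Python using the two test rows above (a toy model of EXISTS evaluation, not engine internals): a correlated predicate reads the outer row, an uncorrelated one ignores it, yet both are well-defined filters — which is why Impala accepts both while Hive in CDH 5.13 rejects the uncorrelated form.

```python
# The rows inserted into work.tbla and work.tblb above.
tbla = [(100, "2008-01-01"), (200, "2009-12-01")]
tblb = [(100, "2008-01-01"), (200, "2009-12-01")]

def exists_filter(outer, inner, pred):
    # Models: SELECT * FROM outer a WHERE EXISTS (SELECT null FROM inner b WHERE pred(a, b))
    return [o for o in outer if any(pred(o, i) for i in inner)]

# Correlated: the subquery predicate references the outer row (a.id = b.id).
correlated = exists_filter(tbla, tblb, lambda a, b: a[0] == b[0] and b[1] > "2009-10-01")
print(correlated)  # [(200, '2009-12-01')] -- matches the Impala result above

# Uncorrelated: the predicate ignores the outer row, so the subquery is
# either empty or non-empty for every outer row at once (all-or-nothing).
uncorrelated = exists_filter(tbla, tblb, lambda a, b: b[1] > "2009-10-01")
print(uncorrelated)  # both rows, matching Impala's 2-row result
```

The all-or-nothing behavior of the uncorrelated case also suggests a workaround for Hive: precompute whether the subquery returns any rows and branch on that, instead of using EXISTS.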
09-20-2018
11:49 AM
The explanation is probably on the last line of the error message. Try requesting an increase to the soft limit for the number of instances.