Member since: 07-01-2015
Posts: 460
Kudos Received: 78
Solutions: 43
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1753 | 11-26-2019 11:47 PM |
| | 1634 | 11-25-2019 11:44 AM |
| | 10734 | 08-07-2019 12:48 AM |
| | 2640 | 04-17-2019 03:09 AM |
| | 4349 | 02-18-2019 12:23 AM |
10-04-2018
07:11 AM
1 Kudo
`sudo pip install --upgrade setuptools` resolved the issue.
10-03-2018
01:38 AM
Hi Cloudera Impala community,
We have many join queries between Impala (HDFS) and Kudu datasets where a large Kudu table is joined with a small HDFS table. In many cases the filter on the HDFS dataset returns 0 rows, yet the query still scans all ~600 million records in Kudu.
For example, this query runs for 10+ minutes even though it is clear that the right side of the join, the broadcast dataset, is empty (zero rows). The optimizer could simply skip the Kudu scan and return an empty result set.
Query plan
Max Per-Host Resource Reservation: Memory=17.00MB
Per-Host Resource Estimates: Memory=193.00MB
F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
| Per-Host Resources: mem-estimate=0B mem-reservation=0B
PLAN-ROOT SINK
| mem-estimate=0B mem-reservation=0B
|
04:EXCHANGE [UNPARTITIONED]
| mem-estimate=0B mem-reservation=0B
| tuple-ids=0,1 row-size=1.62KB cardinality=88114
|
F00:PLAN FRAGMENT [RANDOM] hosts=9 instances=9
Per-Host Resources: mem-estimate=17.00MB mem-reservation=17.00MB
02:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: tab.key1 = pk.key1, tab.key2 = pk.key2, tab.key3 = pk.key3, tab.key4 = pk.key4, tab.root_key3 = pk.root_key3
| fk/pk conjuncts: tab.key1 = pk.key1, tab.key2 = pk.key2, tab.key3 = pk.key3, tab.key4 = pk.key4, tab.root_key3 = pk.root_key3
| mem-estimate=17.00MB mem-reservation=17.00MB spill-buffer=1.00MB
| tuple-ids=0,1 row-size=1.62KB cardinality=88114
|
|--03:EXCHANGE [BROADCAST]
| | mem-estimate=0B mem-reservation=0B
| | tuple-ids=1 row-size=159B cardinality=88114
| |
| F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
| Per-Host Resources: mem-estimate=176.00MB mem-reservation=0B
| 01:SCAN HDFS [work.pk_table_pk pk, RANDOM]
| partitions=1/1 files=1 size=2.51MB
| predicates: pk.rn = 1, pk.tr_code IN ('D')
| stats-rows=249225 extrapolated-rows=disabled
| table stats: rows=249225 size=2.51MB
| column stats: all
| parquet statistics predicates: pk.rn = 1, pk.tr_code IN ('D')
| parquet dictionary predicates: pk.rn = 1, pk.tr_code IN ('D')
| mem-estimate=176.00MB mem-reservation=0B
| tuple-ids=1 row-size=159B cardinality=88114
|
00:SCAN KUDU [base.large_table tab]
mem-estimate=0B mem-reservation=0B
tuple-ids=0 row-size=1.46KB cardinality=674357713
The query:
select *
from base.large_table tab join
work.pk_table_pk pk ON
tab.`key1` = pk.`key1` and
tab.`key2` = pk.`key2` and
tab.`key3` = pk.`key3` and
tab.`key4` = pk.`key4` and
tab.`root_key3` = pk.`root_key3`
where pk.tr_code in ('D') and pk.rn =1
This is running on CDH 5.13. Could this be optimized in a future version of Impala?
Thanks
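The short-circuit being requested can be sketched with a toy model of a broadcast hash join in plain Python (this is an illustration of the idea, not Impala internals): once the build side is known to be empty, an inner join is provably empty, so the large probe-side scan could be skipped entirely.

```python
def broadcast_inner_join(probe_rows, build_rows, keys):
    """Toy model of a broadcast inner hash join (not Impala internals)."""
    # Build phase: hash the small (broadcast) side on the join keys.
    build = {}
    for row in build_rows:
        build.setdefault(tuple(row[k] for k in keys), []).append(row)
    # The short-circuit this post asks for: with an empty build side,
    # an inner join returns no rows, so the large scan can be skipped.
    if not build:
        return []
    # Probe phase: stream the large side and look up matches.
    out = []
    for row in probe_rows:
        for match in build.get(tuple(row[k] for k in keys), []):
            out.append({**row, **match})
    return out

# The filtered HDFS side produced zero rows, so no probe rows survive.
print(broadcast_inner_join([{"key1": 1, "v": "x"}], [], ["key1"]))  # []
```

In the toy model the empty-build check costs nothing; in a real engine the build side must finish before the probe scan is scheduled, which is essentially what runtime filtering does in later Impala versions.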
Labels:
- Apache Impala
- Apache Kudu
- HDFS
10-01-2018
11:54 PM
Side question: what is the difference between a coordinator and an executor?
10-01-2018
11:41 PM
Cloud VM ami-3548444c plus yum -y update
10-01-2018
12:41 AM
Hi Cloudera employees,
could somebody please have a look at this error and explain why this fresh installation fails?
OS: CentOS Linux release 7.5.1804 (Core)
Thanks
sudo pip install cloudera-director-python-client
copying build/lib/cloudera/director/v9/models/warning_info.py -> build/bdist.linux-x86_64/wheel/cloudera/director/v9/models
running install_egg_info
Copying cloudera_director_python_client.egg-info to build/bdist.linux-x86_64/wheel/cloudera_director_python_client-6.0.0-py2.7.egg-info
running install_scripts
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-rWjXLN/cloudera-director-python-client/setup.py", line 47, in <module>
"""
File "/usr/lib64/python2.7/distutils/core.py", line 152, in setup
dist.run_commands()
File "/usr/lib64/python2.7/distutils/dist.py", line 953, in run_commands
self.run_command(cmd)
File "/usr/lib64/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/site-packages/wheel/bdist_wheel.py", line 237, in run
self.egg2dist(self.egginfo_dir, distinfo_dir)
File "/usr/lib/python2.7/site-packages/wheel/bdist_wheel.py", line 344, in egg2dist
pkg_info = pkginfo_to_metadata(egginfo_path, pkginfo_path)
File "/usr/lib/python2.7/site-packages/wheel/metadata.py", line 88, in pkginfo_to_metadata
for key, value in generate_requirements({extra: reqs}):
File "/usr/lib/python2.7/site-packages/wheel/metadata.py", line 67, in generate_requirements
for new_req in convert_requirements(depends):
File "/usr/lib/python2.7/site-packages/wheel/metadata.py", line 36, in convert_requirements
spec = requires_to_requires_dist(parsed_requirement)
File "/usr/lib/python2.7/site-packages/wheel/metadata.py", line 21, in requires_to_requires_dist
if requirement.url:
AttributeError: Requirement instance has no attribute 'url'
----------------------------------------
Failed building wheel for cloudera-director-python-client
Running setup.py clean for cloudera-director-python-client
Failed to build cloudera-director-python-client
Installing collected packages: urllib3, six, certifi, python-dateutil, cloudera-director-python-client
Found existing installation: urllib3 1.10.2
Uninstalling urllib3-1.10.2:
Successfully uninstalled urllib3-1.10.2
Found existing installation: six 1.9.0
Uninstalling six-1.9.0:
Successfully uninstalled six-1.9.0
Running setup.py install for cloudera-director-python-client ... done
Successfully installed certifi-2018.8.24 cloudera-director-python-client-6.0.0 python-dateutil-2.7.3 six-1.11.0 urllib3-1.23
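The traceback ends in wheel's `requires_to_requires_dist` doing `if requirement.url:`; the `Requirement` class in older setuptools releases has no `url` attribute, which is why upgrading setuptools resolves the failure. A hedged sketch of the incompatibility, using a hypothetical stand-in class (not the actual wheel or setuptools source):

```python
class OldRequirement:
    """Stand-in for a pre-PEP-508 setuptools Requirement: no `url` attribute."""
    def __init__(self, name):
        self.name = name

def requires_to_requires_dist(requirement):
    # wheel's code accessed `requirement.url` directly, which raises
    # AttributeError on old setuptools objects; getattr() tolerates both
    # old and new requirement objects.
    url = getattr(requirement, "url", None)
    if url:
        return " @ " + url
    return ""

# Old-style object no longer crashes the metadata conversion.
print(repr(requires_to_requires_dist(OldRequirement("urllib3"))))  # ''
```

Note that the install succeeded anyway via the `setup.py install` fallback; only the wheel build failed.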
09-26-2018
03:42 AM
The first thing that came to my mind: do you have a valid Kerberos ticket on all the Spark/Talend nodes? If not, that is probably the root cause.
09-20-2018
12:10 PM
Regarding the 1st point, are you using Sentry? I am quite sure that, with the correct configuration, HDFS is not browsable from Hue if the owner/group does not match. Or maybe you have ACLs enabled on HDFS and the directory in your screenshot carries extra permissions. If ACLs are enabled, check them with `hdfs dfs -getfacl <path>`.
09-20-2018
12:05 PM
Try to use https://github.com/rcongiu/Hive-JSON-Serde#jsonserde---a-readwrite-serde-for-json-data
09-20-2018
12:02 PM
Hi, I have tested correlated and uncorrelated subqueries in Hive CDH 5.13, and you are right.
CORRELATED: First the data and the test in Impala:
[10.197.0.0:21000] > create table work.tbla ( id int, s string );
[10.197.0.0:21000] > create table work.tblb ( id int, s string );
[10.197.0.0:21000] > insert into work.tbla select 100, '2008-01-01' union select 200, '2009-12-01';
Modified 2 row(s) in 3.69s
[10.197.0.0:21000] > insert into work.tblb select 100, '2008-01-01' union select 200, '2009-12-01';
Modified 2 row(s) in 3.63s
[10.197.0.0:21000] > select * from work.tbla a where exists ( select null from work.tblb b where a.id = b.id and b.s > '2009-10-01' );
+-----+------------+
| id  | s          |
+-----+------------+
| 200 | 2009-12-01 |
+-----+------------+
Now in Hive:
0: jdbc:hive2://xxx.com> explain select * from work.tbla a where exists ( select null from work.tblb b where a.id = b.id and b.s > '2009-10-01' );
..
INFO : OK
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-4 is a root stage |
| Stage-3 depends on stages: Stage-4 |
| Stage-0 depends on stages: Stage-3 |
| |
| STAGE PLANS: |
| Stage: Stage-4 |
| Map Reduce Local Work |
| Alias -> Map Local Tables: |
| sq_1:b |
| Fetch Operator |
| limit: -1 |
| Alias -> Map Local Operator Tree: |
| sq_1:b |
| TableScan |
| alias: b |
| filterExpr: ((s > '2009-10-01') and id is not null) (type: boolean) |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: ((s > '2009-10-01') and id is not null) (type: boolean) |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: id (type: int) |
| outputColumnNames: _col1 |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col1 (type: int) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| HashTable Sink Operator |
| keys: |
| 0 id (type: int) |
| 1 _col0 (type: int) |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: id is not null (type: boolean) |
| Statistics: Num rows: 1 Data size: 30 Basic stats: COMPLETE Column stats: NONE |
| Map Join Operator |
| condition map: |
| Left Semi Join 0 to 1 |
| keys: |
| 0 id (type: int) |
| 1 _col0 (type: int) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 33 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| Local Work: |
| Map Reduce Local Work |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
69 rows selected (0.311 seconds)
UNCORRELATED - Hive fails:
0: jdbc:hive2://xxxx.com> explain select * from work.tbla a where exists ( select null from work.tblb b where b.s > '2009-10-01' );
Error: Error while compiling statement: FAILED: SemanticException Line 1:65 Invalid SubQuery expression ''2009-10-01'' in definition of SubQuery sq_1 [
exists ( select null from work.tblb b where b.s > '2009-10-01' )
] used as sq_1 at Line 1:39: For Exists/Not Exists operator SubQuery must be Correlated. (state=42000,code=40000)
Impala is OK:
[10.197.0.0:21000] > select * from work.tbla a where exists ( select null from work.tblb b where b.s > '2009-10-01' );
+-----+------------+
| id | s |
+-----+------------+
| 100 | 2008-01-01 |
| 200 | 2009-12-01 |
+-----+------------+
Fetched 2 row(s) in 0.19s
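The semantic difference can be sketched in plain Python using the two test rows above (a toy model of EXISTS evaluation, not engine internals): a correlated predicate reads the outer row, an uncorrelated one ignores it, yet both are well-defined filters — which is why Impala accepts both while Hive in CDH 5.13 rejects the uncorrelated form.

```python
# The rows inserted into work.tbla and work.tblb above.
tbla = [(100, "2008-01-01"), (200, "2009-12-01")]
tblb = [(100, "2008-01-01"), (200, "2009-12-01")]

def exists_filter(outer, inner, pred):
    # Models: SELECT * FROM outer a WHERE EXISTS (SELECT null FROM inner b WHERE pred(a, b))
    return [o for o in outer if any(pred(o, i) for i in inner)]

# Correlated: the subquery predicate references the outer row (a.id = b.id).
correlated = exists_filter(tbla, tblb, lambda a, b: a[0] == b[0] and b[1] > "2009-10-01")
print(correlated)  # [(200, '2009-12-01')] -- matches the Impala result above

# Uncorrelated: the predicate ignores the outer row, so the subquery is
# either empty or non-empty for every outer row at once (all-or-nothing).
uncorrelated = exists_filter(tbla, tblb, lambda a, b: b[1] > "2009-10-01")
print(uncorrelated)  # both rows, matching Impala's 2-row result
```

The all-or-nothing behavior of the uncorrelated case also suggests a workaround for Hive: precompute whether the subquery returns any rows and branch on that, instead of using EXISTS.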
09-20-2018
11:49 AM
The explanation is probably on the last line of the error message. Try requesting an increase to the soft limit for the number of instances.