Member since
09-25-2016
34
Posts
1
Kudos Received
2
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
5857 | 08-24-2017 09:36 AM | |
2876 | 08-17-2017 08:57 AM |
06-19-2019
10:23 AM
We do have a load balancer in front of Impala. In our case the issue happened to be a cross-referenced VIP in a different data center, which was causing load on the metadata servers. Still, some capabilities for checking metadata status are either missing, undocumented, or maybe I'm just unaware of them. SYNC_DDL makes the DDL query extremely slow: the runtime goes from about 3 seconds to 5 minutes. We have a 30-node cluster and I am hoping SYNC_DDL doesn't mean sequential execution of the DDL (even that doesn't add up). Is there a way to identify which node needs a metadata refresh (or which impalad has invalid metadata, i.e. the time when its last metadata refresh occurred)?
06-11-2019
10:32 PM
Environment: CDH 5.15
Impala version: impalad version 2.12.0-cdh5.15.0 RELEASE
OS: CentOS 6.10
Table size: 88 TB
Partitions: 7K
Type: Parquet, file size compacted to 256 MB

We ingest data into the table partitions every minute and run refresh on the table to load the data. A separate compaction process runs every hour and merges the smaller files into big ones. The setup was working fine for months, but recently we have been running into a strange issue of inconsistent behavior between a few nodes. Randomly, some nodes appear to have inconsistent metadata: even though the refresh command ran successfully, some nodes still didn't have the correct files, so they referred to older files for those partitions. We tried invalidating metadata (followed by describe table to reload the metadata) but it didn't help. Even re-running refresh doesn't help all the time. We need some help/pointers to figure out the issue.

* Is there a way to check whether all Impala nodes have stale metadata?
* How can we fix the metadata for an individual node? Is there a command?
* Has anyone faced a similar issue? Can you share your experience and fix?
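One way to approach the first question, as a rough sketch only: connect to each impalad directly (bypassing the load balancer), run the same SHOW FILES statement for one partition, and compare the outputs. This assumes SHOW FILES reflects the metadata cached on the impalad you connect to. The script below only builds the impala-shell command lines; the host names, table name and partition are hypothetical.

```python
import shlex


def per_node_commands(hosts, table, partition_spec):
    """Build one impala-shell command per impalad so that each node's own
    view of a partition's files can be captured and compared."""
    query = f"SHOW FILES IN {table} PARTITION ({partition_spec})"
    return [
        # -i picks the impalad to connect to, -B prints plain delimited
        # output, -q runs a single query and exits.
        f"impala-shell -i {shlex.quote(host)} -B -q {shlex.quote(query)}"
        for host in hosts
    ]


if __name__ == "__main__":
    # Hypothetical hosts, table and partition; replace with real ones.
    hosts = ["impalad01:21000", "impalad02:21000"]
    for cmd in per_node_commands(hosts, "dwh.events", "yyyymmdd=20190611"):
        print(cmd)
```

Saving each command's output to a file per host and diffing the files would show which nodes still list the pre-compaction files.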
07-13-2018
11:02 AM
We have a requirement for low-latency data availability, so there is pressure to run this even more frequently, not less. Would it help if we allocated more memory to the catalog service or the statestore service?
07-12-2018
09:16 PM
We have a streaming application that writes Parquet files to HDFS, into the folder of a partitioned Impala table (partitioned by day and by one more custom integer, customer id). We need to run a refresh so that Impala becomes aware of the new files. The files are generated every minute and we run the refresh command every 2 minutes. https://www.cloudera.com/documentation/enterprise/latest/topics/impala_refresh.html

We have two options:
1) Run "REFRESH <table name>", or
2) Use "REFRESH <table name> PARTITION <partition spec>" (available from CDH 5.10 / 5.11 onwards), which refreshes a particular partition.

In terms of total time taken, "REFRESH <table name>" is very efficient: it takes ~20 seconds or so, vs 5-7 seconds for each partition using "REFRESH <table name> PARTITION <partition spec>". I'd like to ask the community, and especially the Impala team, what is recommended for a use case like ours: running 30 individual refreshes every minute, or running one? Or is there a third option that we don't know about?
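For the per-partition option, a minimal sketch of how the ingest job could generate one REFRESH statement for each partition it actually wrote to in the last interval. The table name, the partition column names (day, customer_id) and the values are assumptions for illustration; the statements would still need to be submitted through whatever Impala client the job uses.

```python
def refresh_statements(table, partitions):
    """Return one REFRESH ... PARTITION statement per touched partition.

    `partitions` is an iterable of dicts mapping partition column -> value.
    String values are quoted, numeric values are emitted as-is.
    """
    statements = []
    for spec in partitions:
        cols = ", ".join(
            f"{col}='{value}'" if isinstance(value, str) else f"{col}={value}"
            for col, value in spec.items()
        )
        statements.append(f"REFRESH {table} PARTITION ({cols})")
    return statements


if __name__ == "__main__":
    # Hypothetical partitions written during the last minute.
    touched = [
        {"day": 20180712, "customer_id": 42},
        {"day": 20180712, "customer_id": 43},
    ]
    for stmt in refresh_statements("dwh.events", touched):
        print(stmt)
```

Given the timings quoted above (5-7 seconds per partition vs ~20 seconds for the whole table), this only pays off when few partitions change per interval; with 30 changed partitions the single table-level refresh is cheaper.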
07-12-2018
04:11 PM
We have a few hundred users of our system, and we see serious performance degradation when the number of concurrent queries exceeds a mere 10 or so. I think we might be missing some configuration here, either in the config itself or in the way we're managing the cluster. AverageScannerThreadConcurrency seems to affect query performance seriously. The same query, which scans approximately 800 GB of data, runs fast without load but is super slow (20 min) under a bit of load (just a couple of other big queries running).

AverageScannerThreadConcurrency: 28.664101859697162 (fast)
AverageScannerThreadConcurrency: 1.5204863450230135 (slow)

Any suggestions on how we can influence scanner thread concurrency to improve the HDFS scan?
05-20-2018
01:08 PM
We're using Apache Apex for streaming and are experimenting with ingestion in order to move to Spark Streaming.
03-29-2018
10:17 PM
Thanks, but logs are not the issue here. I think the issue is the Time-Series Storage. The configuration attribute firehose_time_series_storage_bytes controls the disk usage, but the minimum value that can be set is 10 GB. Is there a way to override this value?
01-22-2018
07:32 PM
It looks like the disk space used by the Service Monitor and Host Monitor is huge by default: several GB. The data appears in the /var/lib/cloudera-service-monitor and /var/lib/cloudera-host-monitor directories. Most of the space was taken by the folders ts and type in both cases. Is there a way to configure these two services to use less space in development and test environments?
12-16-2017
10:25 PM
Oh... I might have missed something. Just wanted to make sure I am not missing anything.
1. Add --insert_inherit_permissions=true to the impalad safety valve
2. Set the partition directory permission to 764 (or whatever is required?)
3. Insert into the partition directory
4. Check the file permission; it should be the same as the folder's? 764?
12-06-2017
12:37 PM
Environment: CDH 5.12

When running an INSERT query on the table, all files are always owned by user impala with mode 744, i.e. all permissions for impala and read-only for everyone else. We have an externally running compaction process which needs to read/write/replace these files. Is there a way to change this default behavior so that the file permission is something other than 744? I'd prefer 764 (group read/write), so that we can add the user who runs the compaction process to the same group as impala.

I tried changing the Impala Daemon Environment Advanced Configuration Snippet (Safety Valve) property and added --insert_inherit_permissions=true. The upstream directory was 774 but the files created were still 744, so other users cannot write/edit those files.
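To make the modes in this post concrete, here is a small sketch that decodes the three octal values mentioned (744, 764, 774) and checks the group-write bit, which is the one the compaction user needs. It only uses Python's standard stat module and says nothing about how Impala itself assigns permissions.

```python
import stat


def describe(mode):
    """Render an octal permission mode for a regular file, ls-style."""
    return stat.filemode(stat.S_IFREG | mode)


def group_can_write(mode):
    """True if the group-write bit is set in the mode."""
    return bool(mode & stat.S_IWGRP)


if __name__ == "__main__":
    for mode in (0o744, 0o764, 0o774):
        print(oct(mode), describe(mode), "group write:", group_can_write(mode))
```

With 744 the group has read access only, which matches the observation that a second user in impala's group still cannot replace the files.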
11-22-2017
12:22 PM
Env : CDH 5.12
We're using Cloudera Manager to configure and use HUE.
We're using HUE to expose ad hoc querying of our cluster for occasional debugging with Impala and Hive. We would like to configure HUE with more editors so that it can access other data systems. The HUE documentation has guidance on how to add more editors, but we have trouble figuring out how to do this using Cloudera Manager. How can we translate this into Cloudera Manager managed HUE configs?
http://gethue.com/custom-sql-query-editors/
10-25-2017
12:04 PM
# Create a Parquet Impala table temp with a column a
# Write a Parquet file using a streaming application / MapReduce job call parquet schema for that
# Impala: select a from default.temp works and returns data
# Hive: select a from default.temp returns null, I think because it tries to reference the column name from the Parquet schema and it doesn't match.
Is there a way to force Hive to read the column name from the metastore instead of the Parquet schema?
09-27-2017
12:11 PM
I was finally able to figure this out:
- Upgrading to CDH 5.12 helped. Earlier we used 5.7, which appeared to have a bug with wide table scans.
- Sorting is an issue, but removing it has its own limitations. If both the sort and the limit clause are removed, it takes a long time to download the data, for the obvious reason that large data takes time to download. It does start streaming data instantly, though. Removing the sort but keeping the limit obviously makes the result unpredictable and inconsistent.
- When sorting is applied together with the limit and offset clauses, it appeared from the profile that each node sorts the data it has scanned and sends its top rows (100, i.e. limit 100) to the coordinator. The coordinator then collects all the top results and sends the top 100 to the client.
- Even though I added the partition key to the sort clause, the query appears not to use it, i.e. it still scans all records from all partitions before sending results to the coordinator. In other words, adding the partition key to the sort or removing it didn't make any difference. I am not sure, but Impala could have done something smarter here to make the query run faster.
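The behavior seen in the profile (each node keeps its own top 100, the coordinator merges them) can be illustrated with a toy model. This is not Impala's implementation, just a sketch of the same two-stage top-N idea with made-up data; it also shows why every node still has to look at all of its rows before it can report anything.

```python
import heapq
import itertools


def node_top_n(rows, n, key):
    """Stage 1, on each node: scan all local rows, keep the n smallest by key.
    The result comes back sorted ascending."""
    return heapq.nsmallest(n, rows, key=key)


def coordinator_top_n(per_node_results, n, key):
    """Stage 2, on the coordinator: merge the already-sorted partial results
    and keep the first n rows overall."""
    merged = heapq.merge(*per_node_results, key=key)
    return list(itertools.islice(merged, n))


if __name__ == "__main__":
    # Toy data: three "nodes", each holding a few timestamps as integers.
    nodes = [[5, 1, 9], [4, 8, 2], [7, 3, 6]]
    partials = [node_top_n(rows, 3, key=lambda r: r) for rows in nodes]
    print(coordinator_top_n(partials, 3, key=lambda r: r))
```

Only n rows per node cross the network, so the cost is dominated by the full scan in stage 1, which is consistent with the query staying slow regardless of the sort columns.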
09-17-2017
11:59 PM
It is a Parquet table, but it also has a lot of rows. If my query has to scan the last 6 months of data (2.5 billion rows) and I'm using an order by time clause (where time is a column of type timestamp), it takes 2 minutes (after I reduced the select list to retrieve only 5 columns). With all columns it takes 5+ minutes. I also tried reducing the columns to 200-300, but the query is still slow.
09-17-2017
10:23 PM
We have a requirement to select 100 rows from a table from a particular range of partitions. It's a wide table with 800 columns. One of the columns in the table is the precise timestamp of the record (a column of type TIMESTAMP). There is also a partition column for the day (yyyymmdd INT). Users often select a range of days and try to find the top 100 rows (ordered by the exact timestamp). If I run a query like

select * from table where yyyymmdd between now() and now()-3 months order by time top 100

it runs extremely slowly. Reducing the number of columns helps. I also tried adding the partition column yyyymmdd to the query's order by, thinking the query planner might use it to find results partition by partition and not wait for results from other partitions once it gets 100 rows from the first one. But I didn't see that working. Any tricks/tips to make this query faster? It is a wide table with 100s of millions of rows.
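The query above is written as shorthand: yyyymmdd is an INT, so it cannot be compared with now() directly, BETWEEN expects the lower bound first, and Impala uses LIMIT rather than top. A minimal sketch of building the concrete statement on the client side, assuming a 90-day window stands in for "3 months" and using a hypothetical table name:

```python
from datetime import date, timedelta


def yyyymmdd(d):
    """Encode a date as the integer partition key, e.g. 2017-09-17 -> 20170917."""
    return d.year * 10000 + d.month * 100 + d.day


def top_n_query(table, end, days_back, limit=100):
    """Build a top-N query whose partition predicate uses plain integer
    literals, lower bound first."""
    start = end - timedelta(days=days_back)
    return (
        f"SELECT * FROM {table} "
        f"WHERE yyyymmdd BETWEEN {yyyymmdd(start)} AND {yyyymmdd(end)} "
        f"ORDER BY time LIMIT {limit}"
    )


if __name__ == "__main__":
    # Hypothetical table name; the end date is the date of this post.
    print(top_n_query("dwh.wide_table", date(2017, 9, 17), 90))
```

Constant integer bounds on the partition column give the planner the simplest possible predicate for partition pruning; whether that was the bottleneck here is not established in the thread.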
09-15-2017
04:28 PM
Has anyone successfully installed/run impala-shell on a Mac? I've copied impala-shell from /usr/bin and the impala-shell directory from /usr/lib to my home directory. When I try to launch impala-shell I get the error below. I tried pip install sasl but it didn't solve the problem. I'm running Python 2.7.11 on my machine.

macofsunil:bin sparmar$ ./impala-shell
Traceback (most recent call last):
  File "/Users/sparmar/bin/../lib/impala-shell/impala_shell.py", line 34, in <module>
    from impala_client import (ImpalaClient, DisconnectedException, QueryStateException,
  File "/Users/sparmar/lib/impala-shell/lib/impala_client.py", line 16, in <module>
    import sasl
  File "/Users/sparmar/lib/impala-shell/ext-py/sasl-0.1.1-py2.6-linux-x86_64.egg/sasl/__init__.py", line 1, in <module>
  File "/Users/sparmar/lib/impala-shell/ext-py/sasl-0.1.1-py2.6-linux-x86_64.egg/sasl/saslwrapper.py", line 7, in <module>
  File "/Users/sparmar/lib/impala-shell/ext-py/sasl-0.1.1-py2.6-linux-x86_64.egg/_saslwrapper.py", line 7, in <module>
  File "/Users/sparmar/lib/impala-shell/ext-py/sasl-0.1.1-py2.6-linux-x86_64.egg/_saslwrapper.py", line 6, in __bootstrap__
ImportError: dlopen(/Users/sparmar/.python-eggs/sasl-0.1.1-py2.6-linux-x86_64.egg-tmp/_saslwrapper.so, 2): no suitable image found. Did find:
  /Users/sparmar/.python-eggs/sasl-0.1.1-py2.6-linux-x86_64.egg-tmp/_saslwrapper.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
  /Users/sparmar/.python-eggs/sasl-0.1.1-py2.6-linux-x86_64.egg-tmp/_saslwrapper.so: unknown file type, first eight bytes: 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00
macof
09-15-2017
09:43 AM
Finding logs manually on the machine sounds very brute force; I was thinking more of an API or CLI option to find the logs. Anyway, the main issue we're trying to solve is giving all developers access to logs in the prod environment. Our node managers are locked down and not accessible (on any port or via the web) to developers, and that is unlikely to change. So we're trying to find a way to proxy the logs. I discovered that there is a jobhistory proxy for looking at completed jobs / YARN apps, but I couldn't get it working for a running app. Is there any trick / way to access a running app's logs like the URL below?

http://resourcemanager.xyz.com:19888/jobhistory/logs//dataNode.com:8041/container_id_000001/container_id_000001/root
09-11-2017
09:57 PM
Is there a YARN API or command that returns the path of the YARN logs on disk for a given container and application id? I also want to add that we don't have log aggregation working, and I'm particularly looking for a direct physical path to the file, not the web interface. Thanks, Sunil
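In the absence of an API, the path can usually be derived by hand. A sketch, assuming the NodeManager's usual layout in which each directory listed in yarn.nodemanager.log-dirs contains an application_<id>/container_<id> subdirectory holding stdout, stderr and syslog. The directory values and ids below are hypothetical; the real value of yarn.nodemanager.log-dirs has to be read from the cluster's yarn-site.xml.

```python
import posixpath


def container_log_dirs(log_dirs, application_id, container_id):
    """Expand the comma-separated yarn.nodemanager.log-dirs value into the
    candidate on-disk directories for one container's logs."""
    return [
        posixpath.join(d.strip(), application_id, container_id)
        for d in log_dirs.split(",")
        if d.strip()
    ]


if __name__ == "__main__":
    # Hypothetical config value and ids.
    dirs = container_log_dirs(
        "/var/log/hadoop-yarn/container, /data/2/yarn/logs",
        "application_1505000000000_0001",
        "container_1505000000000_0001_01_000001",
    )
    for d in dirs:
        print(d)
```

The directories only exist on the host where the container ran, so the host still has to be looked up first, for example from the application's container list.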
09-08-2017
04:17 PM
We're also facing the same issue, and any pointers would be useful. The issue is that by default a DataFrame assigns null values to non-existing fields. The problem is that there could be a valid use case where an upsert statement actually wants to update the value of a column to null, i.e. delete the value. So I think the issue is not with KuduContext but with DataFrame. I'm a Spark newbie; is there a way to control how the DataFrame is created?
08-24-2017
03:53 PM
I have a table with two columns, test and id; yyyymmdd and group_id are my partition columns. When I run the following query it runs fast, as it only scans 1 partition.

select id, yyyymmdd, group_id, test from dwh.table where (id='1a' and yyyymmdd=20170815 and group_id=1)

But when I run the following one, it scans the entire table. Even explain shows that it is going to perform a full table scan. I think I'm missing something simple here. I'd appreciate the community's help in finding out what I'm missing.

select id, yyyymmdd, group_id, test from dwh.table where (id='1a' and yyyymmdd=20170815 and group_id=1) OR (id='2b' and yyyymmdd=20170811 and group_id=2)

How can I scan two rows in two different partitions?
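One workaround to try, not confirmed anywhere in this thread: since the single-partition form is already known to prune, issue one such branch per key and combine the branches with UNION ALL. The sketch below just generates that statement from a list of keys; running explain on the result would show whether each branch scans only its own partition.

```python
def _literal(value):
    """Quote strings, leave numbers as-is."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def union_all_lookup(table, columns, keys):
    """Build one single-partition SELECT per key and join them with UNION ALL.

    `keys` is a list of dicts mapping column -> value, one dict per row wanted.
    """
    select = f"SELECT {', '.join(columns)} FROM {table}"
    branches = []
    for key in keys:
        predicate = " AND ".join(f"{col}={_literal(val)}" for col, val in key.items())
        branches.append(f"{select} WHERE {predicate}")
    return " UNION ALL ".join(branches)


if __name__ == "__main__":
    print(union_all_lookup(
        "dwh.table",
        ["id", "yyyymmdd", "group_id", "test"],
        [
            {"id": "1a", "yyyymmdd": 20170815, "group_id": 1},
            {"id": "2b", "yyyymmdd": 20170811, "group_id": 2},
        ],
    ))
```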
08-24-2017
09:36 AM
Using Cloudera Manager, go to Sentry -> Configuration. Add users/groups to the following properties to allow them to create/show roles. Each entry gives the display name of the property in CM, followed by the property name in the configuration file.

Admin Groups: sentry.service.admin.group
Allowed Connecting Users: sentry.service.allow.connect
08-18-2017
12:31 AM
I'm using the Sentry service through Cloudera Manager. I just realized that I can add other users/groups to the Sentry config in Cloudera Manager and allow them to run Grant / Create role commands.
08-17-2017
08:57 AM
Actually, I figured it out. I had to configure Impala to allow user ldaptest to impersonate user cloudera (the HUE login). I appended this to the Cloudera Manager property Proxy User Configuration (authorized_proxy_user_config):

hue=*;ldaptest=cloudera

So user hue can impersonate anyone and user 'ldaptest' can impersonate 'cloudera'.
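To make the semantics of that value explicit, here is a small sketch that parses it and answers "may X impersonate Y?". It is only a model of the format as described in this post (entries separated by ';', each of the form proxy=delegates, '*' meaning anyone); treating the delegates as a comma-separated list is an assumption, and this is not Impala's own parsing code.

```python
def parse_proxy_config(value):
    """Parse 'hue=*;ldaptest=cloudera' into {proxy_user: {allowed delegates}}."""
    rules = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        proxy, _, delegates = entry.partition("=")
        # Assumption: several delegates may be listed, separated by commas.
        rules[proxy.strip()] = {d.strip() for d in delegates.split(",") if d.strip()}
    return rules


def may_impersonate(rules, proxy, target):
    """True if `proxy` is allowed to act on behalf of `target`."""
    allowed = rules.get(proxy, set())
    return "*" in allowed or target in allowed


if __name__ == "__main__":
    rules = parse_proxy_config("hue=*;ldaptest=cloudera")
    print(may_impersonate(rules, "ldaptest", "cloudera"))
    print(may_impersonate(rules, "ldaptest", "sunil"))
```

This mirrors the original error: before the change there was no rule for ldaptest, so the request to act as cloudera was rejected.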
08-16-2017
05:20 PM
Impala appears to change my view definition without any warning. As a result, running the query directly and querying the view give different results. Is this a bug? It's creating a lot of problems for us; is there any workaround?

create view test1 as select * from( select request_id, date_time, first_value(created_user ignore nulls) over (partition by request_id order by date_time desc rows between unbounded preceding and unbounded following) as created_user, row_number() over (partition by request_id order by date_time desc) as rn from dwh.event_update ) t where rn=1

Run show create view and the 'ignore nulls' clause is missing.

Query: show create view test1
CREATE VIEW dwh.test1 AS SELECT * FROM (SELECT request_id, date_time, first_value(created_user) OVER (PARTITION BY request_id ORDER BY date_time DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) created_user, row_number() OVER (PARTITION BY request_id ORDER BY date_time DESC) rn FROM dwh.event_update) t WHERE rn = 1
08-16-2017
05:12 PM
Yeah, I tried this URL on both versions; it fails with an exception.

"jdbc:impala:/cluster.threatmetrix.com:21050/dwh;AuthMech=2";

java.sql.SQLNonTransientConnectionException: [Simba][JDBC](10100) Connection Refused: [Simba][JDBC](11640) Required Connection Key(s): UID; [Simba][JDBC](11480) Optional Connection Key(s): AsyncExecPollInterval, CatalogSchemaSwitch, DefaultStringColumnLength, DelegationUID, LowerCaseResultSetColumnName, OptimizedInsert, PreparedMetaLimitZero, RowsFetchedPerBlock, SocketTimeOut, ssl, StripCatalogName, SupportTimeOnlyTimestamp, UseCustomTypeCoercionMap, UseNativeQuery, UseSasl
  at com.cloudera.exceptions.ExceptionConverter.toSQLException(Unknown Source)
  at com.cloudera.jdbc.common.BaseConnectionFactory.checkResponseMap(Unknown Source)
  at com.cloudera.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
  at com.cloudera.jdbc.common.AbstractDriver.connect(Unknown Source)
  at java.sql.DriverManager.getConnection(DriverManager.java:664)
  at java.sql.DriverManager.getConnection(DriverManager.java:270)
  at impala_jdbc_test.ImpalaJDBCTestBud.main(ImpalaJDBCTestBud.java:59)
Exception in thread "main" java.lang.NullPointerException
08-16-2017
03:21 PM
Actually, I tried #1 with CDH 5.7 and it didn't work either, but #2 worked with 5.7 and stopped working on 5.12.
08-16-2017
02:39 PM
Environment: CDH 5.12, OpenLDAP

We've enabled LDAP auth on Impala and it's working fine except in HUE. When I try to launch the HUE Impala editor it fails with the error below in the GUI. We have configured the safety valve in HUE with this:

[desktop]
ldap_username=ldaptest
ldap_password=ldaptest

I'm logging into HUE as user cloudera (FYI: we don't have LDAP enabled on HUE; cloudera is just a user managed within HUE).

User 'ldaptest' is not authorized to delegate to 'cloudera'.

Bad status for request TOpenSessionReq(username='hue', password=None, client_protocol=6, configuration={'idle_session_timeout': '3600', 'impala.doas.user': u'cloudera'}): TOpenSessionResp(status=TStatus(errorCode=None, errorMessage="User 'ldaptest' is not authorized to delegate to 'cloudera'.\n", sqlState='HY000', infoMessages=None, statusCode=3), sessionHandle=TSessionHandle(sessionId=THandleIdentifier(secret='\x06\xd1\xc8\xe5\xd2\xc1Ck\xbd\xc7\xc5\xdb\xc5\x12\xdb\x8b', guid='*QiZ\xb0\xc7H\x0f\x8c5\xec\x14\xdf*7H')), configuration=None, serverProtocolVersion=5)

How can I enable user ldaptest to delegate to cloudera?
08-11-2017
11:13 PM
We're blocked here. Is there a way to make users other than impala and hive a role admin, i.e. grant them access to show and create roles?
08-10-2017
11:37 PM
Even though the user has ALL privileges with the grant option set to true, they cannot create/show roles. How do I create a role / assign the privilege to create/show roles to a user/group?

My setup: CDH 5.12, Impala with Sentry (service) enabled.

[myserver.com:21000] > version;
Shell version: Impala Shell v2.9.0-cdh5.12.0 (03c6ddb) built on Thu Jun 29 04:17:31 PDT 2017
Server version: impalad version 2.9.0-cdh5.12.0 RELEASE (build 03c6ddbdcec39238be4f5b14a300d5c4f576097e)

Roles and users setup:

[myserver.com:21000] > show grant role admin;
Query: show grant role admin
+--------+----------+-------+--------+-----+-----------+--------------+-------------------------------+
| scope  | database | table | column | uri | privilege | grant_option | create_time                   |
+--------+----------+-------+--------+-----+-----------+--------------+-------------------------------+
| SERVER |          |       |        |     | ALL       | true         | Fri, Aug 11 2017 05:55:28.694 |
+--------+----------+-------+--------+-----+-----------+--------------+-------------------------------+
Fetched 1 row(s) in 0.01s

[myserver.com:21000] > show current roles;
Query: show current roles
+--------------+
| role_name    |
+--------------+
| admin        |
+--------------+
Fetched 1 row(s) in 0.01s

Exception when the user tries to run show roles or create role:

[myserver.com:21000] > show roles;
Query: show roles
ERROR: AuthorizationException: User 'sunil' does not have privileges to access the requested policy metadata or Sentry Service is unavailable.
08-07-2017
09:42 AM
Anyone? In a nutshell: CDH 5.12, Impala with Sentry (service) enabled, Impala JDBC Driver 3.5.38 (latest). AuthMech=2 hangs getConnection on the client. Without AuthMech, the server log complains about an empty user.