Member since
04-07-2016
36
Posts
4
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
14173 | 08-01-2016 11:30 AM |
03-23-2020
07:47 PM
Hi, I am using CDH 6.3.2, and I am currently implementing a job that syncs a folder from HDFS to S3 daily. This folder can contain new or modified files, but the -update option doesn't seem to be working: all the files in my "test" folder are getting rewritten every time. For example, if I run this command once: hadoop distcp -update /user/maurin/test s3a://test_bucket/test/ ERROR: Tools helper /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/hadoop/libexec//tools/hadoop-distcp.sh was not found. 20/03/23 19:36:37 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties 20/03/23 19:36:37 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 20/03/23 19:36:37 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started 20/03/23 19:36:39 INFO Configuration.deprecation: fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key 20/03/23 19:36:40 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/user/maurin/test], targetPath=s3a://test_bucket/test, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192, verboseLog=false}, sourcePaths=[/user/maurin/test], targetPathExists=true, preserveRawXattrsfalse 20/03/23 19:36:42 INFO hdfs.DFSClient: Created token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017402455, maxDate=1585622202455, sequenceNumber=32271, masterKeyId=886 on ha-hdfs:nameservice1 20/03/23 19:36:42 INFO security.TokenCache: Got dt for hdfs://nameservice1; 
Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017402455, maxDate=1585622202455, sequenceNumber=32271, masterKeyId=886) 20/03/23 19:36:42 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 4; dirCnt = 1 20/03/23 19:36:42 INFO tools.SimpleCopyListing: Build file listing completed. 20/03/23 19:36:42 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb 20/03/23 19:36:42 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor 20/03/23 19:36:42 INFO tools.DistCp: Number of paths in the copy list: 4 20/03/23 19:36:42 INFO tools.DistCp: Number of paths in the copy list: 4 20/03/23 19:36:43 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm756 20/03/23 19:36:43 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/maurin/.staging/job_1584390517558_0074 20/03/23 19:36:43 INFO mapreduce.JobSubmitter: number of splits:3 20/03/23 19:36:43 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address 20/03/23 19:36:43 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled 20/03/23 19:36:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1584390517558_0074 20/03/23 19:36:43 INFO mapreduce.JobSubmitter: Executing with tokens: [Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017402455, maxDate=1585622202455, sequenceNumber=32271, masterKeyId=886)] 20/03/23 19:36:43 INFO conf.Configuration: resource-types.xml not found 20/03/23 19:36:43 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 
20/03/23 19:36:44 INFO impl.YarnClientImpl: Submitted application application_1584390517558_0074 20/03/23 19:36:44 INFO mapreduce.Job: The url to track the job: http://cdhmaster3.net.cuberonlabs.com:8088/proxy/application_1584390517558_0074/ 20/03/23 19:36:44 INFO tools.DistCp: DistCp job-id: job_1584390517558_0074 20/03/23 19:36:44 INFO mapreduce.Job: Running job: job_1584390517558_0074 20/03/23 19:36:52 INFO mapreduce.Job: Job job_1584390517558_0074 running in uber mode : false 20/03/23 19:36:52 INFO mapreduce.Job: map 0% reduce 0% 20/03/23 19:37:11 INFO mapreduce.Job: map 84% reduce 0% 20/03/23 19:37:13 INFO mapreduce.Job: map 100% reduce 0% 20/03/23 19:37:22 INFO mapreduce.Job: Job job_1584390517558_0074 completed successfully 20/03/23 19:37:22 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=694053 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=1656 HDFS: Number of bytes written=0 HDFS: Number of read operations=35 HDFS: Number of large read operations=0 HDFS: Number of write operations=6 HDFS: Number of bytes read erasure-coded=0 S3A: Number of bytes read=0 S3A: Number of bytes written=4 S3A: Number of read operations=44 S3A: Number of large read operations=0 S3A: Number of write operations=33 Job Counters Launched map tasks=3 Other local map tasks=3 Total time spent by all maps in occupied slots (ms)=268400 Total time spent by all reduces in occupied slots (ms)=0 Total time spent by all map tasks (ms)=53680 Total vcore-milliseconds taken by all map tasks=429440 Total megabyte-milliseconds taken by all map tasks=274841600 Map-Reduce Framework Map input records=4 Map output records=0 Input split bytes=354 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=364 CPU time spent (ms)=19010 Physical memory (bytes) snapshot=1625214976 Virtual memory (bytes) snapshot=18979409920 Total 
committed heap usage (bytes)=6963068928 Peak Map Physical memory (bytes)=556732416 Peak Map Virtual memory (bytes)=6332137472 File Input Format Counters Bytes Read=1298 File Output Format Counters Bytes Written=0 DistCp Counters Bandwidth in Btyes=0 Bytes Copied=4 Bytes Expected=4 Files Copied=3 DIR_COPY=1 20/03/23 19:37:22 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system... 20/03/23 19:37:22 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped. 20/03/23 19:37:22 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete. We can see that it copied 3 files. Then If I trigger it again: distcp -update /user/maurin/test s3a://test_bucket/test/ ERROR: Tools helper /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/bin/../lib/hadoop/libexec//tools/hadoop-distcp.sh was not found. 20/03/23 19:41:38 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties 20/03/23 19:41:38 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 20/03/23 19:41:38 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started 20/03/23 19:41:41 INFO Configuration.deprecation: fs.s3a.server-side-encryption-key is deprecated. 
Instead, use fs.s3a.server-side-encryption.key 20/03/23 19:41:41 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=0.0, copyStrategy='uniformsize', preserveStatus=[BLOCKSIZE], atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/user/maurin/test], targetPath=s3a://test_bucket/test, filtersFile='null', blocksPerChunk=0, copyBufferSize=8192, verboseLog=false}, sourcePaths=[/user/maurin/test], targetPathExists=true, preserveRawXattrsfalse 20/03/23 19:41:43 INFO hdfs.DFSClient: Created token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017703760, maxDate=1585622503760, sequenceNumber=32272, masterKeyId=886 on ha-hdfs:nameservice1 20/03/23 19:41:43 INFO security.TokenCache: Got dt for hdfs://nameservice1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017703760, maxDate=1585622503760, sequenceNumber=32272, masterKeyId=886) 20/03/23 19:41:44 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 4; dirCnt = 1 20/03/23 19:41:44 INFO tools.SimpleCopyListing: Build file listing completed. 20/03/23 19:41:44 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb 20/03/23 19:41:44 INFO Configuration.deprecation: io.sort.factor is deprecated. 
Instead, use mapreduce.task.io.sort.factor 20/03/23 19:41:44 INFO tools.DistCp: Number of paths in the copy list: 4 20/03/23 19:41:44 INFO tools.DistCp: Number of paths in the copy list: 4 20/03/23 19:41:44 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm756 20/03/23 19:41:44 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/maurin/.staging/job_1584390517558_0075 20/03/23 19:41:44 INFO mapreduce.JobSubmitter: number of splits:2 20/03/23 19:41:44 INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address 20/03/23 19:41:44 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled 20/03/23 19:41:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1584390517558_0075 20/03/23 19:41:45 INFO mapreduce.JobSubmitter: Executing with tokens: [Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for maurin: HDFS_DELEGATION_TOKEN owner=maurin/lore_staff@net.getlore.io, renewer=yarn, realUser=, issueDate=1585017703760, maxDate=1585622503760, sequenceNumber=32272, masterKeyId=886)] 20/03/23 19:41:45 INFO conf.Configuration: resource-types.xml not found 20/03/23 19:41:45 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 
20/03/23 19:41:45 INFO impl.YarnClientImpl: Submitted application application_1584390517558_0075 20/03/23 19:41:45 INFO mapreduce.Job: The url to track the job: http://cdhmaster3.net.cuberonlabs.com:8088/proxy/application_1584390517558_0075/ 20/03/23 19:41:45 INFO tools.DistCp: DistCp job-id: job_1584390517558_0075 20/03/23 19:41:45 INFO mapreduce.Job: Running job: job_1584390517558_0075 20/03/23 19:41:55 INFO mapreduce.Job: Job job_1584390517558_0075 running in uber mode : false 20/03/23 19:41:55 INFO mapreduce.Job: map 0% reduce 0% 20/03/23 19:42:14 INFO mapreduce.Job: map 65% reduce 0% 20/03/23 19:42:24 INFO mapreduce.Job: map 100% reduce 0% 20/03/23 19:42:33 INFO mapreduce.Job: Job job_1584390517558_0075 completed successfully 20/03/23 19:42:33 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=0 FILE: Number of bytes written=462696 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=1259 HDFS: Number of bytes written=0 HDFS: Number of read operations=27 HDFS: Number of large read operations=0 HDFS: Number of write operations=4 HDFS: Number of bytes read erasure-coded=0 S3A: Number of bytes read=0 S3A: Number of bytes written=4 S3A: Number of read operations=43 S3A: Number of large read operations=0 S3A: Number of write operations=21 Job Counters Launched map tasks=2 Other local map tasks=2 Total time spent by all maps in occupied slots (ms)=217900 Total time spent by all reduces in occupied slots (ms)=0 Total time spent by all map tasks (ms)=43580 Total vcore-milliseconds taken by all map tasks=348640 Total megabyte-milliseconds taken by all map tasks=223129600 Map-Reduce Framework Map input records=4 Map output records=0 Input split bytes=234 Spilled Records=0 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=323 CPU time spent (ms)=14370 Physical memory (bytes) snapshot=1128972288 Virtual memory (bytes) snapshot=12650110976 Total 
committed heap usage (bytes)=4581752832 Peak Map Physical memory (bytes)=569790464 Peak Map Virtual memory (bytes)=6332850176 File Input Format Counters Bytes Read=1021 File Output Format Counters Bytes Written=0 DistCp Counters Bandwidth in Btyes=0 Bytes Copied=4 Bytes Expected=4 Files Copied=3 DIR_COPY=1 20/03/23 19:42:33 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system... 20/03/23 19:42:33 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped. 20/03/23 19:42:33 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete. We can see that it did the same operation again: Files Copied=3. Am I missing anything to copy only the newly created files and replace the modified ones? thanks
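For context, here is a simplified sketch (in Python, not the actual Hadoop code, which lives in CopyMapper/DistCpUtils) of how -update decides whether to skip a file: the target must exist, sizes must match, and, unless -skipcrccheck is set, checksums must match too. Since HDFS exposes CRC-based checksums and S3A does not expose comparable ones, the checksum test can fail even for identical files, which would explain the recopies. This is an assumption about the cause, not a verified diagnosis:

```python
def should_copy(src_size, dst_size, src_checksum, dst_checksum, skip_crc=False):
    """Simplified sketch of distcp -update's skip decision (hypothetical helper)."""
    if dst_size is None:          # target file missing: always copy
        return True
    if src_size != dst_size:      # size changed: copy
        return True
    if skip_crc:                  # sizes match and -skipcrccheck given: skip
        return False
    # HDFS CRC vs. S3A: checksums are not comparable, so this tends to differ
    return src_checksum != dst_checksum

# Incomparable checksums: file is recopied even though sizes match.
print(should_copy(4, 4, "hdfs-crc", None))                  # True
# With -skipcrccheck, size becomes the only criterion: file is skipped.
print(should_copy(4, 4, "hdfs-crc", None, skip_crc=True))   # False
```

If this model is right, adding -skipcrccheck alongside -update would make DistCp compare sizes only when syncing to s3a.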
... View more
06-18-2018
01:19 PM
Hi, When doing a "show create table TableX", the column names are not escaped with "`", so if a column uses a reserved keyword ("comment" for example), we can't just copy-paste the returned query into another database. Btw, I was thinking about opening a JIRA bug for this instead of posting it here, but I wasn't sure about the preferred way for you guys... let me know what the preferred process is for this kind of problem. thanks
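To illustrate the fix being asked for, a minimal sketch of identifier quoting (a hypothetical helper, not Hive/Impala code; it assumes backquotes inside an identifier are escaped by doubling them):

```python
def quote_ident(name):
    """Wrap an identifier in backquotes, doubling any embedded backquote,
    so reserved words like `comment` survive a copy-paste round trip."""
    return "`" + name.replace("`", "``") + "`"

# Emitting DDL with every column quoted keeps reserved keywords safe:
cols = [("id", "INT"), ("comment", "STRING")]
ddl = "CREATE TABLE t (" + ", ".join(
    f"{quote_ident(c)} {t}" for c, t in cols) + ")"
print(ddl)  # CREATE TABLE t (`id` INT, `comment` STRING)
```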
... View more
03-05-2018
04:04 PM
Hi, Impyla connects using Kerberos; we are not using LDAP. I have configured the load balancer as stated in the docs, but I still get the same error. thanks
... View more
02-21-2018
03:48 PM
Hi, I have an Impala cluster with Kerberos and HAProxy, and everything works fine when I connect using impyla. But when I do (after a kinit) impala-shell -k and then connect myHaproxy:21051; I get: Error: Unable to communicate with impalad service. This service may not be an impalad instance. Check host:port and try again.
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/bin/../lib/impala-shell/impala_shell.py", line 1554, in <module>
shell.cmdloop(intro)
File "/usr/lib/python2.7/cmd.py", line 142, in cmdloop
stop = self.onecmd(line)
File "/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/bin/../lib/impala-shell/impala_shell.py", line 563, in onecmd
return cmd.Cmd.onecmd(self, line)
File "/usr/lib/python2.7/cmd.py", line 221, in onecmd
return func(arg)
File "/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/bin/../lib/impala-shell/impala_shell.py", line 717, in do_connect
self._connect()
File "/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/bin/../lib/impala-shell/impala_shell.py", line 764, in _connect
result = self.imp_client.connect()
File "/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/impala-shell/lib/impala_client.py", line 245, in connect
result = self.ping_impala_service()
File "/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/impala-shell/lib/impala_client.py", line 250, in ping_impala_service
return self.imp_service.PingImpalaService()
File "/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/impala-shell/gen-py/ImpalaService/ImpalaService.py", line 223, in PingImpalaService
return self.recv_PingImpalaService()
File "/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/impala-shell/gen-py/ImpalaService/ImpalaService.py", line 238, in recv_PingImpalaService
raise x
thrift.Thrift.TApplicationException: Invalid method name: 'PingImpalaService'
Any idea why? thanks
... View more
12-20-2017
12:27 PM
Hi, I am trying to parse a date in Impala using the unix_timestamp function: select unix_timestamp('Thu Dec 17 15:55:08 IST 2015', 'EEE MMM d HH:mm:ss ZZZ yyyy') But I get the following error: Bad date/time conversion format: EEE MMM d HH:mm:ss ZZZ yyyy The same query works in Hive. Is there something I need to change in the pattern to make it work in Impala? I am currently on CDH 5.12. thanks
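As a workaround sketch, assuming the parsing can happen outside Impala (e.g. in an ingest script), the tokens the error complains about (the day-of-week and the timezone abbreviation) can be stripped before parsing what remains:

```python
import re
from datetime import datetime

s = "Thu Dec 17 15:55:08 IST 2015"
# Drop the day-of-week token and the timezone abbreviation, keeping
# month, day, time, and year, which a plain strptime pattern can handle.
m = re.match(r"\w{3} (\w{3} \d{1,2} \d{2}:\d{2}:\d{2}) \w{3,4} (\d{4})", s)
dt = datetime.strptime(m.group(1) + " " + m.group(2), "%b %d %H:%M:%S %Y")
print(dt.isoformat())  # 2015-12-17T15:55:08
```

Note this discards the timezone entirely; if the offset matters, it would need to be applied separately.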
... View more
05-15-2017
09:15 PM
Hi Alex, Thanks for your answer. I am not a huge fan of giving the exact column names, but I can tell you that both tables are partitioned by date in string format "YYYY-MM-DD" and have 1798 columns, of which 1795 are strings and 3 are ints. I hope that helps. thanks
... View more
05-15-2017
12:07 PM
Hi, I created a view on top of a table and it gives me really bad performance. It takes about 2 min to "Plan" the query on the view, whereas it only takes 55 ms on the raw table. The view is really just a select * from table_a union select * from table_b, where table a and b have the same definition and are both Parquet tables... Slow query: https://gist.github.com/anonymous/0af6f862a8069ee7b933deafe9cc0880 Fast query: https://gist.github.com/anonymous/2ef16a696527acd86795c3080edf3e91 Any idea what is happening? thanks
... View more
04-21-2017
01:32 AM
thanks! If you open a JIRA, can you send me the link? I will probably disable codegen for now and wait until you push a fix to re-enable it. thanks
... View more
04-20-2017
11:10 PM
It seems to be coming from Avro. I created the table as Parquet and it took 0.48 sec. The table has about 900 columns, so nothing too fancy. thanks
... View more
04-20-2017
09:56 PM
It is a string that looks like "YYYY-MM-DD"; the table is stored as Avro. I can try using Parquet or text if you want
... View more
04-20-2017
09:08 PM
Hi, using Impala 2.7(8) with CDH 5.10.1 here. I am trying a simple query: `select distinct(date_col_partition) from table_1` and it is taking 20 sec. But when I do a set DISABLE_CODEGEN=true; it takes less than a second. Here is the profile gist: https://gist.github.com/anonymous/1a5faa3a10d4495f7b8abc3c964457db Any idea what is going wrong? thanks
... View more
03-20-2017
11:53 AM
"It's an unfortunate limitation that we are working on addressing" — I would greatly appreciate it if you could :). With a lambda architecture, most of my queries hit a "union" view. For Avro you are right: it was also because of the union. I just tried with MT_DOP and it gave me the expected results. thanks
... View more
03-09-2017
03:40 PM
Hey. You are right about the last profile, my bad... Let's do all 4 use cases this time. Here are the profiles on the raw table: no mt_dop on table: https://gist.github.com/momohuri/0b91468ae2526c4f5b0c4dba09172147 takes about the same time as: mt_dop=10 on table: https://gist.github.com/momohuri/62fb85bb251490b01d0ee7032f377cef For the view: no mt_dop (it still takes the same amount of time as the table): https://gist.github.com/momohuri/5954233819c54376c85c795b006f91ad mt_dop=10 (it takes 5x the time): https://gist.github.com/momohuri/2263742803326fc2467e96a2e2a9cb5a And you are right on the observation: the main difference I see in the profile/summary is the time spent on HDFS. thanks PS: One other observation: if I create a view of Parquet + Avro and query it with mt_dop=10, it seems to do the equivalent of mt_dop=0. I didn't run extensive tests on it, but that's my first impression
... View more
03-09-2017
01:45 PM
Hi, I have a view defined as `select * from t1 union select * from t2`. The 2 tables have an identical schema; t2 contains no data. Now three use cases:
- query with mt_dop=10 on the view: query is slow, ~10s https://gist.github.com/momohuri/9ad4ba8f6fbd1d180068c8c102291f69
- query with mt_dop=0 on the view: query is fast, ~1s https://gist.github.com/momohuri/c4347eb7ef70a8eec63a0a62638d1ce7
- query with mt_dop=10 not using a view/union: query is fast, ~1s https://gist.github.com/momohuri/62a00d9e381771aa9172b6dea09a5191
I hope those observations can help for the next releases 🙂 thanks
... View more
03-07-2017
12:27 PM
Hi, actually, after investigation the problem was completely unrelated to Impala... One of the machines in our cluster had a 100 Mbps Ethernet cable instead of a 10 Gbps one... thanks for your help
... View more
03-01-2017
07:31 PM
Hi, I have set up a new cluster with pretty much the same configuration as prod, and a similar number of machines. The new cluster is using CDH 5.10.0; the old one is using 5.9.1. This example is not as bad as what I saw before, but still... Impala 2.7: 79.49s https://gist.github.com/momohuri/38e5cce6d4f4dc1c45ac6db18fbc1a82 Impala 2.8: 129.59s https://gist.github.com/momohuri/9544c5a97e9ec40ea1ec71caf1f5a030 query 2: Impala 2.7: 62s https://gist.github.com/momohuri/c11f5cc7dc336af5ad1b1b605c523a1a Impala 2.8: 111s https://gist.github.com/momohuri/81586f032e24c3c530e49da75816acd3 The main difference that I see is the number of hosts: in 2.8 it is only using 3 hosts, but there are 8 available. Is there a reason for that? Is there something else that I am missing?
... View more
02-16-2017
03:41 PM
Unfortunately we have inserted a lot of data into those partitions since yesterday... And I didn't download the profile when I did my test on CDH 5.9.1; I just noted the time taken.
... View more
02-16-2017
01:16 PM
Hi, I tried to update to 5.10 yesterday, and I believe to Impala 2.8 (the logs/shell still say Impala 2.7 for CDH 5.10). And I was surprised by a big drop in performance for most of my queries. For queries with no join, using "set mt_dop=10" improved performance by a lot, but all the queries with "mt_dop=0" got way worse. In the shell I ran an aggregation query with no join, and it "Fetched 360 row(s) in 100.28s"; the summary is the following:
+---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------+
| 10:MERGING-EXCHANGE | 1 | 150.62us | 150.62us | 360 | 500 | 0 B | -1 B | UNPARTITIONED |
| 05:TOP-N | 9 | 137.12us | 218.93us | 360 | 500 | 16.00 KB | 11.72 KB | |
| 09:AGGREGATE | 9 | 10.72ms | 14.33ms | 360 | 6.15M | 10.89 MB | 17.20 MB | FINALIZE |
| 08:EXCHANGE | 9 | 86.09us | 95.65us | 3.24K | 6.15M | 0 B | 0 B | HASH(`date (hourly)`) |
| 04:AGGREGATE | 9 | 364.87ms | 384.52ms | 3.24K | 6.15M | 2.03 MB | 17.20 MB | STREAMING |
| 07:AGGREGATE | 9 | 5.27s | 5.52s | 38.07M | 226.95M | 714.15 MB | 18.60 GB | FINALIZE |
| 06:EXCHANGE | 9 | 608.64ms | 628.24ms | 115.78M | 226.95M | 0 B | 0 B | HASH(udid,`date (hourly)`) |
| 03:AGGREGATE | 9 | 17.56s | 24.24s | 115.78M | 226.95M | 2.63 GB | 18.60 GB | STREAMING |
| 00:UNION | 9 | 1.67s | 2.47s | 226.95M | 226.95M | 1.93 MB | 0 B | |
| |--02:SCAN HDFS | 9 | 543.42us | 602.49us | 0 | 0 | 0 B | 0 B | pocketgems_prod.customevent_chapterview_streaming |
| 01:SCAN HDFS | 9 | 291.51ms | 1.09s | 226.95M | 226.95M | 42.01 MB | 1.29 GB | pocketgems_prod.customevent_chapterview_batch |
+---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------+
It used to take around 35 seconds. I also downloaded the profile: https://gist.github.com/momohuri/c03683cd4263f48c1de5afd314d2662f The thing that surprised me is this RowsReturnedRate: 3. Any clue why this is happening? For now I went back to CDH 5.9.1. Thanks
... View more
01-26-2017
10:35 AM
Were you able to look at the profile, by any chance? thanks
... View more
01-25-2017
11:02 AM
Hi, I am trying to run the query by connecting directly to Impala using impala-shell on one of the daemon machines. I only use HAProxy, no other load balancer. This is what I get with the profile: compute incremental stats my_table;
Query: compute incremental stats my_table
WARNINGS:
Memory limit exceeded
The memory limit is set too low to initialize spilling operator (id=3). The minimum required memory to spill this operator is 272.00 MB.
Column some_column_name does not have statistics, recomputing stats for the whole table
[my_machine:21000] > profile;
Query Runtime Profile:
Query (id=c2428c691af2dcaa:55837fed00000000):
Summary:
Session ID: bb4992202447c47e:fe9f65a64f6da581
Session Type: BEESWAX
Start Time: 2017-01-25 10:58:31.528250000
End Time: 2017-01-25 10:59:00.884573000
Query Type: DDL
Query State: EXCEPTION
Query Status:
Memory limit exceeded
The memory limit is set too low to initialize spilling operator (id=3). The minimum required memory to spill this operator is 272.00 MB.
Impala Version: impalad version 2.7.0-cdh5.9.1 RELEASE (build 24ad6df788d66e4af9496edb26ac4d1f1d2a1f2c)
User: my_user
Connected User: my_user
Delegated User:
Network Address: ::ffff:172.16.0.221:46893
Default Db: my_db
Sql Statement: compute incremental stats my_table
Coordinator: my_machine:22000
Query Options (non default):
DDL Type: COMPUTE_STATS
: 0.000ns
Query Timeline: 29s356ms
- Start execution: 66.160us (66.160us)
- Planning finished: 28.190ms (28.124ms)
- Request finished: 29s090ms (29s062ms)
- Unregister query: 29s356ms (265.969ms)
ImpalaServer:
- ClientFetchWaitTimer: 0.000ns
- RowMaterializationTimer: 0.000ns
... View more
01-24-2017
12:24 PM
Hi, I am trying to compute incremental stats for one large table (~200 GB), but I get an out-of-memory error: Memory limit exceeded The memory limit is set too low to initialize spilling operator (id=3). The minimum required memory to spill this operator is 272.00 MB. It is a little strange to see that, because I have the memory limit in the daemon and in the shell set to 80 GB. Anyway, I investigated a little more and found this in the logs: W0124 12:13:21.800235 6746 HdfsScanNode.java:654] Per-host mem cost 8.25GB exceeded per-host upper bound 7.50GB. I get the same error whether I do it for the whole table or just one partition at a time. I couldn't find the parameter to increase the HdfsScanNode upper bound. Any idea how I could solve that? thanks ps: I am using CDH 5.9.1
... View more
12-02-2016
02:04 PM
Thanks for your answer! I actually want to group by all possible single-bit masks. Thanks to you I think I have a solution: I will first construct a table with all my masks (1, 2, 4, 8, ...) and then join it with my table using an "and" (bitwise AND) operator. That should "explode" my rows into all possible groups; filtering out "bit"=0 leaves only the valid groups. Not exactly sure how my query will look yet... but I think it should be possible that way. thanks
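The mask-join idea above can be sketched in a few lines of Python (toy data and names, purely for illustration): cross each row with every single-bit mask, keep the combinations where the bit is set, then group by mask.

```python
from collections import defaultdict

# Toy rows of ("user", some_integer); the integer is a bitfield.
rows = [("alice", 0b101), ("bob", 0b011), ("carol", 0b100)]
masks = [1 << i for i in range(3)]  # 1, 2, 4 — the mask table

# The join with an AND filter "explodes" rows into (mask, user) pairs,
# keeping only pairs where the bit is set, then groups by mask.
groups = defaultdict(set)
for user, value in rows:
    for m in masks:
        if value & m:           # the SQL "and" filter: drop bit=0 pairs
            groups[m].add(user)

counts = {m: len(users) for m, users in groups.items()}
print(counts)  # {1: 2, 2: 1, 4: 2}
```

This mirrors the SQL plan: mask table CROSS JOIN data table, WHERE (value & mask) != 0, GROUP BY mask with count(distinct user).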
... View more
12-01-2016
09:03 PM
Hi, I have a table with two fields, "user" and "some_integer", and I was wondering if there is a way of doing something like: select count(distinct("user")), "bit" from random_table group by bit(some_number); Thanks
... View more
11-14-2016
02:35 PM
2 Kudos
Hi, I rebooted the machine that hosts Cloudera Manager, but after the reboot I got a problem. In Parcels I see that CDH 5.9.0-1.cdh5.9.0.p0.23 is being distributed, but it is stuck on one machine at the activation step. I tried to restart and hard-restart the agent and the server, but I always get the same problem. The machine it is stuck on is the same machine that hosts Cloudera Manager, so there shouldn't be any network problem. The agent log keeps looping over the same messages: [14/Nov/2016 14:30:08 +0000] 24070 MainThread client_configs ERROR Failed to deploy client config <hadoop-conf,/etc/hadoop/conf.cloudera.hdfs>: No parcel provided required tags: set([u'cdh'])
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/client_configs.py", line 769, in rectify
deploy_path = self._deploy_client_config(new_ccs[key])
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/client_configs.py", line 479, in _deploy_client_config
env, parcels_in_use = self._adapt_cc_to_env(self.deploy_env, cc)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/client_configs.py", line 670, in _adapt_cc_to_env
cc["optional_tags"])
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/parcel.py", line 383, in prepare_environment
raise ParcelTagUnsatisfiedException("No parcel provided required tags: %s" % (missing_reqs,))
ParcelTagUnsatisfiedException: No parcel provided required tags: set([u'cdh'])
[14/Nov/2016 14:30:08 +0000] 24070 Thread-13 https ERROR Failed to retrieve/stroe URL: http://my_server:7180/cmf/parcel/download/CDH-5.9.0-1.cdh5.9.0.p0.23-trusty.parcel.torrent -> /opt/cloudera/parcel-cache/CDH-5.9.0-1.cdh5.9.0.p0.23-trusty.parcel.torrent HTTP Error 404: Not Found
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/https.py", line 175, in fetch_to_file
resp = self.open(req_url)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/https.py", line 170, in open
return self.opener(*pargs, **kwargs)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/https.py", line 205, in http_error_default
raise e
HTTPError: HTTP Error 404: Not Found
[14/Nov/2016 14:30:08 +0000] 24070 Thread-13 downloader ERROR Failed fetching torrent: HTTP Error 404: Not Found
Traceback (most recent call last):
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/downloader.py", line 263, in download
cmf.https.ssl_url_opener.fetch_to_file(torrent_url, torrent_file)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/https.py", line 175, in fetch_to_file
resp = self.open(req_url)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/https.py", line 170, in open
return self.opener(*pargs, **kwargs)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.9.0-py2.7.egg/cmf/https.py", line 205, in http_error_default
raise e
HTTPError: HTTP Error 404: Not Found
[14/Nov/2016 14:30:23 +0000] 24070 MainThread parcel INFO The following requested parcels are not available: {u'CDH': u'5.9.0-1.cdh5.9.0.p0.23'}
Any idea? thanks
... View more
11-01-2016
09:31 PM
Hi, quick question on performance: if I have 2 tables, the first one with columns "a,b" and the second one with columns "c,d", and I create a view like the following: CREATE VIEW my_view AS (
select a,b,null,null from table_1
union
select null,null,c,d from table_2) Now if I run a simple query like select a from my_view, will the query only read from table_1, or will the entire table_2 also be scanned? (I am mainly worried about disk reads.) Thanks
... View more
- Tags:
- impala
- performance
09-29-2016
12:37 PM
Hi,
I just saw that blog post about spark 2.0 beta : http://blog.cloudera.com/blog/2016/09/apache-spark-2-0-beta-now-available-for-cdh/
And I have a quick question: once Spark 2.0 is installed on the cluster, how do I choose whether my job runs on Spark 2.0 or 1.6?
thanks
... View more
09-26-2016
12:52 PM
Hi, I upgraded Impala to 2.6. The query aggregation improved by about 15%. Is there an open ticket or an expected release date/version for the "full parallelization"? thanks
... View more
09-22-2016
07:23 PM
Hi, I will update to 2.6 over the weekend and post the results. I have 32 cores per host available to the Impala daemon. If you say that 10 million records are being processed in parallel, I guess you imply that only one core is used per host (268M rows / 6 hosts / 4 sec = ~11 million). Is it expected to have only 1 core used per node? Did I miss something in the configuration? Or is it because of the multi-threaded aggregation improvement that you are working on? I just want to make sure I didn't miss any obvious optimization. And just so you know, the column is of type "string". thanks
... View more
09-21-2016
03:43 PM
Hi, I am running Impala 2.5 on CDH 5.7.3. I am currently benchmarking a simple query: select count(*),`session_id` from flat_table group by `session_id` limit 10; Here are the results of 'summary':
+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+
| 04:EXCHANGE | 1 | 13.63us | 13.63us | 10 | 10 | 0 B | -1 B | UNPARTITIONED |
| 03:AGGREGATE | 6 | 1.11s | 1.15s | 60 | 247.06M | 171.09 MB | 128.00 MB | FINALIZE |
| 02:EXCHANGE | 6 | 86.76ms | 92.08ms | 12.94M | 247.06M | 0 B | 0 B | HASH(session_id) |
| 01:AGGREGATE | 6 | 4.07s | 6.14s | 12.94M | 247.06M | 525.03 MB | 128.00 MB | STREAMING |
| 00:SCAN HDFS | 6 | 337.83ms | 494.40ms | 268.67M | 247.06M | 145.36 MB | 88.00 MB | flat_table |
+--------------+--------+----------+----------+---------+------------+-----------+---------------+-----------------------------------------+
We can easily see that most of the time is going into the aggregate part, and I have a lot of queries with the same bottleneck. I have control over the hardware and the Impala configuration. The table is a Parquet table, cached in HDFS, with incremental stats for each partition. Am I missing something, or is this the expected performance for a query like this? Thanks
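For readers of the summary above: the plan is a two-phase aggregation — each host pre-aggregates locally (01:AGGREGATE STREAMING), partial results are redistributed by hash(session_id) (02:EXCHANGE), and each host merges its share (03:AGGREGATE FINALIZE). A toy Python sketch of that shape (hypothetical data, not Impala internals):

```python
from collections import Counter

def pre_aggregate(rows):
    """Per-host STREAMING phase: partial count per key."""
    return Counter(rows)

def exchange(partials, n_hosts):
    """Redistribute partial counts so each key lands on exactly one host."""
    shards = [Counter() for _ in range(n_hosts)]
    for partial in partials:
        for key, cnt in partial.items():
            shards[hash(key) % n_hosts][key] += cnt
    return shards

def finalize(shards):
    """FINALIZE phase: merge partials into final counts (combined here)."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total

hosts = [["s1", "s2", "s1"], ["s2", "s3"]]      # rows scanned per host
partials = [pre_aggregate(h) for h in hosts]
result = finalize(exchange(partials, 2))
print(dict(result))  # {'s1': 2, 's2': 2, 's3': 1}
```

The pre-aggregation only pays off when keys repeat within a host; with a high-cardinality key like session_id, most time ends up in the hashing phases, which matches the 01/03 AGGREGATE rows dominating the summary.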
... View more