Member since: 09-28-2015
Posts: 65
Kudos Received: 14
Solutions: 9
My Accepted Solutions
Views | Posted |
---|---|
2919 | 01-30-2018 02:53 PM |
1607 | 08-08-2017 08:52 AM |
7301 | 07-27-2017 11:33 AM |
01-08-2019
08:55 PM
1 Kudo
In such a small cluster I'd definitely consider doubling up masters and tservers on all of the master nodes (i.e., 3 masters and 5 tservers). The master is pretty lightweight and can be colocated with tservers for such a small workload. This way you'll get better fault tolerance and also better performance than you would by leaving 2/5 of the nodes mostly unutilized. -Todd
07-05-2018
01:37 PM
Typically the default encoding (BITSHUFFLE) works very well for timestamps, so you shouldn't need to tweak it. -Todd
07-05-2018
01:17 PM
Hi, Kudu can certainly scale to tens of thousands of point queries per second, similar to other NoSQL systems. For example, in preparing the slides posted on https://kudu.apache.org/2017/10/23/nosql-kudu-spanner-slides.html I ran a random-read benchmark using 5 16-core GCE machines and got 12k reads/second. Since then we've made significant improvements in random read performance and I expect you'd get much better than that if you were to re-run the benchmark on the latest versions. In a more recent benchmark on a 6-node physical cluster I was able to achieve over 100k reads/second. Keep in mind that such numbers are only achievable through direct use of the Kudu API (i.e., Java, C++, or Python) and not via SQL queries through an engine like Impala or Spark. Typically those engines are better suited to longer (>100ms) analytic queries than to high-concurrency point lookups. -Todd
05-21-2018
09:23 AM
1 Kudo
The 'data size' is just the underlying columnar data blocks, with compression. The total 'on disk size' is inclusive of some other structures like bloom filters (approximately 10 bits per row) as well as the synthetic composite key column. If you have 8 int64s as your primary key, this column would be about 64 bytes per row prior to compression. Depending on the cardinalities of these columns it's quite possible that they compress poorly. -Todd
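To make the arithmetic above concrete, here is a rough sketch of the per-row overhead from the two structures mentioned: the synthetic composite key and the bloom filters. The helper names and the row count are hypothetical, the 10 bits per row is the approximate figure quoted above, and everything is computed before compression, so treat the result as an upper-bound style estimate rather than what Kudu will actually report.

```python
# Rough per-row overhead estimate using the figures quoted in the post.
# All names here are hypothetical helpers, not part of any Kudu API.

INT64_BYTES = 8
BLOOM_BITS_PER_ROW = 10  # approximate figure from the post


def composite_key_bytes(num_int64_key_columns: int) -> int:
    """Uncompressed size of the synthetic composite key for one row."""
    return num_int64_key_columns * INT64_BYTES


def extra_bytes_per_row(num_int64_key_columns: int) -> float:
    """Composite key plus bloom filter bits for one row, before compression."""
    return composite_key_bytes(num_int64_key_columns) + BLOOM_BITS_PER_ROW / 8


def extra_bytes_total(num_rows: int, num_int64_key_columns: int) -> float:
    """Overhead for the whole table, before compression."""
    return num_rows * extra_bytes_per_row(num_int64_key_columns)


if __name__ == "__main__":
    rows = 1_000_000_000  # hypothetical row count
    per_row = extra_bytes_per_row(8)
    total_gb = extra_bytes_total(rows, 8) / 1e9
    print(f"{per_row:.2f} bytes/row, {total_gb:.2f} GB before compression")
```

With 8 int64 key columns this gives 64 bytes of composite key plus 1.25 bytes of bloom filter per row. How much of that survives on disk depends on how well the key columns compress.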
05-10-2018
09:53 AM
Hi, Is it possible to reproduce this with a smaller sample data set that you can share? I'm not aware of any such bugs but it's possible you've discovered something new. -Todd
04-02-2018
10:09 AM
Hi, There is not a command to do this. However, if you are using Cloudera Manager, you can navigate to the "Charts Library" page under the Kudu service, select "Tables" on the left-hand side, and then select the table of interest. This should give various metrics including its size on disk (post-replication). Hope that helps. -Todd
02-01-2018
02:19 PM
Hi, That issue has been reported before as https://issues.apache.org/jira/browse/KUDU-1989 but we haven't been able to reproduce it. Would it be possible to email the .metadata file to me at todd@cloudera.com? I can take a look and see if we can get closer to a root cause. Regarding the crashes, you may be able to find an error message if you look in dmesg or in the stdout/stderr files in the Cloudera SCM process directory. -Todd
01-31-2018
10:05 AM
1 Kudo
It looks like your screenshot is of the "scans" dashboard on the web UI. This dashboard shows counters for a single scan, and a single scan would only come from a single task, not aggregate across them. I am guessing you're hitting KUDU-2231, a performance bug recently fixed. The bug fix appears in CDH 5.14.0. Since this is a performance issue that is not a regression and does not affect correctness, we have not yet backported to any prior releases. -Todd
01-31-2018
10:01 AM
1 Kudo
Currently Kudu does not balance on a per-table basis. So, it's possible that for a newly created table, it will not be equally spread across the cluster. Does the server with no tablets in this table have tablets from _other_ tables? The placement algorithm attempts to balance the total count of tablets across servers. We're currently working on some balancing improvements that will take per-table balancing into consideration for future releases. -Todd
01-30-2018
04:48 PM
That's correct, I am not aware of a workaround for this issue.
01-30-2018
02:53 PM
1 Kudo
Your non-JOIN queries probably work because Impala is scheduling for locality and only scheduling work on nodes with Kudu running. When you join with HDFS data, some work is scheduled on all of the nodes in the cluster, and then those tasks running on non-Kudu nodes still need to write output to Kudu. -Todd
01-30-2018
11:08 AM
Hi, Unfortunately the Kudu client is built in such a way that it requires SSE4.2. The CPU you are running on was discontinued in Q4 2010 and is not supported by Kudu. That includes the Kudu client, which is used by Impala. As a result, you will not be able to query Kudu tables in a mixed cluster where some Impala daemons run on CPUs that do not support SSE4.2. -Todd
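If it helps to check a host before adding it to the cluster, here is a minimal sketch that looks for the SSE4.2 flag. It assumes a Linux x86 machine, where the kernel lists CPU features on the "flags" line of /proc/cpuinfo and reports SSE4.2 as sse4_2. The function names are hypothetical and this is not something Kudu or Impala ships.

```python
import os


def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags" entries carry feature names on x86 Linux.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_sse42(cpuinfo_text):
    """True if the text advertises SSE4.2 (reported by Linux as 'sse4_2')."""
    return "sse4_2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # Linux only; other platforms need another method
    if os.path.exists(path):
        with open(path) as f:
            print("SSE4.2 supported:", supports_sse42(f.read()))
    else:
        print("No /proc/cpuinfo on this platform")
```

Running this on each Impala daemon host would show which ones cannot load the Kudu client.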
01-30-2018
10:48 AM
That's correct. Please see https://www.cloudera.com/documentation/kudu/latest/topics/kudu_known_issues.html#schema_design_limitations which notes:
- DECIMAL, CHAR, VARCHAR, DATE, and complex types such as ARRAY are not supported.
We are currently working on DECIMAL and hope to have it supported in an upcoming release. -Todd
01-30-2018
10:46 AM
Hi, It sounds likely that you are hitting this bug: https://issues.apache.org/jira/browse/KUDU-2209 The bug fix for this is included in CDH 5.13.1 as well as 5.14.0, so I'd recommend upgrading at your convenience. -Todd
10-23-2017
08:43 PM
Are you looking to use the direct API or would you be OK with a SQL solution? Certainly the easiest would be to simply use a SQL query from Impala such as:
update t set my_col='foo' where bar > 123;
If you want to use the API, I'd suggest using the AUTO_FLUSH_BACKGROUND mode to ensure that many updates get batched together into a single round-trip RPC to the tablet servers. -Todd
09-07-2017
11:17 AM
It seems like this is a PyODBC-specific issue. See https://github.com/mkleehammer/pyodbc/issues/213 where people are asking for input on how to explicitly specify the size of an int placeholder column. Maybe there is some workaround such as using:
insert into genomics.pipeline_status (id, experiment_id) values (cast(? as int), cast(? as int))
... though I have not tried that. -Todd
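For reference, the suggested workaround might look like this from Python. This is an untested sketch: the helper name is hypothetical, the DSN in the usage comment is made up, and whether the casts actually avoid the placeholder sizing problem is exactly the part that has not been verified. The only pyodbc call used is cursor.execute() with a parameter sequence.

```python
# Statement from the post, with explicit casts around both placeholders.
INSERT_SQL = (
    "insert into genomics.pipeline_status (id, experiment_id) "
    "values (cast(? as int), cast(? as int))"
)


def insert_pipeline_status(cursor, row_id, experiment_id):
    """Run the insert on a DB-API cursor (for example a pyodbc cursor)."""
    cursor.execute(INSERT_SQL, (row_id, experiment_id))


# Usage sketch, assuming pyodbc is installed and a DSN named "impala" exists
# (both the DSN name and the values are hypothetical):
#
#   import pyodbc
#   conn = pyodbc.connect("DSN=impala", autocommit=True)
#   insert_pipeline_status(conn.cursor(), 1, 42)
```

Keeping the SQL in one constant makes it easy to drop the casts again if they turn out not to help.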
08-08-2017
08:52 AM
1 Kudo
It looks like you're using the C++ client. Given that, you can use the KuduSession::SetTimeout() API: https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduSession.html#a25b22362650d7120f59cc1025e40bd79 -Todd
08-07-2017
11:24 AM
1 Kudo
Hi, If you simply increase your timeout, the client itself has built-in retries and will keep trying to complete the insert until the given time has elapsed. In a scenario that is not latency-sensitive I would recommend increasing the timeout to a minute or two. -Todd
07-27-2017
11:33 AM
2 Kudos
Can you try changing the encoding of your primary key int column to 'PLAIN_ENCODING' instead of the default AUTO_ENCODING? I think that should resolve your problem (at the expense of some disk space).
07-26-2017
11:26 PM
You could use 'tinker step 500' and have the effect that stepping would only be enabled for time differences more than 500ms. I wouldn't consider this breaking your production environment, but I guess you may have some reason that '-x' is important to you. We'll work on addressing this in a future release so that no system-wide changes are necessary. -Todd
07-26-2017
10:47 PM
Hi folks, I spent some time looking into this and agree that running ntpd with the '-x' option will make Kudu crash (likely after 8 hours and 53 minutes based on my math). I wrote some details here: https://issues.apache.org/jira/browse/KUDU-2079 -Todd
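The post does not spell out the math, so the following is only one reading that happens to land on the same figure. It assumes the kernel's reported maximum clock error grows at 500 ppm while ntpd is not disciplining it, and that the clock is treated as unsynchronized once that error reaches 16 seconds. Both numbers are assumptions here, not something stated above; the JIRA has the actual details.

```python
# Back-of-the-envelope check of the "8 hours and 53 minutes" figure.
# Both constants below are assumptions, not values taken from the post.
MAX_ERROR_LIMIT_SECONDS = 16.0  # assumed limit before the clock counts as unsynchronized
ERROR_GROWTH_PPM = 500.0        # assumed growth rate of the reported max error


def seconds_until_limit(limit_seconds, growth_ppm):
    """Time for an error growing at growth_ppm to reach limit_seconds."""
    return limit_seconds / (growth_ppm * 1e-6)


def as_hms(total_seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    total = int(round(total_seconds))
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return hours, minutes, seconds


if __name__ == "__main__":
    elapsed = seconds_until_limit(MAX_ERROR_LIMIT_SECONDS, ERROR_GROWTH_PPM)
    hours, minutes, seconds = as_hms(elapsed)
    print(f"{elapsed:.0f} s = {hours} h {minutes} min {seconds} s")
```

Under those assumptions the limit is reached after 32000 seconds, which is 8 hours, 53 minutes and 20 seconds.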
06-27-2017
04:53 PM
Hi Nimrod, Unfortunately we have not prioritized speeding up ALTER TABLE operations as mentioned in the post you quoted. We've found that for most users and customers, ALTER TABLE is an infrequent operation, so the few seconds it typically takes is not problematic. I'd be interested to learn more about your use case, though, to help us prioritize. -Todd
03-21-2017
10:33 PM
Hi, Are you pooling clients? Are you sure that your Storm workers are not suffering GC pauses? It would be worth running the Storm workers with verbose GC logging and checking that you aren't seeing GC around the same time. The server-side log indicates that the workers are connecting and then taking 6-10 seconds to perform the necessary round trips for RPC connection negotiation. So, the client is disconnecting them. This would also explain the throughput drop. -Todd
03-16-2017
02:20 PM
It also would be helpful to upload a copy of the maintenance manager dashboard. It may be that the "not running" tasks are not running because they are not eligible to run. -Todd
02-27-2017
10:36 PM
Hi Joaquin, It sounds like you're using an old unsupported version of IMPALA_KUDU. Could you please upgrade to Impala from CDH 5.10 and see if the problem persists? We are no longer supporting these beta releases of the Impala-Kudu integration now that the "stock" Impala in CDH 5.10 natively supports Kudu. -Todd