Member since: 04-08-2014
Posts: 70
Kudos Received: 20
Solutions: 12

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3796 | 07-16-2018 04:12 PM
 | 3911 | 07-13-2018 03:17 PM
 | 4691 | 07-10-2018 03:00 PM
 | 4085 | 07-10-2018 02:54 PM
 | 4946 | 07-05-2018 03:35 PM
04-25-2019
12:43 AM
Let's try to rule out various types of problems.

1. Are you able to read/write to Kerberos-enabled HDFS with PySpark? Is Kudu the only Kerberos-enabled service that is not working from within PySpark?
2. Have you checked that the Spark driver is running on the host and shell you kinited from, rather than being started in a YARN container? If it's running in YARN, you have to give YARN access to the keytab to run as.
3. Have you tried connecting to Kudu with the regular Spark shell? Does it work? For examples, see https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
04-23-2019
05:45 PM
Is your cluster Kerberos-enabled? If so, did you kinit before running the job? Try a local driver before trying a distributed driver to rule out keytab-related issues.
01-16-2019
08:55 AM
1 Kudo
Kudu runs as a separate service that Impala talks to (like HDFS runs as a separate service from Impala) so you have to have Kudu running somewhere for it to work. However you don't have to run Kudu on the same servers that you run Impala on -- remote reads are supported over the network.
01-16-2019
08:46 AM
EricL is correct, you don't need to worry about files with Kudu in the same way that you have to worry about them with typical Hive tables. Kudu stores its data directly on ext4 in a distributed way and does not use HDFS. You can see where Kudu is storing its data on the local file system by going into Cloudera Manager and looking at how the --fs-data-dirs and --fs-wal-dir configuration options are set across the various Tablet Server nodes. Hope that helps, Mike
11-28-2018
03:27 PM
I would strongly recommend upgrading from your older version of Kudu, because there have been many improvements that address the issues you are describing. See the release notes for the Kudu releases after 1.3.0; many of these fixes will help you:

https://kudu.apache.org/releases/1.4.0/docs/release_notes.html
- The default size for Write Ahead Log (WAL) segments has been reduced from 64MB to 8MB. Additionally, in the case that all replicas of a tablet are fully up to date and data has been flushed from memory, servers will now retain only a single WAL segment rather than two. These changes are expected to reduce the average consumption of disk space on the configured WAL disk by 16x, as well as improve the startup speed of tablet servers by reducing the number and size of WAL segments that need to be re-read.
- The Maintenance Manager has been improved to make better use of the configured maintenance threads. Previously, maintenance work would only be scheduled a maximum of 4 times per second, but now maintenance work will be scheduled immediately whenever any configured thread is available. This can improve the throughput of write-heavy workloads.
- KUDU-2020: Fixed an issue where re-replication after a failure would proceed significantly slower than expected. This bug caused many tablets to be unnecessarily copied multiple times before successfully being considered re-replicated, resulting in significantly more network and IO bandwidth usage than expected. Mean time to recovery on clusters with large amounts of data is improved by up to 10x by this fix.

https://kudu.apache.org/releases/1.6.0/docs/release_notes.html
- Tablet server startup time has been improved significantly on servers containing large numbers of blocks.

https://kudu.apache.org/releases/1.7.0/docs/release_notes.html
- The strategy Kudu uses for automatically healing tablets which have lost a replica due to server or disk failures has been improved. The new re-replication strategy, or replica management scheme, first adds a replacement tablet replica before evicting the failed one. With the previous scheme, the system first evicted the failed replica and then added a replacement. The new scheme allows for much faster recovery of tablets in scenarios where one tablet server goes down and then comes back within 5 minutes or so, and it also provides substantially better overall stability on clusters with frequent server failures. (See KUDU-1097.)

https://kudu.apache.org/releases/1.8.0/docs/release_notes.html
- Introduced a manual data rebalancer into the kudu CLI tool. The rebalancer can be used to redistribute table replicas among tablet servers, and is run via the kudu cluster rebalance sub-command. Using the new tool, it's possible to rebalance Kudu clusters of version 1.4.0 and newer. (Note: CDH 5.16.1 doesn't include everything new from Kudu 1.8.0, only a few things like the rebalancer, but CDH 5.15.1 includes everything from Kudu 1.7.0 and earlier.)

If you can, upgrade to CDH 5.15.1 or CDH 5.16.1. There are also many other improvements unrelated to startup time that I have not called out here, such as a greatly reduced thread count, various optimizations, many other bug fixes, and lots of improvements for observability and operability.
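The 16x figure in the 1.4.0 notes follows directly from the two changes described there. A quick back-of-the-envelope sketch (sizes from the release notes above):

```python
# Check of the ~16x WAL disk-usage reduction described in the
# Kudu 1.4.0 release notes (sizes in MB, per quiescent tablet).
old_segment_mb = 64   # pre-1.4.0 default WAL segment size
new_segment_mb = 8    # 1.4.0 default WAL segment size
old_retained = 2      # segments retained when fully flushed, pre-1.4.0
new_retained = 1      # segments retained when fully flushed, 1.4.0

old_usage = old_segment_mb * old_retained   # 128 MB per idle tablet
new_usage = new_segment_mb * new_retained   # 8 MB per idle tablet
print(old_usage / new_usage)  # -> 16.0
```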
11-27-2018
01:02 PM
1 Kudo
A WAL file is a Kudu tablet write-ahead log file. You can read an overview of how the Kudu write path works here (it's a fairly technical blog post): https://blog.cloudera.com/blog/2017/04/apache-kudu-read-write-paths/ The WAL file location is controlled by the configuration parameter --fs_wal_dir, which you can read about at https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_fs_wal_dir
09-28-2018
11:36 AM
FYI, I think my reply from 9/21 was wrong. As far as I can tell, the rules work as follows:

1. If the Kudu table is managed by Impala (i.e. it's not an EXTERNAL table), it's not possible to change the kudu.table_name and kudu.master_addresses properties. See https://issues.apache.org/jira/browse/IMPALA-5654 for more information on that. I have filed an improvement request to track automatically renaming the Kudu table when the Impala table is renamed, to keep them in sync, but right now it's not possible; see https://issues.apache.org/jira/browse/IMPALA-7640 for more information.
2. If you have an EXTERNAL table (a Kudu table not managed by Impala), then you are able to alter the kudu.table_name table property.

The above was tested on a non-secure cluster, and I would be interested to hear whether others see the same behavior on a secured cluster; I believe it is the same in both cases. Hope this helps, Mike
09-25-2018
04:08 PM
Just following up here: I just tested this on Impala version 2.13 (dev version) and I cannot reproduce the ability to use alter table set tblproperties to rename the Kudu table, even after altering the Impala table name. Is anyone else able to reproduce this? I get the following error:

> alter table mpercy_k2 set tblproperties('kudu.table_name'='impala::default.mpercy_k2');
ERROR: AnalysisException: Not allowed to set 'kudu.table_name' manually for managed Kudu tables.

However, this is by design from what I have discussed with some others. I think the "bug" is that Impala's alter table doesn't automatically rename the Kudu table internally. It would also be a security problem to be able to alter the Kudu table name with tblproperties, because Sentry applies the security rules to the Impala table name.
09-21-2018
06:35 PM
2 Kudos
@Ankit_Mishra's answer is the correct way to do the procedure you want; Impala doesn't allow separately managing the Kudu and Impala tables if you create the Kudu table through Impala.
09-21-2018
06:16 PM
@Andreyeff Another thing you can try is increasing the raft heartbeat interval from 500ms to 1500ms or even 3000ms; see https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_raft_heartbeat_interval_ms Note that this will lengthen your recovery time by a few seconds if a leader fails, since by default elections don't happen until 3 heartbeat periods have been missed (controlled by https://kudu.apache.org/docs/configuration_reference.html#kudu-tserver_leader_failure_max_missed_heartbeat_periods ).
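To make that tradeoff concrete, here is a small arithmetic sketch of how the two flags combine. The numbers are the defaults mentioned above (3 missed heartbeat periods before an election); the function name is just illustrative:

```python
# Illustrative sketch: an election is triggered only after
# leader_failure_max_missed_heartbeat_periods consecutive missed
# heartbeats, so raising the heartbeat interval lengthens the time
# before a failed leader is replaced.
def failure_detection_ms(heartbeat_interval_ms, max_missed_periods=3):
    return heartbeat_interval_ms * max_missed_periods

print(failure_detection_ms(500))   # default: 1500 ms before an election
print(failure_detection_ms(1500))  # -> 4500 ms
print(failure_detection_ms(3000))  # -> 9000 ms
```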
09-18-2018
09:47 AM
Frankly it sounds like you should revisit your capacity planning. You can try bumping up raft consensus timeouts and upgrading to the latest version of Kudu but it may not help that much.
09-17-2018
03:58 PM
There is documentation for how to enable Kudu security on CDH 5.13.0 here: https://www.cloudera.com/documentation/enterprise/5-13-x/topics/kudu_security.html#concept_syg_k35_lz Please follow those steps and let us know if it still doesn't work for you. Thanks, Mike
09-14-2018
10:46 AM
Some more questions: When was the last time the cluster worked? What has changed since then?
09-13-2018
12:21 PM
> 3) Do you see any errors when you run the following command?
>
> sudo -u kudu kudu cluster ksck <master-addresses>
>
> See https://www.cloudera.com/documentation/enterprise/5-13-x/topics/kudu_administration_cli.html#ksck for documentation on running ksck.

=> yes a lot...

OK, you will need to take a look at the tserver logs to figure out what is going on, but it sounds like something is wrong with your tablet servers. Can you post any error messages you see in the kudu-tserver.INFO logs?

> 4) Is Impala configured with the correct --kudu_master_hosts flag? It should be configured to talk to all of the masters. See https://www.cloudera.com/documentation/enterprise/5-13-x/topics/kudu_impala.html for documentation on that.

> No, how can I configure --kudu_master_hosts in Cloudera Manager? I don't find this setting.

I just checked my dev cluster and you probably don't have to change anything; Cloudera Manager will automatically set it for Impala if you have a Kudu Service configured for it. I think your problem is with your Kudu tablet servers, not with Impala.
09-12-2018
02:50 PM
Tomas is correct; the latest version of Kudu can support adding / removing directories, but cannot rebalance the data usage across Kudu data directories. FYI, I filed a tracking JIRA for this feature request at https://issues.apache.org/jira/browse/KUDU-2577
09-12-2018
02:43 PM
It sounds like Impala might be configured to talk to the wrong master, or one of the Kudu masters is stuck and needs to be repaired.

1) How many Kudu master servers are you running?
2) Do you see any error messages in the Kudu master log file(s)?
3) Do you see any errors when you run the following command? sudo -u kudu kudu cluster ksck <master-addresses> See https://www.cloudera.com/documentation/enterprise/5-13-x/topics/kudu_administration_cli.html#ksck for documentation on running ksck.
4) Is Impala configured with the correct --kudu_master_hosts flag? It should be configured to talk to all of the masters. See https://www.cloudera.com/documentation/enterprise/5-13-x/topics/kudu_impala.html for documentation on that.
09-10-2018
06:04 PM
Unfortunately you are already past the recommended / supported number of tablets per server; see https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_limitations.html#scaling_limits In general, it will require engineering work to push past that limit, and we don't have anyone working on it at the moment.
08-01-2018
11:55 AM
@Andreyeff Please remind us, what version are you running at this time? Do you see anything related in the Catalog Daemon log?
07-16-2018
04:13 PM
However, I see you wrote a separate forum post, which is good; we try to stick with one topic per thread in the forums.
07-16-2018
04:12 PM
If you have the data in Oracle, I would suggest writing it to Parquet on HDFS using Sqoop first. After that, you will be able to transfer the data to Kudu using Impala with a command like:

CREATE TABLE kudu_table STORED AS KUDU AS SELECT * FROM parquet_table;
07-13-2018
03:17 PM
1 Kudo
The only way I know of to do complex queries through Java is to use the Impala JDBC connector, which you can find here: https://www.cloudera.com/downloads/connectors/impala/jdbc/2-6-3.html
07-12-2018
09:32 AM
1 Kudo
Kudu is only a storage engine. If you want sophisticated query processing capabilities, you have to use a query engine on top of Kudu that has an integration; mainly that would be Impala or Spark. You can use JDBC or the Spark APIs to access those systems from Java.

Here is how to use Impala with Kudu: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/kudu_impala.html
Here is a blog article showing how to use Spark with Kudu in Java: https://blog.cloudera.com/blog/2017/02/up-and-running-with-apache-spark-on-apache-kudu/

Does that answer your question?
07-11-2018
02:54 PM
Sounds like good news. Thanks for the update!
07-10-2018
03:00 PM
1 Kudo
Are you sure the bottleneck is Kudu? Maybe the bottleneck is reading from Oracle? Using the Kudu AUTO_FLUSH_BACKGROUND mode should give pretty fast write throughput; see https://kudu.apache.org/apidocs/org/apache/kudu/client/SessionConfiguration.FlushMode.html You can also try increasing the KuduSession.setMutationBufferSpace() value, and consider your partitioning scheme. If you want more parallelism, you can also scan different ranges in Oracle with different processes or threads on the same or different client machines and perform more parallelized writes to Kudu.
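The "scan different ranges in parallel" idea boils down to carving the source table's key space into chunks, one per worker. A minimal sketch of that chunking, with illustrative names (lo, hi, n_workers are assumptions, not anything from the Kudu or Oracle APIs):

```python
# Hypothetical sketch: split a numeric primary-key range into
# near-equal half-open chunks so each worker can scan one chunk
# from Oracle and write it to Kudu independently.
def split_range(lo, hi, n_workers):
    """Return (start, end) half-open chunks covering [lo, hi)."""
    step, rem = divmod(hi - lo, n_workers)
    chunks, start = [], lo
    for i in range(n_workers):
        end = start + step + (1 if i < rem else 0)  # spread the remainder
        chunks.append((start, end))
        start = end
    return chunks

print(split_range(0, 10_000_000, 4))
# -> [(0, 2500000), (2500000, 5000000), (5000000, 7500000), (7500000, 10000000)]
```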
07-10-2018
02:54 PM
1 Kudo
Hi HJ, It is not possible to do a join using the native Kudu NoSQL API. You will need to use SQL with Impala or Spark SQL, or use the Spark DataFrame APIs to do the join. Mike
07-09-2018
11:41 AM
You're welcome. If that worked for you, please mark my response as the answer / solution to your question.
07-05-2018
03:35 PM
1 Kudo
Another option is to write a Spark job that uses multiple tasks to read from Oracle and write to Kudu in parallel, or something equivalent using multiple processes or threads.
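As a rough, standard-library-only sketch of the "multiple tasks in parallel" shape (read_chunk and write_chunk are placeholders for the real Oracle read and Kudu write calls, not actual APIs):

```python
# Thread-based sketch of a parallel copy pipeline. Each worker reads
# one chunk from the source and writes it to the sink; the real job
# would use JDBC/Oracle reads and Kudu session writes instead.
from concurrent.futures import ThreadPoolExecutor

def read_chunk(chunk_id):
    # placeholder: would SELECT one key range from Oracle
    return [f"row-{chunk_id}-{i}" for i in range(3)]

def write_chunk(rows):
    # placeholder: would insert the rows through a Kudu session
    return len(rows)

def copy_parallel(n_chunks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda c: write_chunk(read_chunk(c)), range(n_chunks))
    return sum(counts)

print(copy_parallel(8))  # -> 24 rows copied
```

A Spark job gets the same effect for free by assigning one partition per task.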
07-05-2018
03:27 PM
One option is to export to Parquet on HDFS using Sqoop, then use an Impala CREATE TABLE ... AS SELECT * FROM statement over the Parquet table to load the data into your Kudu table. Unfortunately, Sqoop does not have support for Kudu at this time.
07-05-2018
10:44 AM
2 Kudos
You will have to use "dd" to remove the last record of the container file. The latest version of Kudu trunk (after 5.15) contains a --debug option to the "kudu pbc dump" tool that will tell you the offset at which you should truncate the file, if you compile it. If you can't compile Kudu from source to obtain that tool, then an easy option is to reformat the affected tablet server and start from scratch on that server, if you have additional replicas. Another option is to use a hex editor to figure out the offset where there is a run of 0s at the end of the file and truncate the 0s off of the file. Make sure to make a backup copy of the container metadata file first. This will be prevented in a future release.
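As a hedged illustration of the hex-editor approach, here is a small Python sketch that finds the trailing run of zero bytes, backs the file up, and truncates at that offset. The helper name is hypothetical, and on a real container metadata file you should verify the offset first (e.g. with the "kudu pbc dump --debug" tool mentioned above) since legitimate data can also end in zero bytes:

```python
# Sketch only: truncate the trailing run of zero bytes from a file,
# keeping a .bak copy. Verify the offset independently before using
# this idea on a real Kudu container metadata file.
import shutil

def truncate_trailing_zeros(path):
    with open(path, "rb") as f:
        data = f.read()
    end = len(data)
    while end > 0 and data[end - 1] == 0:
        end -= 1                           # walk back over the zero run
    shutil.copyfile(path, path + ".bak")   # always back up first
    with open(path, "r+b") as f:
        f.truncate(end)
    return end

# demo on a throwaway file: 11 bytes of data plus a 4 KB zero run
with open("demo.bin", "wb") as f:
    f.write(b"record-data" + b"\x00" * 4096)
new_size = truncate_trailing_zeros("demo.bin")
print(new_size)  # -> 11
```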