Member since: 07-01-2015
Posts: 460
Kudos Received: 78
Solutions: 43
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1346 | 11-26-2019 11:47 PM |
| | 1304 | 11-25-2019 11:44 AM |
| | 9474 | 08-07-2019 12:48 AM |
| | 2178 | 04-17-2019 03:09 AM |
| | 3487 | 02-18-2019 12:23 AM |
01-29-2018
07:29 AM
Hi, I could not find any documentation describing how Spark tasks are assigned to executors when data is read from Kudu into DataFrames. I noticed that in some cases (I did not have enough time to test thoroughly) Spark reads data ONLY from the leaders of the tablets, so data is moved across the network. Is there any setting or configuration to co-locate a Spark task in an executor with a Kudu tablet? Based on the Kudu documentation, the LEADER is for writes, but the FOLLOWERs can serve reads too. Thanks
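For illustration, a minimal kudu-spark read sketch. The master address and table name are placeholders, and the kudu.scanLocality option is an assumption that depends on the kudu-spark version in use, not a setting confirmed in this thread:

import org.apache.spark.sql.SparkSession

// Minimal sketch; "kudu-master:7051" and "impala::work.sales" are placeholders.
val spark = SparkSession.builder().appName("kudu-locality").getOrCreate()

val df = spark.read
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "kudu-master:7051")
  .option("kudu.table", "impala::work.sales")
  // Assumed option: "closest_replica" lets a task read from a follower replica on the
  // same host as the executor, while "leader_only" forces all reads through the leaders.
  .option("kudu.scanLocality", "closest_replica")
  .load()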
Labels:
- Apache Kudu
- Apache Spark
01-29-2018
01:52 AM
So the correct answer is: tables with range partitions defined via upper and lower boundaries cannot be extended. Tables with partitions defined as a single value can be extended.
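For context, a hedged sketch of what Impala's ADD RANGE PARTITION maps to in the Kudu client API (the master address is a placeholder, and the "impala::" table-name prefix assumes the table was created through Impala). Kudu rejects any new range that overlaps an existing partition, which is why, against the work.sales_by_year table from the question below, this call would raise the same NonRecoverableException: the existing unbounded partition 2016 <= VALUES already covers [2017, 2018).

import org.apache.kudu.client.{AlterTableOptions, KuduClient}

// Minimal sketch; "kudu-master:7051" is a placeholder.
val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
val table  = client.openTable("impala::work.sales_by_year")
val schema = table.getSchema

// Bounds for the range [2017, 2018), i.e. what Impala's PARTITION VALUE = 2017 means.
val lower = schema.newPartialRow()
lower.addInt("year", 2017)
val upper = schema.newPartialRow()
upper.addInt("year", 2018)

// Kudu throws NonRecoverableException if this range overlaps an existing partition,
// which is the error Impala surfaces in the question below.
client.alterTable("impala::work.sales_by_year",
  new AlterTableOptions().addRangePartition(lower, upper))
client.close()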
01-28-2018
05:02 AM
And also on p. 29, under "New Features in Kudu 0.10.0":
- Users may now manually manage the partitioning of a range-partitioned table. When a table is created, the user may specify a set of range partitions that do not cover the entire available key space. A user may add or drop range partitions to existing tables. This feature can be particularly helpful with time series workloads in which new partitions can be created on an hourly or daily basis. Old partitions may be efficiently dropped if the application does not need to retain historical data past a certain point.
01-28-2018
05:00 AM
It is confusing. Apache Kudu User Guide, p. 27, "Partitioning Limitations":
- Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. Range partitions may be added or dropped after a table has been created. See Schema Design for more information.
01-20-2018
09:56 AM
Hi, I have a simple table with range partitions defined by upper and lower bounds:

CREATE TABLE work.sales_by_year (
  year INT,
  sale_id INT,
  amount INT,
  PRIMARY KEY (sale_id, year)
)
PARTITION BY RANGE (year) (
  PARTITION VALUES < 2015,
  PARTITION 2015 <= VALUES < 2016,
  PARTITION 2016 <= VALUES
)
STORED AS KUDU;

So this table has three partitions:

+--------+-----------+----------+----------------+------------+
| # Rows | Start Key | Stop Key | Leader Replica | # Replicas |
+--------+-----------+----------+----------------+------------+
| -1     |           | 800007DF | host1:7050     | 3          |
| -1     | 800007DF  | 800007E0 | host2:7050     | 3          |
| -1     | 800007E0  |          | host3:7050     | 3          |
+--------+-----------+----------+----------------+------------+

Now I would like to end the last range with 2017 and have another interval for values >= 2017. I tried multiple syntaxes, but it does not work:

alter table work.sales_by_year add range partition 2016 <= VALUES < 2017;

Query: alter table work.sales_by_year add range partition 2016 <= VALUES < 2017
ERROR: ImpalaRuntimeException: Error adding range partition in table sales_by_year
CAUSED BY: NonRecoverableException: New range partition conflicts with existing range partition: 2016 <= VALUES < 2017

alter table work.sales_by_year add range partition VALUE = 2017;

Query: alter table work.sales_by_year add range partition VALUE = 2017
ERROR: ImpalaRuntimeException: Error adding range partition in table sales_by_year
CAUSED BY: NonRecoverableException: New range partition conflicts with existing range partition: 2017 <= VALUES < 2018

These error messages are misleading: if I run SHOW PARTITIONS, I still see the original three intervals, so there is no 2017 or 2018 partition. Any hints on how to extend the range partitions? Thanks
Labels:
- Apache Kudu
01-19-2018
08:00 AM
Hi, can somebody give a hint or guideline on how to maximize Kudu scan (read from a Kudu table) performance from Spark? I tried a simple DataFrame read, and I also tried to create multiple DataFrames, each with a different filter on one of the primary key columns, then union the DataFrames and write to HDFS. But it seems to me that each tablet server hands out the data via one scanner, so with 5 tablet servers there are 5 scanners and 5 tasks in 5 executors. Is it possible to trigger more scanners via Spark? Thanks
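For reference, a minimal kudu-spark read sketch (master address and table name are placeholders). In this connector each Kudu tablet is exposed as one Spark partition, so the scan parallelism is bounded by the number of tablets rather than by the number of executors; checking the partition count makes that bound visible:

import org.apache.spark.sql.SparkSession

// Minimal sketch; "kudu-master:7051" and "impala::work.sales" are placeholders.
val spark = SparkSession.builder().appName("kudu-scan-parallelism").getOrCreate()

val df = spark.read
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "kudu-master:7051")
  .option("kudu.table", "impala::work.sales")
  .load()

// One Kudu tablet maps to one Spark partition, i.e. one scanner and one task,
// so this number is the upper bound on concurrent scanners for the table.
println(df.rdd.getNumPartitions)

Under that assumption, the practical lever is the table's own partitioning, e.g. adding hash partitioning so each tablet server hosts several tablets and can therefore serve several scanners concurrently.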
Labels:
- Apache Kudu
- Apache Spark
01-10-2018
09:15 AM
I stopped CDH and did a Kerberos configuration redeploy. The /etc/krb5.conf is more or less the same; the only difference is the last line, "[domain_realm]", which was added by CM. After the redeploy, CDH started and now everything is green. Thanks, Tomas

[libdefaults]
default_realm = MYREALM.LOCAL
dns_lookup_kdc = false
dns_lookup_realm = false
ticket_lifetime = 86400
renew_lifetime = 604800
forwardable = true
default_tgs_enctypes = aes256-cts aes128-cts
default_tkt_enctypes = aes256-cts aes128-cts
permitted_enctypes = aes256-cts aes128-cts
udp_preference_limit = 1
kdc_timeout = 3000
[realms]
MYREALM.LOCAL = {
kdc = 10.197.16.197 10.197.16.88
admin_server = 10.197.16.197 10.197.16.88
}
[domain_realm]
01-05-2018
07:23 AM
Hi, after an upgrade from CM 5.11 to 5.13, Cloudera Manager complains with a red exclamation mark: "Cluster has stale Kerberos client configuration." The cluster was all green before the upgrade and had no problem with the Kerberos configs (/etc/krb5.conf). What is more concerning is that, after opening this warning, three (gateway) nodes do not require the update, but the rest of them do: "Consider stopping roles on these hosts to ensure that they are updated by this command: ip-10-197-13-169.eu-west-1.compute.internal; ip-10-197-15-82.eu-west-1.compute.internal; ip-10-197-18-[113, 248].eu-west-1.compute.internal..." But the command is not there. What should I do? Stop the whole CDH cluster and then rerun the deploy? Thanks for the advice, T.
Labels:
- Cloudera Manager
12-06-2017
10:17 PM
No. Then I don't know. Can you paste here all the commands you used to generate the keystore and keys?
12-06-2017
06:07 AM
I don't know the solution, but I think the Cloudera Manager agent, which runs under root, starts these processes and sets the correct permissions. I could imagine that the cloudera-scm-agent is not running under root, or that the permissions are set wrongly. I would check the exact process directory of the Hive Metastore. I would also verify in the process list that only one copy of the Hive Metastore is running (to rule out two processes running at the same time). Try to restart the cloudera-scm-agent.