Expert Contributor
Posts: 171
Registered: ‎07-01-2015
Accepted Solution

Kudu scan maximize throughput via Spark


Can somebody give a hint or guideline on how to maximize Kudu scan (read from a Kudu table) performance from Spark? I tried a simple DataFrame read. I also tried creating multiple DataFrames, each with a different filter on one of the primary key columns, then taking the union of the DataFrames and writing the result to HDFS. But it seems to me that each tablet server is handing out the data via a single scanner, so there are 5 tablet servers, 5 scanners, and 5 tasks in 5 executors.


Is it possible to trigger more scanners via Spark?




Cloudera Employee
Posts: 7
Registered: ‎09-28-2015

Re: Kudu scan maximize throughput via Spark

Hi Tomas,


The kudu-spark integration creates one Spark task per Kudu tablet, each with a single scanner. If you want to achieve more parallelism, you can add more tablets (partitions) to the Kudu table.
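
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in plain Python (it does not talk to Kudu or Spark). It assumes the table in question currently has 5 tablets, which is only a guess based on the 5 scanners you observed, and the helper names `tablet_count` and `scan_waves` are made up for illustration. It relies on two things: a Kudu table's tablet count is the product of its hash bucket counts times its number of range partitions, and each tablet maps to one scan task as described above.

```python
from math import ceil


def tablet_count(hash_buckets, range_partitions=1):
    """Tablets in a table: product of the hash bucket counts of every
    hash level, multiplied by the number of range partitions."""
    total = range_partitions
    for buckets in hash_buckets:
        total *= buckets
    return total


def scan_waves(tablets, executors, cores_per_executor=1):
    """One scan task per tablet; returns how many rounds of tasks are
    needed when every executor core runs one task at a time."""
    slots = executors * cores_per_executor
    return ceil(tablets / slots)


# Assumed current layout: a single hash level with 5 buckets.
current = tablet_count([5])
# Hypothetical new layout: the same 5 buckets plus 4 range partitions.
repartitioned = tablet_count([5], range_partitions=4)

print(current, repartitioned)
print(scan_waves(repartitioned, executors=5, cores_per_executor=4))
```

With the hypothetical 5 x 4 layout the table would have 20 tablets, so Spark could run up to 20 scan tasks at once, provided the executors have that many cores between them. Whether throughput actually scales that way depends on the tablet servers and disks, so treat the numbers as an upper bound on parallelism rather than a prediction.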