07-05-2018 09:04 AM
Hi, I am working on Kudu and Oracle. I have more than 5 million records and have been asked to read them from Oracle and write them into a Kudu table. What I did was open an ojdbc connection, fetch the records from Oracle, and insert them into the Kudu table using PartialRow and the insert method. I just want to know whether I can do bulk inserts to avoid spending so much time on writes.
07-05-2018 10:18 AM
If I do the writes as per the program given in https://github.com/cloudera/kudu-examples/tree/master/java/java-sample/src/main/java/org/kududb/exam...
it takes an hour to insert the data into the Kudu table.
How can I insert the records in less time?
07-05-2018 03:27 PM
One option is to export the Oracle table to Parquet on HDFS using Sqoop, then use Impala to CREATE TABLE ... AS SELECT from the Parquet table into your Kudu table.
Unfortunately, Sqoop does not support writing directly to Kudu at this time.
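For illustration, the Sqoop-to-Parquet-to-Kudu path might look something like the sketch below. All table names, the connection string, the key column, and the partitioning are placeholders, not anything from the thread; adjust them to your schema and cluster, and note this requires a running Hadoop/Impala/Kudu cluster.

```shell
# 1. Export the Oracle table to Parquet files on HDFS (8 parallel mappers).
#    Connection string, credentials, and table name are hypothetical.
sqoop import \
  --connect jdbc:oracle:thin:@//oracle-host:1521/ORCL \
  --username scott --password-file /user/scott/.pw \
  --table SRC_TABLE \
  --as-parquetfile \
  --target-dir /data/src_table_parquet \
  -m 8

# 2. In impala-shell, expose the Parquet files as a table, then copy
#    them into a new Kudu table with CREATE TABLE AS SELECT:
#
#    CREATE EXTERNAL TABLE src_parquet
#      LIKE PARQUET '/data/src_table_parquet/part-m-00000.parquet'
#      STORED AS PARQUET LOCATION '/data/src_table_parquet';
#
#    CREATE TABLE dst_kudu
#      PRIMARY KEY (id)
#      PARTITION BY HASH (id) PARTITIONS 16
#      STORED AS KUDU
#      AS SELECT * FROM src_parquet;
```

The advantage of this route is that Impala parallelizes the final write across the cluster instead of funneling everything through one JDBC client.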
07-10-2018 03:00 PM
Are you sure the bottleneck is Kudu? Maybe the bottleneck is reading from Oracle?
Using Kudu's AUTO_FLUSH_BACKGROUND flush mode should give pretty fast write throughput. See https://kudu.apache.org/apidocs/org/apache/kudu/client/SessionConfiguration.FlushMode.html
You can also try increasing the value passed to KuduSession.setMutationBufferSpace(), and consider your partitioning scheme.
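A minimal sketch of those session settings with the Kudu Java client, assuming a master at `kudu-master:7051` and a table `my_table` with an integer key `id` and a string column `val` (all of these names, the buffer size, and the schema are placeholder assumptions; this needs the kudu-client dependency and a live cluster to run):

```java
import org.apache.kudu.client.*;

public class BulkLoadSketch {
    public static void main(String[] args) throws KuduException {
        // Hypothetical master address and table name.
        KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            KuduTable table = client.openTable("my_table");
            KuduSession session = client.newSession();

            // Buffer mutations client-side and flush in the background,
            // instead of a round trip to the tablet server per apply().
            session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
            session.setMutationBufferSpace(10_000); // operations buffered before a flush

            for (int i = 0; i < 5_000_000; i++) {
                Insert insert = table.newInsert();
                PartialRow row = insert.getRow();   // placeholder schema below
                row.addInt("id", i);
                row.addString("val", "v" + i);
                session.apply(insert); // returns quickly in AUTO_FLUSH_BACKGROUND mode
            }

            session.flush();  // push any remaining buffered operations
            session.close();
        } finally {
            client.shutdown();
        }
    }
}
```

In AUTO_FLUSH_BACKGROUND mode, errors are reported asynchronously, so in real code you would also check session.getPendingErrors() after flushing.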
If you want more parallelism, you can also consider scanning different ranges in Oracle with different processes or threads, on the same or different client machines, and performing more parallelized writes to Kudu.
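One way to sketch that range-splitting with plain Java threads, assuming the Oracle table has a numeric key you can range over (the 5-million-row span and worker count are illustrative; the Oracle/Kudu calls each worker would make are shown only as comments):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoad {
    // Split [start, end) into at most `parts` contiguous, non-overlapping sub-ranges.
    static List<long[]> splitRange(long start, long end, int parts) {
        List<long[]> ranges = new ArrayList<>();
        long chunk = (end - start + parts - 1) / parts; // ceiling division
        for (long lo = start; lo < end; lo += chunk) {
            ranges.add(new long[] { lo, Math.min(lo + chunk, end) });
        }
        return ranges;
    }

    public static void main(String[] args) throws InterruptedException {
        List<long[]> ranges = splitRange(0, 5_000_000, 8);
        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        for (long[] r : ranges) {
            pool.submit(() -> {
                // Each worker would open its OWN JDBC connection and KuduSession here,
                // scan only its slice of the source table, e.g.
                //   SELECT ... FROM src WHERE id >= r[0] AND id < r[1]
                // and apply inserts through its own AUTO_FLUSH_BACKGROUND session.
                System.out.printf("worker handles range [%d, %d)%n", r[0], r[1]);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

Giving each worker its own session matters because KuduSession is not meant to be shared across threads.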