I was going through some of the Kudu Spark documentation for CDH 6.1.x located here:
Now, I would like to use the upsert ignoreNull option that is described there. The page also mentions that kudu-spark is available up to version 1.7, but as far as I can tell this option is only available starting from kudu-spark version 1.8. Compare the Kudu documentation on the Apache website for 1.7 versus 1.8:
I have also tried using both kudu-spark 1.7 and 1.8, and as expected the upsert ignoreNull option is only available from 1.8 onwards.
This leads me to my question: for CDH 6.1.x, what versions of kudu-spark are supported? Up to 1.7, or 1.8?
In either case, an update to the official Cloudera documentation might be in order, so that it consistently reflects the available functionality and/or supported versions.
Thank you for reporting the issue!
With CDH6.1.0, kudu-spark2_2.11-1.8.0-cdh6.1.0.jar is available:
However, applications can use kudu-spark2_2.11-1.7.0 against the Kudu server side of CDH6.1.0 (i.e. the older version of kudu-spark2_2.11 is 'supported' at least in this sense).
Yes, you are right: in the Apache Kudu git repo, the UPSERT ignoreNull option is available in Kudu 1.8.0 and onward. For CDH, the UPSERT ignoreNull option is available starting with kudu-spark2_2.11-1.8.0; it is not available in older versions (i.e. kudu-spark2_2.11-1.7.0 doesn't have it).
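For anyone landing on this thread who is unsure what the option changes, here is a minimal sketch of the difference between a plain upsert and an upsert with ignoreNull. It is a plain-Python model over a dict, not the kudu-spark API: the `upsert` function, the `ignore_null` parameter, and the sample table are hypothetical names made up for illustration. It only assumes the behaviour discussed above, namely that with ignoreNull a null value in the incoming row leaves the stored column untouched instead of overwriting it.

```python
from typing import Any, Dict, Optional

Row = Dict[str, Optional[Any]]


def upsert(table: Dict[Any, Row], key_col: str, row: Row,
           ignore_null: bool = False) -> None:
    """Model an upsert into a keyed table (a plain dict, not Kudu)."""
    key = row[key_col]
    existing = table.get(key)
    if existing is None:
        # Insert path: the key is new, so the row is stored as given.
        table[key] = dict(row)
        return
    for col, value in row.items():
        if ignore_null and value is None:
            # Null is treated as "column not set": keep the stored value.
            continue
        existing[col] = value


# Hypothetical sample data: one existing row and a partial update.
table = {1: {"id": 1, "name": "alice", "city": "Oslo"}}
update = {"id": 1, "name": None, "city": "Bergen"}

plain = {k: dict(v) for k, v in table.items()}
upsert(plain, "id", update)                     # null overwrites "name"

merged = {k: dict(v) for k, v in table.items()}
upsert(merged, "id", update, ignore_null=True)  # "name" is preserved

print(plain[1])
print(merged[1])
```

In this model the ignoreNull variant never has to read the existing row on the client side in order to keep the old `name`; whether that translates into a measurable gain on a real cluster is not something the sketch can show.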
I'll try to reach out to see whether the inconsistency you pointed out can be fixed in the CDH6.1.0 online documentation.
Thank you for the clarification, Alexey. Much appreciated.
Edit: come to think of it, do you know anything about the relative efficiency of an upsert with ignoreNull versus retrieving a DataFrame from the table, making my modifications in memory, and then upserting? Does Kudu/Spark do something similar under the hood, so that little performance gain is to be expected, or is an update with ignoreNull really a less "expensive" operation?