
Inconsistency in documentation/request for clarification of version compatibility

New Contributor

Hi,

 

I was going through some of the Kudu Spark documentation for CDH 6.1.x, located here:

https://www.cloudera.com/documentation/enterprise/6/6.1/topics/kudu_development.html

Now, I would like to use the upsert ignoreNull option that is described there. That page also mentions that kudu-spark is available up to version 1.7, but as far as I can tell this option is only available starting with kudu-spark 1.8. Compare the Kudu documentation on the Apache website for 1.8 versus 1.7:

https://kudu.apache.org/releases/1.8.0/docs/developing.html

https://kudu.apache.org/releases/1.7.1/docs/developing.html

 

I have also tried using both kudu-spark 1.7 and 1.8, and as expected the upsert ignoreNull option is only available from 1.8 onwards.

 

This leads me to my question: for CDH 6.1.x, what versions of kudu-spark are supported? Up to 1.7, or 1.8?

In either case, an update to the official Cloudera documentation might be in order, so that it consistently reflects the available functionality and/or supported versions.


Contributor

Hi,

 

Thank you for reporting the issue!

 

With CDH 6.1.0, kudu-spark2_2.11-1.8.0-cdh6.1.0.jar is available:

 https://archive.cloudera.com/cdh6/6.1.0/maven-repository/org/apache/kudu/kudu-spark2_2.11/1.8.0-cdh6...

 

However, applications can also use kudu-spark2_2.11-1.7.0 against the Kudu server side of CDH 6.1.0 (i.e. the older version of kudu-spark2_2.11 is 'supported' at least in this sense).

 

Yes, you are right: in the Apache Kudu git repo, the UPSERT ignoreNull option is available in Kudu 1.8.0 and onward. For CDH, the UPSERT ignoreNull option is available starting with kudu-spark2_2.11-1.8.0; it is not available in older versions (i.e. kudu-spark2_2.11-1.7.0 doesn't have it).
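
For readers unfamiliar with the option, here is a minimal sketch of what ignoreNull changes about an upsert. It is a plain-Python toy model, not the kudu-spark API or Kudu's actual implementation: the `upsert` function, the dict-based "table" and the column names are all hypothetical and only illustrate the assumed semantics (null-valued columns in the incoming row are skipped instead of overwriting the stored value).

```python
from typing import Any, Dict, Optional

Row = Dict[str, Optional[Any]]


def upsert(table: Dict[Any, Row], row: Row, key: str = "id",
           ignore_null: bool = False) -> None:
    """Toy model of an upsert keyed on `key` (hypothetical helper, not Kudu code)."""
    incoming = dict(row)
    if ignore_null:
        # With ignore_null, null-valued columns are dropped before writing,
        # so the stored values of those columns stay untouched.
        incoming = {col: val for col, val in incoming.items() if val is not None}
    existing = table.get(row[key], {})
    # Columns present in `incoming` overwrite the stored ones; the rest are kept.
    table[row[key]] = {**existing, **incoming}


# Hypothetical example data.
table = {1: {"id": 1, "name": "alice", "city": "Berlin"}}

# ignore_null=True: the null "name" is skipped, only "city" changes.
upsert(table, {"id": 1, "name": None, "city": "Paris"}, ignore_null=True)

# ignore_null=False (the default here): the null overwrites the stored "name".
upsert(table, {"id": 2, "name": "bob", "city": "Rome"})
upsert(table, {"id": 2, "name": None, "city": "Oslo"})
```

In this model, row 1 keeps its name while row 2 ends up with a null name. In kudu-spark 1.8 itself the behaviour is switched on through the write options described in the 1.8 documentation linked above.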

 

I'll reach out to see whether the inconsistency you pointed out can be fixed in the CDH 6.1.0 online documentation.

 

 

Thanks,

 

Alexey

New Contributor

Thank you for the clarification, Alexey. Much appreciated.

Edit: come to think of it, do you know anything about the relative efficiency of an upsert with ignoreNull versus retrieving a dataframe from the table, making my modifications in memory, and then upserting? Does Kudu/Spark do something similar under the hood, so that little performance gain is to be expected, or is an update with ignoreNull really a less "expensive" operation?

🙂
