Spark Kudu Integration issues for upsert rows


Hi Community, 


Spark DataFrames use null by default for values that are unknown, missing, or irrelevant. With that in mind: when we define a schema for a DataFrame to upsert into Kudu (note that a given DataFrame generally will not contain every column defined in the schema), I see weird behaviour in the Kudu table. Updating a Kudu table from Spark replaces the columns that are in the schema but were not supplied in the upsert with NULL. This occurs because the Spark DataFrame treats values missing from the schema as NULL. Is this a bug, or am I missing something? Any suggestions for working around it?





Re: Spark Kudu Integration issues for upsert rows

Cloudera Employee

Good question. I believe this is expected behavior. If the set of columns you'd like to leave untouched is the same for every row in the DataFrame, you could select/project away the unwanted columns before the upsert. Otherwise, you may need to implement your own logic, built on the KuduContext, that writes the rows while filtering out null values.
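To make the "own logic" suggestion a bit more concrete, here is a minimal sketch in plain Python (not Spark or Kudu code) of one way to do it: group the rows by the set of columns that actually hold values, so that each group can be written with only those columns. The function name `group_by_present_columns`, the sample rows, and the column names are made up for illustration. In Spark, each group would presumably correspond to a filtered DataFrame with the null columns projected away, upserted separately.

```python
from collections import defaultdict


def group_by_present_columns(rows, key_columns):
    """Group rows by the set of columns that hold non-null values.

    Each group can then be written with only those columns, so a column
    that is absent from a row is never sent to the table as NULL.
    (Hypothetical helper for illustration, not part of any library.)
    """
    groups = defaultdict(list)
    for row in rows:
        # Columns that actually carry a value in this row.
        present = tuple(sorted(c for c, v in row.items() if v is not None))
        # An upsert always needs the primary-key columns.
        missing_keys = [k for k in key_columns if k not in present]
        if missing_keys:
            raise ValueError(f"row is missing key columns: {missing_keys}")
        # Keep only the non-null fields of this row.
        groups[present].append({c: row[c] for c in present})
    return dict(groups)


# Made-up sample data: None stands in for a DataFrame null.
rows = [
    {"id": 1, "name": "a", "city": None},
    {"id": 2, "name": None, "city": "x"},
    {"id": 3, "name": "c", "city": None},
]
batches = group_by_present_columns(rows, key_columns=["id"])
```

Note the trade-off: this treats every null as "do not touch", so it cannot be used as-is when some rows need to set a column to NULL on purpose.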


Re: Spark Kudu Integration issues for upsert rows


We're also facing the same issue, and any pointers would be useful.


The issue is that, by default, a DataFrame assigns null values to non-existing fields. The problem is that there could be a valid use case where an upsert statement actually wants to update the value of a column to null, i.e. delete the value. So I think the issue is not with KuduContext but with DataFrame.
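The ambiguity can be shown with a small plain-Python sketch (again, not Spark or Kudu code). It assumes a hypothetical sentinel `MISSING` that means "field not supplied", kept distinct from `None`, which means "set this column to null". A DataFrame cell only has the single null representation, which is why the two cases collapse into one there. The helper `apply_upsert` and the sample rows are invented for illustration.

```python
MISSING = object()  # hypothetical sentinel: "field not supplied"


def apply_upsert(existing, update, columns):
    """Merge an update into an existing row, column by column.

    - key absent from the update (MISSING): keep the existing value
    - None: explicitly set the column to null
    """
    merged = dict(existing)
    for col in columns:
        value = update.get(col, MISSING)
        if value is not MISSING:
            merged[col] = value
    return merged


columns = ["id", "name", "city"]
existing = {"id": 1, "name": "a", "city": "x"}

# "city" is not supplied, so its existing value survives.
kept = apply_upsert(existing, {"id": 1, "name": "b"}, columns)

# "city" is supplied as None, so it is deliberately cleared.
cleared = apply_upsert(existing, {"id": 1, "city": None}, columns)
```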


I'm a Spark newbie; is there a way to control how the DataFrame is created?