By default, Spark DataFrames use null for values that are unknown, missing, or irrelevant. With that in mind, when we define a schema for a DataFrame to upsert into Kudu (note that, in general, the DataFrame will not contain every column defined in the schema), I observed weird behaviour in the Kudu table. Updating a Kudu table via Spark replaces the columns that are not supplied to the upsert (but are present in the schema) with NULL. This occurs because the Spark DataFrame treats the missing values in the schema as NULL. Is this a bug, or am I missing something here? Any suggestions for working around this?
Good question; I believe this is expected behavior. If the set of columns you'd like to leave untouched is the same for every row in the DataFrame, you could select/project away the unwanted columns. Otherwise, you may need to implement your own logic on top of KuduContext that filters out null values when writing each row.
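A minimal sketch of the per-row filtering idea, in plain Python rather than the actual KuduContext/Spark API (a real job would apply this per partition before calling the Kudu client): drop null-valued columns from each row, so the upsert only carries columns that should change and leaves stored values untouched.

```python
def build_upserts(rows):
    """Keep only non-null columns in each row.

    The idea: a Kudu upsert that omits a column leaves the stored value
    alone, whereas upserting an explicit NULL overwrites it. This is a
    plain-Python illustration, not the kudu-spark API.
    """
    return [
        {col: val for col, val in row.items() if val is not None}
        for row in rows
    ]


rows = [
    {"id": 1, "name": "alice", "score": None},  # score absent in source data
    {"id": 2, "name": None, "score": 42},       # name absent in source data
]
upserts = build_upserts(rows)
print(upserts)  # → [{'id': 1, 'name': 'alice'}, {'id': 2, 'score': 42}]
```

Note that this approach makes it impossible to deliberately set a column to NULL, which is exactly the ambiguity discussed below in the thread.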
We're also facing the same issue; any pointers would be useful.
The issue is that, by default, a DataFrame assigns null to non-existent fields. The problem is that there could be a valid use case where an upsert actually wants to set a column's value to null, i.e. delete the value. So I think the issue is not with KuduContext but with DataFrame.
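A small sketch of the ambiguity described above (plain Python, with a hypothetical three-column schema): once a record is aligned to a schema wider than the record itself, a field that was simply absent becomes indistinguishable from one that was explicitly set to null, which is essentially what Spark does when building a DataFrame against a fixed schema.

```python
# Hypothetical schema, wider than some input records.
SCHEMA = ["id", "name", "score"]


def to_schema_row(record):
    # Missing fields are filled with None, analogous to how a DataFrame
    # fills in nulls for fields not present in the input data.
    return {col: record.get(col) for col in SCHEMA}


partial = {"id": 1, "name": "alice"}                       # no "score" key at all
explicit_null = {"id": 1, "name": "alice", "score": None}  # score deliberately null

# After schema alignment, the two are indistinguishable, so a downstream
# upsert cannot tell "don't touch score" apart from "delete score":
print(to_schema_row(partial) == to_schema_row(explicit_null))  # → True
```

This is why filtering nulls before the upsert is only a workaround: it fixes the "absent column" case at the cost of the "delete the value" case.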
I'm a Spark newbie; is there a way to control how the DataFrame is created?