How to apply UDF "rowwise" using pyspark?



I have clustered my data with a k-means model and want to add the result for each row to my DataFrame. Does anyone have an idea how I can apply a user-defined function row-wise to a DataFrame?

Here's my UDF:

def get_cluster_center(latitude, longitude, cluster_model):

    print "Latitude: ", latitude
    print "Longitude: ", longitude

    point = [latitude, longitude]
    cluster_center = cluster_model.predict(point)
    print "Cluster center: ", cluster_center
    column_value = struct(lit("cluster_center"), lit(cluster_center[0]), lit(cluster_center[1]))
    print "Type of column_value: ", column_value
    return column_value

I tried to do something like this:

sensordata_df = sensordata_df.withColumn('cluster_center', get_cluster_center(sensordata_df["location.latitude"], sensordata_df["location.longitude"], cluster_model))
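
One thing that may explain the behaviour: the call above hands Column objects to a plain Python function, so the function runs once on the driver instead of once per row. A possible way around this is to wrap the per-row logic in a Spark UDF with an explicit return type. The following is only a minimal sketch under some assumptions: the cluster centers are available as a plain list of [latitude, longitude] pairs, and the nearest center is chosen by Euclidean distance. The names nearest_center, add_cluster_center, centers, and the struct field names are hypothetical and not part of the original question.

```python
import math


def nearest_center(latitude, longitude, centers):
    """Return the (lat, lon) center closest to the point, or None for missing input.

    `centers` is assumed to be a plain list of [latitude, longitude] pairs.
    """
    if latitude is None or longitude is None:
        return None
    best = min(
        centers,
        key=lambda c: math.hypot(latitude - c[0], longitude - c[1]),
    )
    return (float(best[0]), float(best[1]))


def add_cluster_center(df, centers,
                       lat_col="location.latitude",
                       lon_col="location.longitude"):
    """Add a 'cluster_center' struct column computed row by row (hypothetical helper)."""
    # Imported here so nearest_center can be used without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DoubleType, StructField, StructType

    # Return type of the UDF: a struct with two nullable double fields.
    schema = StructType([
        StructField("latitude", DoubleType(), True),
        StructField("longitude", DoubleType(), True),
    ])
    # The closure captures only the plain list of centers, not the model object.
    center_udf = udf(lambda lat, lon: nearest_center(lat, lon, centers), schema)
    return df.withColumn("cluster_center", center_udf(col(lat_col), col(lon_col)))
```

Assuming cluster_model exposes its centers (for a pyspark.mllib KMeansModel this would be the clusterCenters attribute, which is an assumption about the model type used here), the centers could be converted to plain floats first and then passed in, for example: centers = [[float(x) for x in c] for c in cluster_model.clusterCenters], followed by sensordata_df = add_cluster_center(sensordata_df, centers). Passing plain lists instead of the model keeps the data shipped to the executors simple to serialize.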