08-12-2019 11:33 AM
I have a PySpark DataFrame with 87 columns. I want to pass each row of the DataFrame to a function and get back a list for each row, so that I can build a new column from it.

PySpark UDF:

```
def make_range_vector(row, categories, ledger):
    print(type(row), type(categories), type(ledger))
    category_vector = []
    for category in categories:
        if row[category] != 0:
            category_percentage = func.round(row[category] * 100 / row[ledger])
            category_vector.append(category_percentage)
        else:
            category_vector.append(0)
    category_vector = sqlCtx.createDataFrame(category_vector, IntegerType())
    return category_vector
```

Main function:

```
pivot_card.withColumn(
    'category_debit_vector',
    make_range_vector(
        struct([pivot_card[x] for x in pivot_card.columns]),
        pivot_card.columns[3:],
        'debit'))
```

I am a beginner in PySpark, and I can't find answers to the questions below.

1. The print statement outputs `<class 'pyspark.sql.column.Column'> <class 'list'> <class 'str'>`. Shouldn't it be StructType?
2. Can I pass a Row object and do something similar, like we do in Pandas?
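For comparison, here is a minimal sketch (not from the original post) of how the per-row logic could be written as a plain Python function. Inside a registered UDF the argument is an ordinary `Row`, so plain Python arithmetic and `round` apply; `func.round` and `sqlCtx.createDataFrame` belong to the driver-side DataFrame API and would not be used inside the function. The column names in the example call are made up for illustration.

```python
# Plain-Python version of the per-row logic. Once wrapped in a UDF,
# `row` is a pyspark.sql.Row, which supports row[name] lookup just
# like the dict used in the demo call below.
def make_range_vector(row, categories, ledger):
    vector = []
    for category in categories:
        value = row[category]
        if value != 0:
            # nearest whole percentage of the ledger column
            vector.append(int(round(value * 100 / row[ledger])))
        else:
            vector.append(0)
    return vector

# To apply it per row in Spark, one would register it with something like
#   udf(lambda r: make_range_vector(r, cats, 'debit'), ArrayType(IntegerType()))
# and call that on struct(*pivot_card.columns); shown here only as a comment
# so the sketch stays runnable without a Spark session.

print(make_range_vector({'groceries': 50, 'fuel': 0, 'debit': 200},
                        ['groceries', 'fuel'], 'debit'))  # -> [25, 0]
```

The key point the sketch illustrates: the function itself stays pure Python and returns a list; only the UDF registration tells Spark the return type (`ArrayType(IntegerType())`), which is what lets `withColumn` accept the result.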