>>> rdd = df.rdd.map(list).map(lambda lista: lista.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 84, in rdd
    jrdd = self._jdf.javaToPython()
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1), stackTrace)
pyspark.sql.utils.AnalysisException: u'Queries with streaming sources must be executed with writeStream.start();;\nkafka'

A streaming DataFrame in Spark's Structured Streaming can't be converted to an RDD with df.rdd: the call is rejected with the AnalysisException above, because queries with streaming sources must be executed with writeStream.start(). So instead of going through an RDD, I manipulated the DataFrame directly to fit my needs. I used the split function from the pyspark.sql.functions module to split the contents of the DataFrame's column (a string containing the independent variables for my ML model) into several new columns, and then I used the VectorAssembler class from pyspark.ml.feature to merge the new columns into a single vector column.