Created on 06-15-2018 03:08 AM - edited 09-16-2022 06:21 AM
Hi, are there any tricks to reading a CSV into a DataFrame and defining one of the columns as an array? Here is my CSV file:
1|agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue
2|agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount
All I want to do is transform this into a dataframe which would look something like:
Col1  Col2
1     [agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2     [agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]
I'm able to define the schema with an array column, but when I call show() I get a big long error. Here's the PySpark code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

data_schema = [StructField('id', IntegerType(), False),
               StructField('route', ArrayType(StringType()), False)]
final_struc = StructType(fields=data_schema)
spark = SparkSession.builder.appName('Alex').getOrCreate()
df = spark.read.option("delimiter", "|").csv('output2.csv', schema=final_struc)
df.show()
Traceback (most recent call last):
File "/Users/awitte/Documents/GitHub/cmx-hadoop-pipe/sparkProcess.py", line 20, in <module>
df.show()
File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/dataframe.py", line 336, in show
print(self._jdf.showString(n, 20))
File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.showString.
: java.lang.UnsupportedOperationException: CSV data source does not support array<string> data type.
Any thoughts?
Created 06-15-2018 10:56 AM
According to your question, you want to transform it to the format below:
Col1  Col2
1     [agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2     [agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]
I have changed your code a little and was able to achieve it. Please check this code and the PySpark execution output:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import udf

# Read the route column as a plain string; the CSV source cannot
# load array<string> directly.
data_schema = [StructField('id', IntegerType(), False),
               StructField('route', StringType(), False)]
final_struc = StructType(fields=data_schema)
df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt', schema=final_struc)
df.show()

# UDF that wraps the comma-separated string in brackets so it
# displays like an array (the column type is still a string).
def str_to_arr(my_list):
    my_list = my_list.split(",")
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

str_to_arr_udf = udf(str_to_arr, StringType())
df = df.withColumn('route_arr', str_to_arr_udf(df["route"]))
df = df.drop("route")
df.show()
>>> from pyspark.sql.types import *
>>> data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
>>> final_struc = StructType(fields=data_schema)
>>> df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
>>> df.show()
+---+--------------------+
| id|               route|
+---+--------------------+
|  1|agakhanpark,scien...|
|  2|agakhanpark,wynfo...|
+---+--------------------+

>>> from pyspark.sql.functions import udf
>>> def str_to_arr(my_list):
...     my_list = my_list.split(",")
...     return '[' + ','.join([str(elem) for elem in my_list]) + ']'
...
>>> str_to_arr_udf = udf(str_to_arr,StringType())
>>> df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
>>> df = df.drop("route")
>>> df.show()
+---+--------------------+
| id|           route_arr|
+---+--------------------+
|  1|[agakhanpark,scie...|
|  2|[agakhanpark,wynf...|
+---+--------------------+
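A side note: str_to_arr returns a bracketed string, so route_arr is still a string column rather than a true array. If you actually need an array<string>, a minimal sketch using Spark's built-in split function (assuming the same pipe-delimited input and Spark 2.x) would look like:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName('Alex').getOrCreate()

# Read route as a plain string first, since the CSV reader rejects array types.
schema = StructType([StructField('id', IntegerType(), False),
                     StructField('route', StringType(), False)])
df = spark.read.option("delimiter", "|").csv('output2.csv', schema=schema)

# split() converts the comma-separated string into a real array<string> column.
df = df.withColumn('route_arr', split(df['route'], ',')).drop('route')
df.printSchema()   # route_arr shows as array<string>
df.show()

This avoids the Python UDF entirely, which also keeps the transformation inside Spark's built-in functions.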
Please "Accept" the answer if this helps.
-Aditya