Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Pyspark can't show() a CSV with an array


Hi, are there any tricks to reading a CSV into a DataFrame and defining one of the columns as an array? Here is my CSV file:

1|agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue
2|agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount

All I want to do is transform this into a DataFrame that would look something like:

Col1	Col2
1 	[agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2 	[agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]

I'm able to define the schema with an array column, but when I go to show() I get a big long error. Here's the PySpark code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

data_schema = [StructField('id', IntegerType(), False),
               StructField('route', ArrayType(StringType()), False)]
final_struc = StructType(fields=data_schema)
spark = SparkSession.builder.appName('Alex').getOrCreate()
df = spark.read.option("delimiter", "|").csv('output2.csv', schema=final_struc)
df.show()

Traceback (most recent call last):
  File "/Users/awitte/Documents/GitHub/cmx-hadoop-pipe/sparkProcess.py", line 20, in <module>
    df.show()
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/dataframe.py", line 336, in show
    print(self._jdf.showString(n, 20))
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.showString.
: java.lang.UnsupportedOperationException: CSV data source does not support array<string> data type.

Any thoughts?

1 ACCEPTED SOLUTION

Super Guru

@Alex Witte,

According to your question, you want to transform it to the format below:

Col1   	Col2
1	[agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2	[agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]

I have changed your code a little bit and was able to achieve it. Please check this code and the PySpark execution output:

from pyspark.sql.types import *
data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
final_struc = StructType(fields=data_schema)
df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
df.show()
from pyspark.sql.functions import udf
def str_to_arr(my_list):
    my_list = my_list.split(",")
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
str_to_arr_udf = udf(str_to_arr,StringType())
df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
df = df.drop("route")
df.show()
>>> from pyspark.sql.types import *
>>> data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
>>> final_struc = StructType(fields=data_schema)
>>> df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
>>> df.show()
+---+--------------------+
| id|               route|
+---+--------------------+
|  1|agakhanpark,scien...|
|  2|agakhanpark,wynfo...|
+---+--------------------+
>>>
>>>
>>> from pyspark.sql.functions import udf
>>> def str_to_arr(my_list):
...     my_list = my_list.split(",")
...     return '[' + ','.join([str(elem) for elem in my_list]) + ']'
...
>>> str_to_arr_udf = udf(str_to_arr,StringType())
>>> df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
>>> df = df.drop("route")
>>> df.show()
+---+--------------------+
| id|           route_arr|
+---+--------------------+
|  1|[agakhanpark,scie...|
|  2|[agakhanpark,wynf...|
+---+--------------------+


Please "Accept" the answer if this helps.


-Aditya

