Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Pyspark can't show() a CSV with an array


Hi, are there any tricks to reading a CSV into a DataFrame and defining one of the columns as an array? Here is my CSV file:

1|agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue
2|agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount

All I want to do is transform this into a DataFrame that would look something like:

Col1	Col2
1 	[agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2 	[agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]

I'm able to define the schema with an array column, but when I go to show() I get a big long error. Here's the PySpark code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

data_schema = [StructField('id', IntegerType(), False),
               StructField('route', ArrayType(StringType()), False)]
final_struc = StructType(fields=data_schema)
spark = SparkSession.builder.appName('Alex').getOrCreate()
df = spark.read.option("delimiter", "|").csv('output2.csv', schema=final_struc)
df.show()

Traceback (most recent call last):
  File "/Users/awitte/Documents/GitHub/cmx-hadoop-pipe/sparkProcess.py", line 20, in <module>
    df.show()
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/dataframe.py", line 336, in show
    print(self._jdf.showString(n, 20))
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.showString.
: java.lang.UnsupportedOperationException: CSV data source does not support array<string> data type.

Any thoughts?

1 ACCEPTED SOLUTION

Super Guru

@Alex Witte,

According to your question, you want to transform it to the format below:

Col1   	Col2
1	[agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2	[agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]

I have changed your code a little bit and was able to achieve it. Please check this code and the PySpark execution output:

from pyspark.sql.types import *
data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
final_struc = StructType(fields=data_schema)
df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
df.show()
from pyspark.sql.functions import udf
def str_to_arr(my_list):
    my_list = my_list.split(",")
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
str_to_arr_udf = udf(str_to_arr,StringType())
df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
df = df.drop("route")
df.show()
>>> from pyspark.sql.types import *
>>> data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
>>> final_struc = StructType(fields=data_schema)
>>> df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
>>> df.show()
+---+--------------------+
| id|               route|
+---+--------------------+
|  1|agakhanpark,scien...|
|  2|agakhanpark,wynfo...|
+---+--------------------+
>>>
>>>
>>> from pyspark.sql.functions import udf
>>> def str_to_arr(my_list):
...     my_list = my_list.split(",")
...     return '[' + ','.join([str(elem) for elem in my_list]) + ']'
...
>>> str_to_arr_udf = udf(str_to_arr,StringType())
>>> df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
>>> df = df.drop("route")
>>> df.show()
+---+--------------------+
| id|           route_arr|
+---+--------------------+
|  1|[agakhanpark,scie...|
|  2|[agakhanpark,wynf...|
+---+--------------------+


Please "Accept" the answer if this helps.


-Aditya

