<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Pyspark can't show() a CSV with an array in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pyspark-can-t-show-a-CSV-with-an-array/m-p/229619#M79594</link>
    <description>&lt;P&gt; &lt;A rel="user" href="https://community.cloudera.com/users/81308/alexanderwitte.html" nodeid="81308"&gt;@Alex Witte&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;As I understand your question, you want to transform the data into the format below:&lt;/P&gt;&lt;PRE&gt;Col1   	Col2
1	[agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2	[agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]&lt;/PRE&gt;&lt;P&gt;I changed your code a little and was able to achieve it. Here is the code, followed by the PySpark execution output:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.types import *
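# Read route as a plain string; the CSV source cannot load array columns directly.
data_schema = [StructField('id', IntegerType(), False), StructField('route', StringType(), False)]
final_struc = StructType(fields=data_schema)
# The file is pipe-delimited, so override the default comma delimiter.
df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt', schema=final_struc)
df.show()
from pyspark.sql.functions import udf
def str_to_arr(my_list):
    # Wrap the comma-separated route in brackets; the result is still a string.
    my_list = my_list.split(",")
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
str_to_arr_udf = udf(str_to_arr, StringType())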
df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
df = df.drop("route")
df.show()&lt;/PRE&gt;&lt;PRE&gt;&amp;gt;&amp;gt;&amp;gt; from pyspark.sql.types import *
&amp;gt;&amp;gt;&amp;gt; data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
&amp;gt;&amp;gt;&amp;gt; final_struc = StructType(fields=data_schema)
&amp;gt;&amp;gt;&amp;gt; df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
&amp;gt;&amp;gt;&amp;gt; df.show()
+---+--------------------+
| id|               route|
+---+--------------------+
|  1|agakhanpark,scien...|
|  2|agakhanpark,wynfo...|
+---+--------------------+
&amp;gt;&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt; from pyspark.sql.functions import udf
&amp;gt;&amp;gt;&amp;gt; def str_to_arr(my_list):
...     my_list = my_list.split(",")
...     return '[' + ','.join([str(elem) for elem in my_list]) + ']'
...
&amp;gt;&amp;gt;&amp;gt; str_to_arr_udf = udf(str_to_arr,StringType())
&amp;gt;&amp;gt;&amp;gt; df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
&amp;gt;&amp;gt;&amp;gt; df = df.drop("route")
&amp;gt;&amp;gt;&amp;gt; df.show()
+---+--------------------+
| id|           route_arr|
+---+--------------------+
|  1|[agakhanpark,scie...|
|  2|[agakhanpark,wynf...|
+---+--------------------+&lt;/PRE&gt;&lt;P&gt;Please "Accept" the answer if this helps.&lt;/P&gt;&lt;P&gt;-Aditya&lt;/P&gt;</description>
    <pubDate>Fri, 15 Jun 2018 17:56:27 GMT</pubDate>
    <dc:creator>asirna</dc:creator>
    <dc:date>2018-06-15T17:56:27Z</dc:date>
    <item>
      <title>Pyspark can't show() a CSV with an array</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pyspark-can-t-show-a-CSV-with-an-array/m-p/229618#M79593</link>
      <description>&lt;P&gt;Hi, are there any tricks to reading a CSV into a DataFrame and defining one of the columns as an array? Here is my CSV file:&lt;/P&gt;&lt;PRE&gt;1|agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue
2|agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount&lt;/PRE&gt;&lt;P&gt;All I want to do is transform this into a dataframe which would look something like:&lt;/P&gt;&lt;PRE&gt;Col1	Col2
1 	[agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2 	[agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]&lt;/PRE&gt;&lt;P&gt;I'm able to define the column as an array, but when I call show() I get a long error. Here's the PySpark code:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType
data_schema = [StructField('id', IntegerType(), False),StructField('route', ArrayType(StringType()),False)]
final_struc = StructType(fields=data_schema)
spark = SparkSession.builder.appName('Alex').getOrCreate()
df = spark.read.option("delimiter", "|").csv('output2.csv',schema=final_struc)
df.show()&lt;/PRE&gt;&lt;P&gt;&lt;EM&gt;Traceback (most recent call last):&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;File "/Users/awitte/Documents/GitHub/cmx-hadoop-pipe/sparkProcess.py", line 20, in &amp;lt;module&amp;gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;df.show()&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/dataframe.py", line 336, in show&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;print(self._jdf.showString(n, 20))&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/utils.py", line 63, in deco&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;return f(*a, **kw)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;py4j.protocol.Py4JJavaError: An error occurred while calling o35.showString.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;: java.lang.UnsupportedOperationException: &lt;STRONG&gt;CSV data source does not support array&amp;lt;string&amp;gt; data type.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Any thoughts?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 13:21:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pyspark-can-t-show-a-CSV-with-an-array/m-p/229618#M79593</guid>
      <dc:creator>alexander_witte</dc:creator>
      <dc:date>2022-09-16T13:21:02Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark can't show() a CSV with an array</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pyspark-can-t-show-a-CSV-with-an-array/m-p/229619#M79594</link>
      <description>&lt;P&gt; &lt;A rel="user" href="https://community.cloudera.com/users/81308/alexanderwitte.html" nodeid="81308"&gt;@Alex Witte&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;As I understand your question, you want to transform the data into the format below:&lt;/P&gt;&lt;PRE&gt;Col1   	Col2
1	[agakhanpark,science centre,sunnybrookpark,laird,leaside,mountpleasant,avenue]
2	[agakhanpark,wynford,sloane,oconnor,pharmacy,hakimilebovic,goldenmile,birchmount]&lt;/PRE&gt;&lt;P&gt;I changed your code a little and was able to achieve it. Here is the code, followed by the PySpark execution output:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.types import *
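# Read route as a plain string; the CSV source cannot load array columns directly.
data_schema = [StructField('id', IntegerType(), False), StructField('route', StringType(), False)]
final_struc = StructType(fields=data_schema)
# The file is pipe-delimited, so override the default comma delimiter.
df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt', schema=final_struc)
df.show()
from pyspark.sql.functions import udf
def str_to_arr(my_list):
    # Wrap the comma-separated route in brackets; the result is still a string.
    my_list = my_list.split(",")
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
str_to_arr_udf = udf(str_to_arr, StringType())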
df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
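# Optional sketch, not part of the original answer: route_arr above is a bracketed string, not an array.
# Assuming a real ArrayType(StringType()) column is wanted, the built-in split() returns one.
# df_arr and route_list are hypothetical names; df itself is left unchanged.
from pyspark.sql.functions import split
df_arr = df.withColumn('route_list', split(df['route'], ','))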
df = df.drop("route")
df.show()&lt;/PRE&gt;&lt;PRE&gt;&amp;gt;&amp;gt;&amp;gt; from pyspark.sql.types import *
&amp;gt;&amp;gt;&amp;gt; data_schema = [StructField('id', IntegerType(), False),StructField('route', StringType(),False)]
&amp;gt;&amp;gt;&amp;gt; final_struc = StructType(fields=data_schema)
&amp;gt;&amp;gt;&amp;gt; df = sqlContext.read.option("delimiter", "|").csv('/user/hrt_qa/a.txt',schema=final_struc)
&amp;gt;&amp;gt;&amp;gt; df.show()
+---+--------------------+
| id|               route|
+---+--------------------+
|  1|agakhanpark,scien...|
|  2|agakhanpark,wynfo...|
+---+--------------------+
&amp;gt;&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt; from pyspark.sql.functions import udf
&amp;gt;&amp;gt;&amp;gt; def str_to_arr(my_list):
...     my_list = my_list.split(",")
...     return '[' + ','.join([str(elem) for elem in my_list]) + ']'
...
&amp;gt;&amp;gt;&amp;gt; str_to_arr_udf = udf(str_to_arr,StringType())
&amp;gt;&amp;gt;&amp;gt; df = df.withColumn('route_arr',str_to_arr_udf(df["route"]))
&amp;gt;&amp;gt;&amp;gt; df = df.drop("route")
&amp;gt;&amp;gt;&amp;gt; df.show()
+---+--------------------+
| id|           route_arr|
+---+--------------------+
|  1|[agakhanpark,scie...|
|  2|[agakhanpark,wynf...|
+---+--------------------+&lt;/PRE&gt;&lt;P&gt;Please "Accept" the answer if this helps.&lt;/P&gt;&lt;P&gt;-Aditya&lt;/P&gt;</description>
      <pubDate>Fri, 15 Jun 2018 17:56:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Pyspark-can-t-show-a-CSV-with-an-array/m-p/229619#M79594</guid>
      <dc:creator>asirna</dc:creator>
      <dc:date>2018-06-15T17:56:27Z</dc:date>
    </item>
  </channel>
</rss>

