Support Questions
Find answers, ask questions, and share your expertise

pyspark dropDuplicates java.lang.UnsupportedOperationException: org.apache.parquet…PlainValuesDictionary$PlainBinaryDictionary

Expert Contributor

In pyspark, I am successfully reading a folder of parquet files from HDFS and can .show() and .count() the dataframe, but I am getting this error

20/01/31 18:15:42 WARN TaskSetManager: Lost task 8.0 in stage 4.0 (TID 19, hw005.ucera.local, executor 2): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary

        at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52)

        at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)

when trying to dropDuplicates():

df = sparkSession.read \
    .option("header", "true") \
    .parquet("hdfs:///some/path/to/files/*.parquet")
df.show(n=3)
had = df.count()
print("\ndataframe had %s rows\n" % had)
df = df.dropDuplicates()
print("\ndataframe now has %s rows after dedup job\n" % df.count())

It may be useful to note that I am using the following configs in the spark-submit command:

--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
--conf "spark.sql.autoBroadcastJoinThreshold=-1"

I have seen articles saying this error occurs when the schemas of the parquet files in the loaded folder do not match, but here I am able to read the data and do some operations (.show(), .count()) before hitting the error.
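For reference, here is one way I could check the schema-mismatch theory (the decodeToLong call on a PlainBinaryDictionary in the stack trace does suggest that at least one file stores a column as binary/string where Spark expects a long). This is just a sketch: it assumes I collect each file's schema string separately, e.g. via spark.read.parquet(path).schema.simpleString(), and then group the files by that string so any odd one out stands out; the helper name and the sample schema strings are hypothetical.

```python
from collections import defaultdict

def group_by_schema(file_schemas):
    """Group parquet file paths by their schema string so mismatched
    files stand out. `file_schemas` maps path -> schema string,
    e.g. the result of spark.read.parquet(path).schema.simpleString().
    """
    groups = defaultdict(list)
    for path, schema in file_schemas.items():
        groups[schema].append(path)
    return dict(groups)

# Hypothetical example: two files agree, one declares column `b`
# as string instead of bigint -- the kind of mismatch that would
# make decodeToLong fail on a binary dictionary.
schemas = {
    "part-0001.parquet": "struct<a:string,b:bigint>",
    "part-0002.parquet": "struct<a:string,b:bigint>",
    "part-0003.parquet": "struct<a:string,b:string>",  # mismatch
}
for schema, paths in group_by_schema(schemas).items():
    print(schema, "->", paths)
```

If more than one group comes back, the files in the minority group would be the ones to inspect.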

Does anyone with more pyspark experience know what could be happening here and how to fix it? Any debugging advice would be appreciated (I'm very new to dealing with the underlying Spark internals).