
pyspark dropDuplicates java.lang.UnsupportedOperationException: org.apache.parquet…PlainValuesDictionary$PlainBinaryDictionary


In pyspark, I am successfully reading a folder of parquet files from HDFS and am able to .show() and .count() the resulting dataframe, but I get the following error

20/01/31 18:15:42 WARN TaskSetManager: Lost task 8.0 in stage 4.0 (TID 19, hw005.ucera.local, executor 2): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary

        at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:52)

        at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)

when trying to dropDuplicates():

df = sparkSession.read \
    .option("header", "true") \
    .parquet("hdfs:///some/path/to/files/*.parquet")
df.show(n=3)
had = df.count()
print("\ndataframe had %s rows\n" % had)

df = df.dropDuplicates()
print("\ndataframe now has %s rows after dedup job\n" % df.count())

 

It may be useful to note that I am using the following configs in the spark-submit command:

--conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
--conf "spark.sql.autoBroadcastJoinThreshold=-1"
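
(If it is clearer in code, I believe this is roughly equivalent to setting the same values on the SparkSession builder, sketched below; in my actual job the values come from the --conf flags above, and the appName is just a placeholder.)

from pyspark.sql import SparkSession

# Rough equivalent of the --conf flags above, set on the builder instead.
# Note: in client mode the driver JVM is already running by the time this
# executes, so spark.driver.extraJavaOptions really has to be passed via
# spark-submit; this is only a sketch of the same settings.
sparkSession = (
    SparkSession.builder
    .appName("dedup-job")  # placeholder name
    .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)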

 

I have seen articles about this error occurring when the schemas of the parquet files in the loaded folder do not match, but here I am able to read the data and run some operations before hitting the error.
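
In case it helps narrow this down, this is how I was planning to check whether the individual files actually disagree on a column type (e.g. one file storing a column as string/binary while another stores it as long). The folder path is the same placeholder as above, and it goes through PySpark's internal _jvm / _jsc handles to reach the Hadoop FileSystem API:

from pyspark.sql import SparkSession

# Reuse the existing session (the same one used for the read above).
sparkSession = SparkSession.builder.getOrCreate()

folder = "hdfs:///some/path/to/files"  # same placeholder folder as above

# Go through the JVM gateway to the Hadoop FileSystem API to list the files.
jvm = sparkSession.sparkContext._jvm
conf = sparkSession.sparkContext._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path
fs = Path(folder).getFileSystem(conf)

# Read each parquet file on its own and collect its schema.
schemas = {}
for status in fs.listStatus(Path(folder)):
    p = status.getPath().toString()
    if p.endswith(".parquet"):
        schemas[p] = sparkSession.read.parquet(p).schema

# Report any file whose schema differs from the first file's schema.
reference = next(iter(schemas.values()))
for p, s in schemas.items():
    if s != reference:
        print("schema mismatch in %s:\n%s" % (p, s.simpleString()))

If any file prints a mismatch, that would at least confirm the schema-mismatch theory before I start casting columns explicitly.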

Does anyone with more pyspark experience know what could be happening here and how to fix it? Any debugging advice would be appreciated (I am very new to dealing with the underlying spark internals).
