Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Spark Conflicting partition schema parquet files

SOLVED Go to solution
Highlighted

Spark Conflicting partition schema parquet files

Contributor

Hi,

 

I am using Spark 1.3.1 and my data is stored in parquet format, the parquet files have been created by Impala.

 

after i added a parition "server" to my partition schema  (it was year,month,day and now is year,month,day,server ) and now Spark is having trouwble reading the data. 

 

I get the following error: 

 

java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
ArrayBuffer(year, month, day)
ArrayBuffer(year, month, day, server)

 

 

Does spark keep some data in cache/temp dirs with the old schema? which is causing a mismatch?

Any ideas on howto fix his issue?

 

directory layout sample:

 

drwxr-xr-x - impala hive 0 2015-05-19 14:02 /user/hive/queries/year=2015/month=05/day=17
drwxr-xr-x - impala hive 0 2015-05-19 14:02 /user/hive/queries/year=2015/month=05/day=17/server=ns1
drwxr-xr-x - impala hive 0 2015-05-19 14:02 /user/hive/queries/year=2015/month=05/day=18
drwxr-xr-x - impala hive 0 2015-05-19 14:02 /user/hive/queries/year=2015/month=05/day=18/server=ns1
drwxr-xr-x - impala hive 0 2015-05-20 09:01 /user/hive/queries/year=2015/month=05/day=19
drwxr-xr-x - impala hive 0 2015-05-20 09:01 /user/hive/queries/year=2015/month=05/day=19/server=ns1

 

 

complete stacktrace:

 

java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
ArrayBuffer(year, month, day)
ArrayBuffer(year, month, day, server)
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.parquet.ParquetRelation2$.resolvePartitions(newParquet.scala:933)
at org.apache.spark.sql.parquet.ParquetRelation2$.parsePartitions(newParquet.scala:851)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:311)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$7.apply(newParquet.scala:303)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:303)
at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:391)
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:540)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
at $iwC$$iwC$$iwC.<init>(<console>:32)
at $iwC$$iwC.<init>(<console>:34)
at $iwC.<init>(<console>:36)
at <init>(<console>:38)
at .<init>(<console>:42)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Spark Conflicting partition schema parquet files

Contributor

Found the problem.

 

There were some "old style" parquet files in a hidden directory named .impala_insert_staging

After removing these directories Spark could load the data.

Impala will recreate the table when i do a new insert into the table. why there were some parquet files left in that dir is not clear to me. it was some pretty old data, so maybe something went wrong during an insert a while ago.

 

 

2 REPLIES 2

Re: Spark Conflicting partition schema parquet files

Contributor

Found the problem.

 

There were some "old style" parquet files in a hidden directory named .impala_insert_staging

After removing these directories Spark could load the data.

Impala will recreate the table when i do a new insert into the table. why there were some parquet files left in that dir is not clear to me. it was some pretty old data, so maybe something went wrong during an insert a while ago.

 

 

Re: Spark Conflicting partition schema parquet files

Community Manager

Thanks for sharing your solution! :smileyhappy:



Cy Jervis, Community Manager

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Learn more about the Cloudera Community:
Community Guidelines
How to use the forum