Hive / SparkSQL: '.parquet not a SequenceFile'

New Contributor

Using the sandbox I have saved a parquet file as a table with:

df.write.format('parquet').mode('overwrite').saveAsTable(myfile)

followed by:

sqlContext.refreshTable(myfile)

When I attempt to query the table with SparkSQL or Hive, I get the error:

{"message":"H170 Unable to fetch results. java.io.IOException: java.io.IOException: hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/myfile/part-r-00000-5dc24bf0-23ef-4f3c-a1fc-42928761592d.gz.parquet not a SequenceFile [ERROR_STATUS]","status":500,"trace":"org.apache.ambari.view.hive.client.HiveErrorStatusException: H170 Unable to fetch results. java.io.IOException: java.io.IOException:

....

The issue started after I replaced the Parquet file underlying the original df and attempted to rebuild the table.

When I run df.head(10) I can see the dataframe.

I have tried manually deleting the Parquet files and the Hive files under the warehouse. Even after they are deleted, the issue recurs when I re-save the table.

I have set sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

I have tried os.environ["HADOOP_USER_NAME"] = 'hdfs'

I have tried unpersisting the dataframe.

I have tried changing the permissions with os.system('hdfs dfs -chmod -R 777 /apps/hive/warehouse')

I can't seem to clear this issue. I have seen resolutions involving the steps above, but none have helped me, and I still can't access the data via Hive or SparkSQL.
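The "not a SequenceFile" message suggests that Hive may be reading the table with a SequenceFile input format even though the files on disk are Parquet. One way to check is to look at the storage information the metastore holds for the table. The following is only a sketch: `find_input_format` and `inspect_table` are hypothetical helper names, it assumes `sqlContext` is a HiveContext, and the exact column layout of the `DESCRIBE FORMATTED` output varies between Spark versions, so the helper simply scans every field of every row.

```python
def find_input_format(describe_rows):
    """Return the text of the first row that mentions 'InputFormat', or None.

    `describe_rows` is any iterable of row-like sequences, for example the
    result of collect() on a DESCRIBE FORMATTED query.
    """
    for row in describe_rows:
        # Join all non-null fields so this works whether the output has
        # one column or several.
        text = " ".join(str(field) for field in row if field is not None)
        if "InputFormat" in text:
            # Normalize tabs and repeated spaces to single spaces.
            return " ".join(text.split())
    return None


def inspect_table(sql_context, table_name):
    """Hypothetical helper: report the input format Hive has recorded."""
    rows = sql_context.sql(
        "DESCRIBE FORMATTED {}".format(table_name)
    ).collect()
    return find_input_format(rows)

# Usage in the PySpark shell (assumes a HiveContext named sqlContext):
#   print(inspect_table(sqlContext, "myfile"))
```

If the reported input format is a SequenceFile class rather than a Parquet one, the metastore entry is probably stale, and dropping the table with `DROP TABLE IF EXISTS` before saving it again may be worth trying.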

1 ACCEPTED SOLUTION


This is a long shot, but I had some trouble with Parquet and Hive in the past, and switching to ORC fixed my problem. Recent Spark versions support ORC files, and Hive is optimized for ORC. Could you save your data as ORC and run your Spark SQL query again?

df.write.format("orc")...
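Spelled out, the suggestion could look like the sketch below. It reuses the `mode('overwrite')` and `saveAsTable` calls from the question and only swaps the format. `save_as_orc_table` is a hypothetical helper name, and the sketch assumes `df` was created from a HiveContext, which ORC support in Spark 1.x requires.

```python
def save_as_orc_table(df, table_name):
    """Overwrite the Hive table `table_name` with the contents of `df`,
    stored as ORC instead of Parquet."""
    (df.write
       .format("orc")        # ORC instead of 'parquet'
       .mode("overwrite")    # replace any existing table data
       .saveAsTable(table_name))

# Usage in the PySpark shell (assumes df and sqlContext already exist):
#   save_as_orc_table(df, "myfile_orc")
#   sqlContext.sql("SELECT * FROM myfile_orc LIMIT 10").show()
```

Writing to a new table name, as in the usage comment, keeps the ORC copy separate from the existing Parquet table while testing.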


5 REPLIES 5

Master Mentor

@Francis McGregor-Macdonald are you still having issues with this? Can you accept the best answer or provide your own solution?

New Contributor

@Artem Ervits yes, I am still having the issue, though I have moved on to other things. What is the correct response in that scenario?

Master Mentor

@Francis McGregor-Macdonald the correct response is to call in the big guns :). @vshukla @Ram Sriharsha


I think this may be related to https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAswR-5=az1SPxo8EaQvOs2JMh=V82z...

What is your spark-shell mode: yarn-cluster or yarn-client?