Created 01-27-2016 11:30 AM
Using the sandbox, I have saved a DataFrame as a Parquet table with:
df.write.format('parquet').mode('overwrite').saveAsTable(myfile)
followed by:
sqlContext.refreshTable(myfile)
When I attempt to query the table with Spark SQL or Hive, I get this error:
{"message":"H170 Unable to fetch results. java.io.IOException: java.io.IOException: hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/myfile/part-r-00000-5dc24bf0-23ef-4f3c-a1fc-42928761592d.gz.parquet not a SequenceFile [ERROR_STATUS]","status":500,"trace":"org.apache.ambari.view.hive.client.HiveErrorStatusException: H170 Unable to fetch results. java.io.IOException: java.io.IOException:
....
This issue started after I first replaced the Parquet file underlying the original DataFrame and attempted to rebuild the table.
When I run df.head(10) I can see the DataFrame contents.
I have tried manually deleting the Parquet files and the Hive files under the warehouse directory, but even after they are deleted, the issue recurs as soon as I resave the table.
I have also tried:
- setting sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
- setting os.environ["HADOOP_USER_NAME"] = 'hdfs'
- unpersisting the DataFrame
- changing permissions with os.system('hdfs dfs -chmod -R 777 /apps/hive/warehouse')
I can't seem to clear this issue. I have seen resolutions that use the steps above, but none has helped, and I still cannot get back to accessing the data via Hive or Spark SQL.
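For reference, the whole sequence I have been running looks roughly like this (a sketch only; sandbox paths, with 'myfile' standing in for the actual table name):

import os

# run the writes as the hdfs superuser to rule out permission problems
os.environ["HADOOP_USER_NAME"] = "hdfs"

# tell Spark not to use its built-in Parquet reader for Hive metastore tables
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

# drop the old table and remove any leftover warehouse files before rewriting
sqlContext.sql("DROP TABLE IF EXISTS myfile")
os.system("hdfs dfs -rm -r -f /apps/hive/warehouse/myfile")

# rewrite the table and refresh the cached metadata
df.write.format("parquet").mode("overwrite").saveAsTable("myfile")
sqlContext.refreshTable("myfile")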
Created 01-27-2016 11:36 AM
This is a long shot, but I had some trouble with Parquet and Hive in the past, and one change that fixed my problem was switching to ORC. Newer Spark versions support ORC files, and Hive is optimized for ORC. Could you save your data as ORC and run your Spark SQL again?
df.write.format("orc")...
Created 02-02-2016 07:36 PM
@Francis McGregor-Macdonald are you still having issues with this? Can you accept the best answer or provide your own solution?
Created 02-02-2016 08:54 PM
@Artem Ervits yes, I'm still having the issue, though I have moved on to other things. What is the correct response in that scenario?
Created 02-02-2016 08:54 PM
@Francis McGregor-Macdonald the correct response is to call in the big guns :). @vshukla @Ram Sriharsha
Created 02-03-2016 01:47 AM
I think this may be related to https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAswR-5=az1SPxo8EaQvOs2JMh=V82z...
What is your spark-shell mode, yarn-cluster or yarn-client?
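If you are not sure, a quick way to check from the shell (assuming a live SparkContext named sc, PySpark on Spark 1.x):

# prints e.g. 'yarn-client' or 'yarn-cluster' on Spark 1.x
print(sc.master)

# on versions that support it, this returns 'client' or 'cluster'
print(sc.getConf().get("spark.submit.deployMode", "not set"))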