I have a Hive table that has to be updated quite often, so I created it as a "Transactional Table" in ORC format. I am trying to create a Parquet file from it using the following commands in Spark:
val q = spark.sql("select * from hive_table")
q.write.parquet("hive_table.parquet")
The above commands worked for non-transactional tables but don't work for this particular table. I need to get this data into a Parquet file because we are using Spark to query that table (along with many others).
Can someone please suggest how to do this? Are there any parameters to be set in Spark? I am not even able to run a simple "select count(*) from db_name.hive_table" query on this table from Spark, although I can do it in Hive after setting some parameters.
Curious why you think you need the data in Parquet if it is already in ORC.
Spark can read ORC as a Dataset directly out of a Hive table:
val df = spark.table("hive_table")
Or, if not using Hive but only HDFS:
val df = spark.read.format("orc").load("/path/to/orc/hive_table")
In either case, df.count should return the size of that dataset.
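Putting the two steps together, a minimal sketch of the ORC-to-Parquet conversion could look like the following. This assumes a Hive-enabled SparkSession is already available; the table name and output path are placeholders taken from the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: read a Hive table (stored as ORC) and rewrite it as Parquet.
// "table" and "outPath" are placeholders; adjust to your environment.
def orcTableToParquet(spark: SparkSession, table: String, outPath: String): Unit = {
  // Read the table through the metastore (works for temp views too)
  val df: DataFrame = spark.table(table)

  // Write the same rows back out in Parquet format
  df.write.mode("overwrite").parquet(outPath)
}
```

If the table is not registered in Hive, the first line can be swapped for spark.read.format("orc").load("/path/to/orc/hive_table") as shown above; the Parquet write is unchanged.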
Also, you can refer to this guide (though it is for Spark 1.x): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/orc-spark....