Created 06-11-2018 02:19 PM
Hi,
I am writing a Spark DataFrame into a Parquet Hive table like below:
df.write.format("parquet").mode("append").insertInto("my_table")
But when I go to HDFS and check the files that were created for the Hive table, I can see that they are not created with a .parquet extension. The files are created with a .c000 extension.
I am also not sure whether my data is correctly written into the table or not (I can see the data from a Hive select). How should we write the data into .parquet files in the Hive table?
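Is something like the below enough to confirm the data round-trips correctly, or does the missing extension matter? (Rough sketch only; assumes the same SparkSession with Hive support that performed the write.)
// Rough check: read the table back through Spark and compare counts.
// Only a like-for-like comparison if the table was empty before the append.
val rowsWritten = df.count()
val rowsInTable = spark.table("my_table").count()
println(s"rows written: $rowsWritten, rows now in my_table: $rowsInTable")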
Appreciate your help on this!
Thanks,
Created 06-11-2018 02:24 PM
Please run a DESCRIBE on the Hive table. If it shows the storage format as Parquet, then you're good.
More info on DESCRIBE here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe
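For example, from Spark itself (a minimal sketch, using the table name from the question):
// Print the full table definition; check the SerDe / InputFormat / OutputFormat rows
// for the Parquet classes.
spark.sql("DESCRIBE FORMATTED my_table").show(100, false)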
Created 06-11-2018 03:11 PM
@sunile.manjee thanks for your response.
The Hive table has the input format, output format and SerDe set for Parquet (ParquetHiveSerDe). However, my concern is why the files are not created with a .parquet extension, and when I cat those .c000 files I cannot find the Parquet schema that I can find after cat-ing normal .parquet files.
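What I really want to confirm is that those files are genuine Parquet files, i.e. that something like this would work (sketch only, the part-file path is hypothetical):
// Read one of the .c000 part files directly as Parquet and print its schema.
// The path below is only a placeholder for an actual part file under the table directory.
spark.read.parquet("/apps/hive/warehouse/my_table/part-00000-xxxxxxxx.c000").printSchema()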
Created 06-11-2018 03:24 PM
".c This is file counter which means the number of files that have been written in the past for this specific partition". The schema is stored in hive metastore. If want native parquet files with schema, why not store on hdfs and create hive external table?
Created 06-11-2018 04:08 PM
@sunile.manjee there might be multiple workarounds for this, however I am not looking for workarounds. I am expecting a concrete solution that does not have performance complications. We have the option to write the DataFrame into the Hive table straight away, so why should we not go for that instead of writing the data into HDFS and then loading it into a Hive table? Moreover, my Hive table is partitioned on processing year and month.
Created 06-12-2018 11:26 AM
Hi,
I found the correct way to do it. There is no need for any workaround; we can directly append the data into a Parquet Hive table using saveAsTable("mytable") from Spark 2.0 onwards (this was not there in Spark 1.6).
Below is the code in case someone needs it:
df.write.partitionBy("mycol1","mycol2").mode(SaveMode.Append).format("parquet").saveAsTable("myhivetable")
If the table is not there, it will create it and write the data into the Hive table.
If the table is already there, it will append the data into the Hive table and the specified partitions.
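For completeness, a self-contained sketch of the same approach (the table and partition column names are the ones used above; the source path is hypothetical):
import org.apache.spark.sql.{SaveMode, SparkSession}

// saveAsTable needs Hive support enabled so the table is registered in the metastore.
val spark = SparkSession.builder()
  .appName("parquet-hive-append")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical source data; any DataFrame containing mycol1/mycol2 columns works.
val df = spark.read.parquet("/data/incoming/batch")

df.write
  .partitionBy("mycol1", "mycol2")
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("myhivetable")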
Created 06-04-2020 08:28 AM
Probably worth pointing out that the behaviour of insertInto & saveAsTable can differ under certain conditions:
https://towardsdatascience.com/understanding-the-spark-insertinto-function-1870175c3ee9
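One concrete difference worth keeping in mind: insertInto resolves columns by position, while saveAsTable resolves them by name, so a reordered DataFrame can silently put values in the wrong columns with insertInto. An illustrative sketch (table name and schema are hypothetical):
import org.apache.spark.sql.SaveMode

// Assumes spark, df and an existing Parquet Hive table "people" with schema (id INT, name STRING).
// df here has its columns in a different order: (name, id).
val reordered = df.select("name", "id")

// saveAsTable matches columns by name, so the reordered DataFrame still lands correctly.
reordered.write.mode(SaveMode.Append).format("parquet").saveAsTable("people")

// insertInto matches columns by position, so the columns must be put back into table order
// first, otherwise the "name" values would silently end up in the "id" column.
reordered.select("id", "name").write.mode(SaveMode.Append).insertInto("people")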