Created 06-11-2018 02:19 PM
Hi,
I am writing a Spark DataFrame into a Parquet Hive table like below:
df.write.format("parquet").mode("append").insertInto("my_table")
But when I go to HDFS and check the files that were created for the Hive table, I can see that they are not created with a .parquet extension. The files are created with a .c000 extension.
I am also not sure whether my data is correctly written into the table or not (I can see the data from a Hive select). How should we write the data into .parquet files in the Hive table?
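Is something like the below enough to confirm the data round-trips correctly, or does the missing extension matter? (Rough sketch only; assumes the same SparkSession with Hive support that performed the write.)
// Rough check: read the table back through Spark and compare counts.
// Only a like-for-like comparison if the table was empty before the append.
val rowsWritten = df.count()
val rowsInTable = spark.table("my_table").count()
println(s"rows written: $rowsWritten, rows now in my_table: $rowsInTable")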
Appreciate your help on this!
Thanks,
Created 06-11-2018 02:24 PM
Please run a DESCRIBE on the Hive table. If it shows the storage format as Parquet, then you're good.
More info on DESCRIBE here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe
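For example, from Spark itself (a minimal sketch, using the table name from the question):
// Print the full table definition; check the SerDe / InputFormat / OutputFormat rows
// for the Parquet classes.
spark.sql("DESCRIBE FORMATTED my_table").show(100, false)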
Created 06-11-2018 03:11 PM
@sunile.manjee thanks for your response.
The Hive table has the input format, output format and SerDe set for Parquet (ParquetHiveSerDe). However, my concern is why the files are not created with a .parquet extension, and when I cat those .c000 files I cannot find the Parquet schema that I can find after cat-ing normal .parquet files.
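What I really want to confirm is that those files are genuine Parquet files, i.e. that something like this would work (sketch only, the part-file path is hypothetical):
// Read one of the .c000 part files directly as Parquet and print its schema.
// The path below is only a placeholder for an actual part file under the table directory.
spark.read.parquet("/apps/hive/warehouse/my_table/part-00000-xxxxxxxx.c000").printSchema()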
Created 06-11-2018 03:24 PM
".c This is file counter which means the number of files that have been written in the past for this specific partition". The schema is stored in hive metastore. If want native parquet files with schema, why not store on hdfs and create hive external table?
Created 06-11-2018 04:08 PM
@sunile.manjee there might be multiple workarounds for this, however I am not looking for workarounds. I am expecting a concrete solution that does not have performance complications. We have the option to write the DataFrame into the Hive table straight away, so why should we not go for that instead of writing the data into HDFS and then loading it into a Hive table? Moreover, my Hive table is partitioned on processing year and month.
Created 06-12-2018 11:26 AM
Hi,
I found the correct way to do it. There is no need for any workaround; we can directly append the data into a Parquet Hive table using saveAsTable("mytable") from Spark 2.0 onwards (this was not there in Spark 1.6).
Below is the code in case someone needs it:
df.write.partitionBy("mycol1","mycol2").mode(SaveMode.Append).format("parquet").saveAsTable("myhivetable")
If the table is not there, it will create it and write the data into the Hive table.
If the table is already there, it will append the data into the Hive table and the specified partitions.
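For completeness, a self-contained sketch of the same approach (the table and partition column names are the ones used above; the source path is hypothetical):
import org.apache.spark.sql.{SaveMode, SparkSession}

// saveAsTable needs Hive support enabled so the table is registered in the metastore.
val spark = SparkSession.builder()
  .appName("parquet-hive-append")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical source data; any DataFrame containing mycol1/mycol2 columns works.
val df = spark.read.parquet("/data/incoming/batch")

df.write
  .partitionBy("mycol1", "mycol2")
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("myhivetable")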
Created 06-04-2020 08:28 AM
Probably worth pointing out that the behaviour of insertInto & saveAsTable can differ under certain conditions:
https://towardsdatascience.com/understanding-the-spark-insertinto-function-1870175c3ee9
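One concrete difference worth keeping in mind: insertInto resolves columns by position, while saveAsTable resolves them by name, so a reordered DataFrame can silently put values in the wrong columns with insertInto. An illustrative sketch (table name and schema are hypothetical):
import org.apache.spark.sql.SaveMode

// Assumes spark, df and an existing Parquet Hive table "people" with schema (id INT, name STRING).
// df here has its columns in a different order: (name, id).
val reordered = df.select("name", "id")

// saveAsTable matches columns by name, so the reordered DataFrame still lands correctly.
reordered.write.mode(SaveMode.Append).format("parquet").saveAsTable("people")

// insertInto matches columns by position, so the columns must be put back into table order
// first, otherwise the "name" values would silently end up in the "id" column.
reordered.select("id", "name").write.mode(SaveMode.Append).insertInto("people")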