Support Questions


Writing a DataFrame into a Parquet Hive table ends up with .c000 files underneath HDFS


Hi,

I am writing a Spark DataFrame into a Parquet Hive table as shown below:

df.write.format("parquet").mode("append").insertInto("my_table")

But when I go to HDFS and check the files created for the Hive table, I see that they are not created with a .parquet extension; they are created with a .c000 extension instead.

I am also not sure whether my data is correctly written into the table (a Hive SELECT does return the data). How should we write the data into .parquet files in a Hive table?

Appreciate your help on this!

Thanks,

1 ACCEPTED SOLUTION


Hi,

I found the correct way to do it: there is no need for any workaround. We can append the data directly into a Parquet Hive table using saveAsTable("mytable") from Spark 2.0 onwards (this was not available in Spark 1.6).

Below is the code in case someone needs it.

import org.apache.spark.sql.SaveMode

df.write.partitionBy("mycol1", "mycol2").mode(SaveMode.Append).format("parquet").saveAsTable("myhivetable")

If the table does not exist, it will be created and the data written into it.

If the table already exists, the data will be appended into the Hive table and the specified partitions.
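As a quick check after the write, a small sketch, assuming a SparkSession named spark and the table name from the example above:

// List the partitions created by the append and confirm the rows read back
spark.sql("SHOW PARTITIONS myhivetable").show(false)
spark.table("myhivetable").count()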



Master Guru

Please run a DESCRIBE on the Hive table. If it shows the storage format as Parquet, then you're good.

More info on DESCRIBE here:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe
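The same check can be run from Spark itself; a minimal sketch, assuming a SparkSession named spark and the table name my_table from the question:

// Print the table's storage details (input/output format, SerDe, location)
spark.sql("DESCRIBE FORMATTED my_table").show(100, false)

The Storage Information section should list the Parquet input/output formats and ParquetHiveSerDe.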


@sunile.manjee thanks for your response.

The Hive table has the input format, output format, and SerDe set to ParquetHiveSerDe. However, my concern is why the files are not created with a .parquet extension, and why, when I cat those .c000 files, I cannot find the Parquet schema that I can find when I cat normal .parquet files.
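One way to confirm that the .c000 files really are Parquet is to read them back directly; a sketch, assuming a SparkSession named spark and a hypothetical warehouse path (substitute the table's actual HDFS location):

// Hypothetical path; point this at the table's directory (or a single .c000 file) under HDFS
val verify = spark.read.parquet("hdfs:///apps/hive/warehouse/my_table")
verify.printSchema()  // the schema from the Parquet footers is printed even though the file names lack a .parquet suffix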

Master Guru

".c This is file counter which means the number of files that have been written in the past for this specific partition". The schema is stored in hive metastore. If want native parquet files with schema, why not store on hdfs and create hive external table?


@sunile.manjee there might be multiple workarounds for this, but I am not looking for workarounds. I am expecting a concrete solution that does not have performance implications. We have the option to write the DataFrame straight into the Hive table, so why should we not go for that instead of writing the data to HDFS and then loading it into the Hive table? Moreover, my Hive table is partitioned on processing year and month.

