
Inserting into exiting partitioned Hive table using HiveContext in pyspark

New Contributor

I am trying to read a dataset from an existing non-partitioned Hive table and insert it into a partitioned Hive external table. How do I do that in PySpark SQL? Any help would be appreciated.

I am currently using the command below. The Hive external table has multiple partitions.

df.write.mode("overwrite").partitionBy("col1","col2").insertInto("Hive external Partitioned Table")

The Spark job runs successfully, but no data is written to the HDFS partitions of the Hive external table.
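For reference, a minimal sketch of this kind of dynamic-partition insert in pyspark might look like the following. This assumes Spark 2.x with Hive support (on Spark 1.6 a HiveContext fills the same role); the database, table, and column names are hypothetical placeholders, and the two Hive settings shown are commonly required for dynamic partition inserts.

from pyspark.sql import SparkSession

# Hypothetical session; enableHiveSupport() lets Spark see the Hive metastore.
spark = (SparkSession.builder
         .appName("partitioned-insert-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Dynamic partition inserts into Hive tables usually require these settings.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Read from the existing non-partitioned table (hypothetical name).
df = spark.sql("SELECT * FROM source_db.source_table")

# insertInto() matches columns by position, so the partition columns (col1, col2 here)
# must come last, in the same order as they are declared in the target table.
df.select("value_a", "value_b", "col1", "col2") \
    .write.mode("overwrite") \
    .insertInto("target_db.partitioned_external_table")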

1 ACCEPTED SOLUTION

Master Guru

@Vignesh Asokan

1. Run

hive# desc formatted <hive-external-partitioned-table>;

Get the table's Location from the desc formatted output, then run

bash$ hdfs dfs -ls <hdfs-location>

and check whether any partition directories have been created.

2. After executing the statement below in the pyspark shell,

df.write.mode("overwrite").partitionBy("col1","col2").insertInto("Hive external Partitioned Table")

the pyspark shell logs show where the partition directory is being created in HDFS.

Example:

18/01/13 17:47:52 INFO FileUtils: Creating directory if it doesn't exist: hdfs://******/apps/hive/warehouse/partition_table/daily=2017-12-23

As you can see in the pyspark shell logs above, Spark reports that it is creating the directory /apps/hive/warehouse/partition_table/daily=2017-12-23. My table name is partition_table and daily is the partition column, so Spark has created the partition under the HDFS directory /apps/hive/warehouse/partition_table/.

If you are not able to figure out the issue, share more details (pyspark shell logs, table location details, and the statements you are executing in the pyspark shell).
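As a rough illustration of the checks above, the same verification can also be run from the pyspark shell. This is a sketch only, assuming a SparkSession named spark with Hive support (or a HiveContext in older Spark); the table name is a hypothetical placeholder.

# Hypothetical table name; replace with the actual external partitioned table.
table = "default.partitioned_external_table"

# Step 1 equivalent: the Location row of the output gives the HDFS directory to list.
spark.sql("DESCRIBE FORMATTED {0}".format(table)).show(100, truncate=False)

# Partitions currently registered in the Hive metastore.
spark.sql("SHOW PARTITIONS {0}".format(table)).show(truncate=False)

# For external tables, if directories exist in HDFS but SHOW PARTITIONS returns nothing,
# registering the partitions in the metastore may help, e.g.:
# spark.sql("MSCK REPAIR TABLE {0}".format(table))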


3 REPLIES


New Contributor

Hi

Thanks for the answer, it works.

Thanks

Vignesh Asokan

Explorer

Dear Sir, I have Python-calculated values saved in df4, and I want to store them in the "per" column of a Hive table ("per" is the name of a column in the Hive table).

df2 = pd.read_sql("SELECT * FROM ggg where clientcode='55'", conn)

This is how I am reading the data. How do I write the df4 Python-calculated values into the "per" column of the Hive table?
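One possible sketch is to convert the pandas result to a Spark DataFrame and write it into the Hive table from pyspark. This assumes a SparkSession with Hive support; the df4 shown here is only a stand-in for the poster's calculated DataFrame, and the table name is a hypothetical placeholder. Note that this appends rows containing the calculated "per" value; plain (non-ACID) Hive tables do not support updating a single column in place.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Stand-in for the calculated values; in the question these live in df4.
df4 = pd.DataFrame({"clientcode": ["55"], "per": [0.42]})

# Convert the pandas DataFrame to a Spark DataFrame.
sdf = spark.createDataFrame(df4)

# insertInto() matches columns by position, so sdf must provide every column of the
# target table in order (here the hypothetical table has just clientcode and per).
sdf.write.mode("append").insertInto("target_db.target_hive_table")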