
Spark 2.1 Hive ORC saveAsTable pyspark

New Contributor

Please help.

I created an empty external ORC table with two partition columns (country, date_str) in Hive through the CLI.
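
For context, the DDL was roughly equivalent to the sketch below (shown wrapped in spark.sql for convenience; the data columns and location are hypothetical, only the two partition columns can be inferred from the write calls further down):

# hypothetical DDL sketch; the real table was created from the Hive CLI
spark.sql("""
    CREATE EXTERNAL TABLE scratch.daily_test (
        id BIGINT,        -- hypothetical data columns
        value STRING
    )
    PARTITIONED BY (country STRING, date_str STRING)
    STORED AS ORC
    LOCATION '/tmp/scratch/daily_test'  -- hypothetical location
""")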

I then log into the pyspark shell and run the following code:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .enableHiveSupport() \
                    .config("hive.exec.dynamic.partition", "true") \
                    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
                    .config("hive.exec.max.dynamic.partitions", "3000") \
                    .getOrCreate()

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)  # sc is the SparkContext the pyspark shell provides

# ... ETL steps that produce union_df ...

# just to be safe, set the dynamic-partition configs one more time
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

union_df.write.mode("append") \
        .insertInto("scratch.daily_test", overwrite=False)

The above code fails because the number of dynamic partitions is greater than 1000:

 pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 2905, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 2905 

That is why I set the max partitions to 3000 in the session builder above, but the setting does not appear to take effect.

I also checked hive-site.xml (hive-client/conf/hive-site.xml); the max dynamic partitions there is already 5000:

<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>

<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>

<property>
  <name>hive.exec.max.dynamic.partitions</name>
  <value>5000</value>
</property>

When I run

 union_df.write.mode("append").format("orc").partitionBy("country","date_str").saveAsTable("scratch.daily_test")

I get the following:
pyspark.sql.utils.AnalysisException:
u'Saving data in the Hive serde table scratch.daily_test is not supported yet.
Please use the insertInto() API as an alternative.;'

When I run

union_df.write.mode("append").format("orc").partitionBy("country","date_str").insertInto("scratch.daily_test")

I get pyspark.sql.utils.AnalysisException:
u"insertInto() can't be used together with partitionBy(). Partition
columns have already be defined for the table. It is not necessary to use
partitionBy().;"
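
As far as I understand, insertInto resolves columns by position rather than by name, and the partition columns are expected last, so the DataFrame may need to be reordered to match the table definition. A sketch (the data column names here are hypothetical):

# reorder so the partition columns (country, date_str) come last,
# matching insertInto's positional column resolution
ordered_df = union_df.select("id", "value", "country", "date_str")
ordered_df.write.mode("append").insertInto("scratch.daily_test")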

As of now the following works, but it overwrites the entire external table structure and writes it as Parquet (the default format):

 union_df.write.mode("overwrite").partitionBy("country","date_str").saveAsTable("scratch.daily_test")

Questions:

1. How do you insert into an ORC-format table with partitions?

2. How do you work around the hive.exec.max.dynamic.partitions limit?

Please let me know if you need any additional details.

1 ACCEPTED SOLUTION

New Contributor

I added the same configs to spark2-client/conf/hive-site.xml and the following worked:

union_df.write.mode("append") \
        .insertInto("scratch.daily_test", overwrite=False)
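
Presumably the builder .config(...) calls came too late: in the pyspark shell a SparkSession (and its embedded Hive client) already exists, so getOrCreate() returns the existing session with a warning that some configuration may not take effect, and the Hive client keeps reading spark2-client/conf/hive-site.xml. An alternative that should avoid editing hive-site.xml (a sketch, not tested in this thread) is to pass the Hive settings through the spark.hadoop.* prefix before any session exists, e.g. in a spark-submit script:

# sketch: spark.hadoop.* entries are copied into the Hadoop configuration
# that the embedded Hive client reads; set them before a session exists
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .enableHiveSupport() \
    .config("spark.hadoop.hive.exec.dynamic.partition", "true") \
    .config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.hadoop.hive.exec.max.dynamic.partitions", "5000") \
    .getOrCreate()

The write itself stays the same as above.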
