Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Spark 2.1 Hive ORC saveAsTable pyspark

avatar

Please Help.

I created a empty external table ORC format with 2 partitions in hive through cli

I then loginto pyspark shell and run the following code

from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .enableHiveSupport() \
                    .config("hive.exec.dynamic.partition", "true") \
                    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
                    .config("hive.exec.max.dynamic.partitions", "3000") \
                    .getOrCreate()

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

**** ETL 
#just to be save one more time 
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
union_df.write.mode("append") \
        .insertInto("scratch.daily_test",overwrite = False)

The Above code fails because the number of partitions in is > 1000

 pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 2905, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 2905 

Which is why I set the max partitions to 3000 in the session builder.

I also checked hive-site.xml max partitions are 5000 ( this is in hive-client/conf/hive-site.xml)

 <property>
      <name>hive.exec.dynamic.partition</name>
      <value>true</value>
    </property>


    <property>
      <name>hive.exec.dynamic.partition.mode</name>
      <value>nonstrict</value>
    </property>

    <property>
      <name>hive.exec.max.dynamic.partitions</name>
      <value>5000</value>
    </property>

When I ran

 union_df.write.mode("append").format("orc").partitionBy("country","date_str").saveAsTable("scratch.daily_test")

I Get the following
pyspark.sql.utils.AnalysisException:
u'Saving data in the Hive serde table scratch.daily_test is not supported yet.
Please use the insertInto() API as an alternative.;'

When I run

union_df.write.mode("append").format("orc").partitionBy("country","date_str").insertInto("scratch.daily_test")

I get pyspark.sql.utils.AnalysisException:
u"insertInto() can't be used together with partitionBy(). Partition
columns have already be defined for the table. It is not necessary to use
partitionBy().;"

As of Now the following works but it overwrites the entire External structure to Parquet

 union_df.write.mode("overwrite").partitionBy("country","date_str").saveAsTable("scratch.daily_test")

Questions:

1. How do you insert into table with ORC format with partitions ?

2. How to work around the hive.exec.max.dynamic.partitions ?

Please let me know if you need any additional details

1 ACCEPTED SOLUTION

avatar

I added same configs to spark2-client/conf/hive-site.xml and the following worked.

union_df.write.mode("append") \
        .insertInto("scratch.daily_test",overwrite = False)

View solution in original post

1 REPLY 1

avatar

I added same configs to spark2-client/conf/hive-site.xml and the following worked.

union_df.write.mode("append") \
        .insertInto("scratch.daily_test",overwrite = False)