Spark 2.1 Hive ORC saveAsTable pyspark

sanjeevreddy009 — Sat, 30 Dec 2017 00:25:45 GMT

Please Help.

I created a empty external table ORC format with 2 partitions in hive through cli

I then loginto pyspark shell and run the following code

from pyspark.sql import SparkSession
spark = SparkSession.builder \
                    .enableHiveSupport() \
                    .config("hive.exec.dynamic.partition", "true") \
                    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
                    .config("hive.exec.max.dynamic.partitions", "3000") \
                    .getOrCreate()

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

**** ETL 
#just to be save one more time 
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
union_df.write.mode("append") \
        .insertInto("scratch.daily_test",overwrite = False)

The Above code fails because the number of partitions in is > 1000

 pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 2905, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 2905

Which is why I set the max partitions to 3000 in the session builder.

I also checked hive-site.xml max partitions are 5000 ( this is in hive-client/conf/hive-site.xml)

 <property>
      <name>hive.exec.dynamic.partition</name>
      <value>true</value>
    </property>


    <property>
      <name>hive.exec.dynamic.partition.mode</name>
      <value>nonstrict</value>
    </property>

    <property>
      <name>hive.exec.max.dynamic.partitions</name>
      <value>5000</value>
    </property>

When I ran

 union_df.write.mode("append").format("orc").partitionBy("country","date_str").saveAsTable("scratch.daily_test")

I Get the following
pyspark.sql.utils.AnalysisException:
u'Saving data in the Hive serde table scratch.daily_test is not supported yet.
Please use the insertInto() API as an alternative.;'

When I run

union_df.write.mode("append").format("orc").partitionBy("country","date_str").insertInto("scratch.daily_test")

I get pyspark.sql.utils.AnalysisException:
u"insertInto() can't be used together with partitionBy(). Partition
columns have already be defined for the table. It is not necessary to use
partitionBy().;"

As of Now the following works but it overwrites the entire External structure to Parquet

 union_df.write.mode("overwrite").partitionBy("country","date_str").saveAsTable("scratch.daily_test")

Questions:

1. How do you insert into table with ORC format with partitions ?

2. How to work around the hive.exec.max.dynamic.partitions ?

Please let me know if you need any additional details

Re: Spark 2.1 Hive ORC saveAsTable pyspark

sanjeevreddy009 — Sat, 30 Dec 2017 04:16:34 GMT

I added same configs to spark2-client/conf/hive-site.xml and the following worked.

union_df.write.mode("append") \
        .insertInto("scratch.daily_test",overwrite = False)

question Re: Spark 2.1 Hive ORC saveAsTable pyspark in Archives of Support Questions (Read Only)

Spark 2.1 Hive ORC saveAsTable pyspark

Re: Spark 2.1 Hive ORC saveAsTable pyspark