Created 12-29-2017 04:25 PM
Please Help.
I created an empty external table in ORC format with 2 partition columns in Hive through the CLI.
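For context, the DDL presumably looked roughly like the sketch below. The data column is a placeholder, and the LOCATION path is hypothetical, but the table name (scratch.daily_test), the partition columns (country, date_str), and the ORC format match what is used later in this post.

CREATE EXTERNAL TABLE scratch.daily_test (
  -- data columns are placeholders; the real table has its own schema
  some_col STRING
)
PARTITIONED BY (country STRING, date_str STRING)
STORED AS ORC
LOCATION '/path/to/daily_test';  -- hypothetical path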
I then log into the pyspark shell and run the following code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .enableHiveSupport() \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("hive.exec.max.dynamic.partitions", "3000") \
    .getOrCreate()

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

# **** ETL ****

# just to be safe, set these one more time
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

union_df.write.mode("append") \
    .insertInto("scratch.daily_test", overwrite=False)
The above code fails because the number of dynamic partitions created is > 1000:
pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 2905, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 2905
This is why I set the max partitions to 3000 in the session builder.
I also checked hive-site.xml: max partitions are set to 5000 there (this is in hive-client/conf/hive-site.xml):
<property>
  <name>hive.exec.dynamic.partition</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.exec.max.dynamic.partitions</name>
  <value>5000</value>
</property>
When I ran:

union_df.write.mode("append").format("orc").partitionBy("country", "date_str").saveAsTable("scratch.daily_test")

I get the following:

pyspark.sql.utils.AnalysisException: u'Saving data in the Hive serde table scratch.daily_test is not supported yet. Please use the insertInto() API as an alternative.;'
When I run:

union_df.write.mode("append").format("orc").partitionBy("country", "date_str").insertInto("scratch.daily_test")

I get:

pyspark.sql.utils.AnalysisException: u"insertInto() can't be used together with partitionBy(). Partition columns have already be defined for the table. It is not necessary to use partitionBy().;"
As of now, the following works, but it overwrites the entire external table structure and writes it as Parquet:
union_df.write.mode("overwrite").partitionBy("country","date_str").saveAsTable("scratch.daily_test")
Questions:
1. How do you insert into a table in ORC format with partitions?
2. How do you work around the hive.exec.max.dynamic.partitions limit?
Please let me know if you need any additional details.
Created 12-29-2017 08:16 PM
I added the same configs to spark2-client/conf/hive-site.xml and the following worked:
union_df.write.mode("append") \ .insertInto("scratch.daily_test",overwrite = False)