Created 10-29-2020 11:30 PM
Hi,
I am trying to write a DataFrame (the column names are very long, ~100 characters) to a Hive table using the statement below. I am using PySpark.
df.write.mode("overwrite").saveAsTable(TableName)
I am able to write the data to the Hive table when I pass the configs explicitly while submitting the Spark job, as below:
spark-submit --master yarn --deploy-mode client --conf "spark.executor.memoryOverhead=10g" --executor-memory 4g --num-executors 4 --driver-memory 10g script.py
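For reference, the same values can also be kept persistently in spark-defaults.conf under the conf directory of the Spark installation. This is only a sketch with illustrative values mirroring the command above, not taken from the original post:

# conf/spark-defaults.conf (illustrative values)
spark.driver.memory            10g
spark.executor.memory          4g
spark.executor.instances       4
spark.executor.memoryOverhead  10g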
However, my requirement is to keep all the configs inside the Python script, as below, or to update them permanently in the config files of the installation directory. When I set the configs inside the script, I get a Java heap space error. Please find the code snippet and error trace below.
spark = SparkSession \
    .builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.memory", "5g") \
    .config("spark.driver.memory", "10g") \
    .config("spark.executor.memoryOverhead", "10g") \
    .appName("Application Name") \
    .enableHiveSupport().getOrCreate()
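As a quick sanity check (a diagnostic sketch, not part of the original script), the values the running session actually resolved can be printed:

# Hypothetical check: print the settings the session resolved at runtime.
for key in ("spark.driver.memory", "spark.executor.memory", "spark.executor.memoryOverhead"):
    print(key, "=>", spark.sparkContext.getConf().get(key, "<unset>"))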
Traceback (most recent call last):
File "/home/dev/pipeline/script.py", line 117, in <module>
main()
File "/home/dev/pipeline/script.py", line 57, in main
flat_json_df.write.mode("overwrite").saveAsTable(TableName)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 744, in saveAsTable
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2390.saveAsTable.
: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:404)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsString(ObjectWriter.java:999)
at org.json4s.jackson.JsonMethods$class.pretty(JsonMethods.scala:38)
at org.json4s.jackson.JsonMethods$.pretty(JsonMethods.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.prettyJson(TreeNode.scala:600)
at com.microsoft.peregrine.spark.listeners.PlanLogListener.logPlans(PlanLogListener.java:68)
at com.microsoft.peregrine.spark.listeners.PlanLogListener.onSuccess(PlanLogListener.java:57)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:124)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:123)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:145)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:143)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at org.apache.spark.sql.util.ExecutionListenerManager.org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling(QueryExecutionListener.scala:143)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply$mcV$sp(QueryExecutionListener.scala:123)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
at org.apache.spark.sql.util.ExecutionListenerManager.readLock(QueryExecutionListener.scala:156)
at org.apache.spark.sql.util.ExecutionListenerManager.onSuccess(QueryExecutionListener.scala:122)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:658)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:458)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:437)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Just wondering if anyone has faced this same issue. If so, can you please help with how to set these configs inside the script, or do I need to update them directly in the config files of the installation directory?
Created 10-31-2020 06:18 AM
In NiFi, there is a bootstrap.conf file inside the conf folder.
You can update the values of the following properties:
# JVM memory settings
java.arg.2=-Xms2g
java.arg.3=-Xmx8g
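On the Spark side of the question: in client deploy mode the driver JVM is already running by the time SparkSession.builder executes, so spark.driver.memory set inside the script is applied too late to enlarge the driver heap, which is consistent with the heap space error above. Below is a minimal sketch of supplying it before the gateway JVM launches, assuming the script is started with plain python rather than spark-submit (values mirror the working spark-submit command above):

import os

# Driver heap size must be fixed before the driver JVM starts; in client
# mode, spark.driver.memory passed to the builder is read too late.
# PYSPARK_SUBMIT_ARGS is consumed by PySpark's launcher and must end
# with "pyspark-shell".
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-memory 10g "
    "--conf spark.executor.memoryOverhead=10g "
    "pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.memory", "4g") \
    .appName("Application Name") \
    .enableHiveSupport().getOrCreate()

Also worth noting from the stack trace: the allocation fails inside com.microsoft.peregrine.spark.listeners.PlanLogListener while it serializes the query plan to JSON, so either giving the driver more heap via one of the routes above or, if that listener is registered through spark.sql.queryExecutionListeners, removing it from that config are the directions to look.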