
java.lang.OutOfMemoryError: Java heap space - exception while writing data to Hive from a DataFrame using PySpark


Hi,

I am trying to write a DataFrame (with very long column names, ~100 characters each) to a Hive table using the statement below. I am using PySpark.

df.write.mode("overwrite").saveAsTable(TableName)

I am able to write the data to the Hive table when I pass the configs explicitly while submitting the Spark job, as below:

spark-submit --master yarn --deploy-mode client --conf "spark.executor.memoryOverhead=10g" --executor-memory 4g --num-executors 4 --driver-memory 10g script.py

But my requirement is to keep all the configs inside the Python script, as below, or to set them permanently in the config files of the installation directory. When I set the configs inside the script, I get a Java heap space error. Please find the code snippet and error trace below.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.memory", "5g") \
    .config("spark.driver.memory", "10g") \
    .config("spark.executor.memoryOverhead", "10g") \
    .appName("Application Name") \
    .enableHiveSupport() \
    .getOrCreate()
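
For what it's worth, a likely explanation: in client deploy mode the driver JVM is already running by the time SparkSession.builder executes, so spark.driver.memory set via .config() has no effect (the Spark docs call this out for that property). A minimal sketch of one workaround, assuming the script is launched with plain python rather than spark-submit, is to pass the driver-side options through the PYSPARK_SUBMIT_ARGS environment variable before the session is created:

import os
from pyspark.sql import SparkSession

# Driver memory must be fixed before the driver JVM starts, so it is
# supplied here via PYSPARK_SUBMIT_ARGS instead of .config().
# Sketch only; the values mirror the working spark-submit command above,
# and the string must end with "pyspark-shell" for PySpark to pick it up.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-memory 10g "
    "--conf spark.executor.memoryOverhead=10g "
    "pyspark-shell"
)

spark = SparkSession \
    .builder \
    .master("yarn") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.memory", "5g") \
    .appName("Application Name") \
    .enableHiveSupport() \
    .getOrCreate()

Alternatively, the same settings can be made permanent in conf/spark-defaults.conf of the Spark installation (e.g. a line such as spark.driver.memory 10g), which is the config-file route mentioned above.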

Traceback (most recent call last):
File "/home/dev/pipeline/script.py", line 117, in <module>
main()
File "/home/dev/pipeline/script.py", line 57, in main
flat_json_df.write.mode("overwrite").saveAsTable(TableName)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 744, in saveAsTable
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o2390.saveAsTable.
: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.StringBuilder.toString(StringBuilder.java:407)
at com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:404)
at com.fasterxml.jackson.core.io.SegmentedStringWriter.getAndClear(SegmentedStringWriter.java:83)
at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsString(ObjectWriter.java:999)
at org.json4s.jackson.JsonMethods$class.pretty(JsonMethods.scala:38)
at org.json4s.jackson.JsonMethods$.pretty(JsonMethods.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.prettyJson(TreeNode.scala:600)
at com.microsoft.peregrine.spark.listeners.PlanLogListener.logPlans(PlanLogListener.java:68)
at com.microsoft.peregrine.spark.listeners.PlanLogListener.onSuccess(PlanLogListener.java:57)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:124)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1$$anonfun$apply$mcV$sp$1.apply(QueryExecutionListener.scala:123)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:145)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling$1.apply(QueryExecutionListener.scala:143)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at org.apache.spark.sql.util.ExecutionListenerManager.org$apache$spark$sql$util$ExecutionListenerManager$$withErrorHandling(QueryExecutionListener.scala:143)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply$mcV$sp(QueryExecutionListener.scala:123)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
at org.apache.spark.sql.util.ExecutionListenerManager$$anonfun$onSuccess$1.apply(QueryExecutionListener.scala:123)
at org.apache.spark.sql.util.ExecutionListenerManager.readLock(QueryExecutionListener.scala:156)
at org.apache.spark.sql.util.ExecutionListenerManager.onSuccess(QueryExecutionListener.scala:122)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:658)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:458)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:437)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)

Just wondering if anyone has faced this same issue. If so, can you please help with how to set these configs inside the script, or do I need to update them directly in the config files of the installation directory?

1 Reply


In NiFi, there is a bootstrap.conf file inside the conf folder.

You can update the values of the following properties:

# JVM memory settings
java.arg.2=-Xms2g 
java.arg.3=-Xmx8g
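
Note that bootstrap.conf is read only when NiFi starts, so the service has to be restarted (for example with bin/nifi.sh restart) before the new -Xms/-Xmx heap settings take effect.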

https://community.cloudera.com/t5/Community-Articles/How-to-address-JVM-OutOfMemory-errors-in-NiFi/t...