10-22-2021 05:50 AM
Hi Folks,

I'm looking for a bit of background on an error that I keep running into; I've tried to research it but have hit a dead end. My job stops in CDSW with just the message:

Error: Output cell exceeds limit of 9.99 megabytes.

There is no traceback, and the only possible lead is a "Failed displaying message, data too large data = {"type":"error"}" message in the Logs tab. I'm connecting to an Oracle database but taking only a sample subset of the data, and I don't know enough about the error to tell whether that is the cause. I'm doing large joins, for example, with no issue; the subset is hundreds of rows in size and the whole thing would fit into an Excel file (as I said above, it is a sample set).

I have a PySpark dataframe with columns such as id, full_name, email_smtp and time_stamp. For each group of id, full_name and email_smtp I want to get the min/max of time_stamp. I most recently hit this issue when I created a window partitioned by id, full_name and email_smtp and then added columns with the min and max values of time_stamp. I could use a groupBy to get around it, but I have also encountered the error with a plain select statement, so I'm not sure whether it is related to the window function or whether that is just how the error is manifesting now. A simplified sketch of what I'm doing is at the end of this post.

The last time I encountered this issue I raised spark.yarn.executor.memoryOverhead from the default to 1g and that worked, but it has since stopped working and increasing the value further has no effect. I still don't understand the error or why memoryOverhead would clear it temporarily.

I'm running Spark 2.3 on a Hadoop cluster through CDSW (version 1.7.2) with the following driver and executor settings, which have previously worked:

.config("spark.driver.memory", "2g")
.config("spark.executor.memory", "4g")

Any background would be much appreciated,
Gavin
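P.S. For reference, here is a simplified sketch of the window and groupBy approaches I described. The sample rows and values are made up purely for illustration (in the real job the dataframe is a sampled subset read from Oracle); the Spark settings are the ones quoted above.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("min-max-timestamp")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.memory", "4g")
    .config("spark.yarn.executor.memoryOverhead", "1g")
    .getOrCreate()
)

# Illustrative stand-in for the sampled Oracle subset (a few hundred rows in reality),
# with columns: id, full_name, email_smtp, time_stamp
data = [
    (1, "Jane Doe", "jane.doe@example.com", "2021-10-01 09:00:00"),
    (1, "Jane Doe", "jane.doe@example.com", "2021-10-02 17:30:00"),
    (2, "John Smith", "john.smith@example.com", "2021-10-03 08:15:00"),
]
df = spark.createDataFrame(data, ["id", "full_name", "email_smtp", "time_stamp"])

# Window approach: partition by the grouping columns and add min/max columns per row
w = Window.partitionBy("id", "full_name", "email_smtp")
df_window = (
    df.withColumn("min_time_stamp", F.min("time_stamp").over(w))
      .withColumn("max_time_stamp", F.max("time_stamp").over(w))
)

# groupBy alternative that returns one row per group instead
df_grouped = (
    df.groupBy("id", "full_name", "email_smtp")
      .agg(F.min("time_stamp").alias("min_time_stamp"),
           F.max("time_stamp").alias("max_time_stamp"))
)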