Hi Folks,
I'm looking for a bit of background on an error that I keep running into. I have tried to research this but have hit a dead end. My job stops in CDSW with just the message: "Error: Output cell exceeds limit of 9.99 megabytes." There is no traceback, and the only possible lead is a "Failed displaying message, data too large data = {"type":"error"}" message in the Logs tab. I am connecting to an Oracle database but only taking a sample subset of the data, and I do not know enough about the error to determine whether that is the cause. I am doing large joins, for example, with no issue - the subset is hundreds of rows in size and the whole thing would fit into an Excel file (as I said above, it is a sample set).
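For context, the read looks roughly like this - the connection string, table name and sample fraction are placeholders rather than my real values:

    # rough sketch of the Oracle read - url, dbtable, credentials and the
    # sample fraction are placeholders, not the real values
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
          .option("dbtable", "schema.some_table")
          .option("user", "user")
          .option("password", "password")
          .load()
          .sample(fraction=0.01, seed=42))   # small subset, a few hundred rows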
I have a pyspark dataframe with columns, for example: id, full_name, email_smtp and time_stamp. For each group of id, full_name and email_smtp I want the min and max of time_stamp. I most recently encountered this issue when I created a window partitioned by id, full_name and email_smtp and then added columns with the min and max values of time_stamp over that window.
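The window version looks roughly like this (df is the sampled dataframe from above):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # partition by the grouping columns and take min/max of the timestamp
    w = Window.partitionBy("id", "full_name", "email_smtp")
    df_minmax = (df
                 .withColumn("min_ts", F.min("time_stamp").over(w))
                 .withColumn("max_ts", F.max("time_stamp").over(w)))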
I could use a groupBy aggregation to work around this (sketched below), but I have hit the same error with a plain select statement as well, so I am not sure whether it is related to the window function or whether that is just how the error is manifesting now.
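i.e. something like:

    from pyspark.sql import functions as F

    # equivalent aggregation without a window
    df_agg = (df.groupBy("id", "full_name", "email_smtp")
                .agg(F.min("time_stamp").alias("min_ts"),
                     F.max("time_stamp").alias("max_ts")))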
The last time I encountered this issue I increased spark.yarn.executor.memoryOverhead from the default to 1g and that cleared it, but it has since stopped working and increasing the value further has no effect (the setting is shown in the builder sketch below). I still do not understand the error or why memoryOverhead would clear it temporarily.
I am running Spark 2.3 on a Hadoop cluster through CDSW (version 1.7.2) with the following driver and executor settings (these have previously worked):
.config("spark.driver.memory", "2g")
.config("spark.executor.memory", "4g")
Any background would be much appreciated,
Gavin