I am running Spark Thrift Server on YARN in client mode with 50 executor nodes. First I set -Xmx=25g for the driver; STS ran for about 30 minutes and then hung. I then increased the driver's -Xmx to 40g; STS ran for about 1 hour and then hung. I increased -Xmx to 56g; STS ran for about 2 hours and then hung again. I cannot keep increasing the JVM heap forever. In none of these cases did I see an out-of-memory exception in the log file. It seems that whenever I increased the JVM heap on the driver, STS consumed most of it. I dumped the JVM heap (roughly as sketched below) and saw that SparkSession objects are the biggest objects (one of them is about 10G, the others are about 4-6G). I don't understand why the SparkSession objects are so large. Please:
1) Is there any suggestion to help me resolve my issue?
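For reference, I captured the heap dump with something like the following (the PID is a placeholder for the Thrift Server process on my node):

    # Find the Thrift Server JVM and note its PID
    ps -eaf | grep thrift
    # Dump the live heap of that process (writes an .hprof file
    # that can be opened in a tool such as Eclipse MAT)
    jmap -dump:live,format=b,file=/tmp/sts-heap.hprof <pid>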
@tsharma: thank you very much for your response. Based on your suggestion, I searched and found that this parameter seems to be meant for Spark Standalone mode (https://spark.apache.org/docs/2.0.2/spark-standalone.html). My application is running on YARN. Should I still configure this parameter?
Yes, this takes effect in cluster mode too, and it dictates the memory for the Spark History Server and STS daemons. Are you using HDP? If yes, you should be able to set it via Ambari; otherwise, set it directly in spark-env.sh. Please do try this.
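For example, a minimal spark-env.sh entry would look like this (the 6000m value is only an illustration; size it to your workload):

    # In $SPARK_HOME/conf/spark-env.sh (or set via Ambari on HDP)
    # Heap size for Spark daemons such as the History Server and STS
    export SPARK_DAEMON_MEMORY=6000m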
I didn't see an OOM exception in the STS log file. However, when I added "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy -XX:+PrintTenuringDistribution" (a sketch of how I passed these flags is after my questions below), I saw this message in the GC log file: "G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded" (please see the detailed message below). It seems the memory is not enough, but when I increased -Xmx, STS only worked a little longer and then hung again. Back to my previous questions:
1) What is kept in driver memory? Why is it so large (48G), and why does it grow further when I increase -Xmx? As @tsharma said, STS is only a gateway. I am using client mode (not cluster mode).
2) How can I size the memory that needs to be configured for my driver in STS?
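For reference, a rough sketch of how I passed those GC flags to the daemon JVM (this assumes SPARK_DAEMON_JAVA_OPTS in spark-env.sh is picked up by the Thrift Server daemon; the log path is illustrative):

    # In $SPARK_HOME/conf/spark-env.sh -- extra JVM options for Spark daemons
    export SPARK_DAEMON_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintAdaptiveSizePolicy -XX:+PrintTenuringDistribution \
      -Xloggc:/tmp/sts-gc.log"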
Also, when you do a collect or take, the result comes to the driver, and the driver will throw an error if the result of the collect or take is larger than its free space. Hence the driver is kept large to account for that if you have big datasets. However, the default is set to 1G or 2G because the driver mainly schedules tasks in cooperation with YARN, with the operations being performed on the executors themselves (which actually hold the data and can cache and process it).
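If large collects are the concern, the knob that caps them is spark.driver.maxResultSize (this sketch and its value are illustrative, not a tuned recommendation):

    # In the Thrift Server's spark conf file -- value illustrative.
    # Caps the total size of serialized results collected to the driver;
    # jobs exceeding it are aborted instead of silently exhausting driver heap.
    spark.driver.maxResultSize 2g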
As you increase the number of sessions, the STS daemon memory will need to increase too, because it has to keep listening for and handling those sessions.
Note that the -Xmx here corresponds to the Thrift daemon memory rather than the driver memory; the driver memory is taken from spark2-thriftserver/conf/spark-thrift-sparkconf.conf, which is internally a symbolic link to the one inside /etc.
If you don't override it there, it will just pick the default. So please have spark.executor.memory and spark.driver.memory defined there.
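For example (a minimal sketch; the 8g values are placeholders, not tuned recommendations for your cluster):

    # In spark2-thriftserver/conf/spark-thrift-sparkconf.conf (values are placeholders)
    spark.driver.memory   8g
    spark.executor.memory 8g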
Can you get onto your node, run ps -eaf | grep thrift, and paste the output here?
Also, did you set SPARK_DAEMON_MEMORY=6000m as I had asked?
Thank you very much for your response, @tsharma. I do not use HDP for my STS. I will follow your suggestion. I am wondering how you calculated the memory needed for your cluster. Do you have any guidelines, please? As you can see in my log message above, I already set the memory to 48G, but STS seems to consume all of it; if I increase it, it consumes all the memory again ([Eden: 0.0B(2432.0M)->0.0B(2432.0M) Survivors: 0.0B->0.0B Heap: 47.5G(48.0G)->47.5G(48.0G)]).
Thank you very much for your support. I changed the memory to -Xmx=64g and it seems to have resolved my issue. My STS has been running for about 27 hours now. I will keep monitoring to see whether the problem is resolved permanently. I previously set -Xmx to 25G, then 40G, then 56G, but STS ran for a while and then hung. I still do NOT know how to calculate the memory needed for STS. I have about 20 simultaneous users.