Java heap issue
Labels: Apache Spark
Created on 05-31-2017 02:43 PM - edited 09-16-2022 04:40 AM
Hi,
I am trying to read a Hive Parquet table and load it into a pandas DataFrame. I am using PySpark, and my code is as follows:
import pyspark
import pandas
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setAppName("buyclick")
        .setMaster("yarn-client")
        # allow large results to be sent back to the driver
        .set("spark.driver.maxResultSize", "10g")
        .set("spark.driver.memory", "4g")
        .set("spark.driver.cores", "4")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops"))

sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

results = sqlContext.sql("select * from buy_click_p")
res_pdf = results.toPandas()
This fails no matter what I change in the conf parameters, and every time it fails with a Java heap error:
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: Java heap space
Here is some other information about the environment:
Cloudera CDH version: 5.9.0
Hive version: 1.1.0
Spark version: 1.6.0
Hive table size: hadoop fs -du -s -h /path/to/hive/table/folder --> 381.6 M 763.2 M
Free memory on the box (free -m):
             total       used       free     shared    buffers     cached
Mem:         23545      11721      11824         12        258       1773
Please help me out and let me know if any more information is needed.
-Rahul
Created 06-01-2017 02:05 PM
My original heap space issue is now fixed; it seems my driver memory was not optimal. Setting driver memory from the PySpark client does not take effect, because the container has already been created by that time, so I had to set it in the Spark environment properties in the Cloudera Manager console. To do that, I went to Cloudera Manager > Spark > Configuration > Gateway > Advanced, and in "Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf" I added spark.driver.memory=10g, and the Java heap issue was solved. I think this works when you're running your Spark application in yarn-client mode.
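As a quick sanity check (my own addition, not something the safety valve requires), you can print the value the running context actually picked up; if the snippet took effect, this should show 10g:

# confirm the driver memory the running SparkContext actually picked up
print(sc.getConf().get("spark.driver.memory"))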
However, after the Spark job finishes, the application hangs on toPandas. Does anyone have any idea what specific properties need to be set for the conversion of a DataFrame via toPandas?
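For anyone looking at the same hang: as far as I understand, in Spark 1.6 toPandas collects every row to the driver and builds the pandas frame locally, so the driver has to hold the whole table (plus Python object overhead) in memory at once. A rough sketch of the equivalent steps (my understanding, not the exact Spark source):

import pandas as pd

# roughly what DataFrame.toPandas() does in Spark 1.6:
# pull all rows to the driver, then build a local pandas DataFrame
rows = results.collect()
res_pdf = pd.DataFrame.from_records(rows, columns=results.columns)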
-Rahul
