Member since
06-27-2016
6
Posts
1
Kudos Received
0
Solutions
06-01-2017
02:05 PM
My original issue of heap space is now fixed; it seems my driver memory was not optimal. Setting driver memory from the pyspark client does not take effect because the container is already created by that time, so I had to set it in the Spark environment properties in the Cloudera Manager console. To do that I went to Cloudera Manager > Spark > Configuration > Gateway > Advanced, and in "Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf" I added spark.driver.memory=10g, and the Java heap issue was solved. I think this works when you're running your Spark application in yarn-client mode. However, after the Spark job finishes, the application hangs on toPandas. Does anyone have any idea what specific properties need to be set for the conversion of a DataFrame with toPandas? -Rahul
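For reference, the safety-valve entry above amounts to the following fragment in spark-conf/spark-defaults.conf (10g is just the value that worked for my data; since toPandas collects everything to the driver, spark.driver.maxResultSize may also need raising):

```
# spark-conf/spark-defaults.conf (values from my setup; tune to your data size)
spark.driver.memory=10g
spark.driver.maxResultSize=10g
```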
05-31-2017
02:43 PM
Hi, I am trying to access a Hive parquet table and load it into a Pandas data frame. I am using pyspark and my code is as below:

import pyspark
import pandas
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .set("spark.driver.maxResultSize", "10g")
        .setAppName("buyclick")
        .setMaster('yarn-client')
        .set("spark.driver.memory", "4g")
        .set("spark.driver.cores", "4")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops"))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
results = sqlContext.sql("select * from buy_click_p")
res_pdf = results.toPandas()

This has failed continuously whatever I change in the conf parameters, and every time it fails with a Java heap issue:

Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: Java heap space

Here is some other information about the environment:

Cloudera CDH version: 5.9.0
Hive version: 1.1.0
Spark version: 1.6.0
Hive table size: hadoop fs -du -s -h /path/to/hive/table/folder --> 381.6 M 763.2 M
Free memory on the box (free -m):
              total    used    free  shared  buffers  cached
Mem:          23545   11721   11824      12      258    1773

Please help me out and let me know if any more information is needed. -Rahul
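Since toPandas() collects every row to the driver, a quick back-of-the-envelope check can tell whether the table has any chance of fitting in driver memory. This is my own rough sketch, not an official formula: the expansion factor from compressed parquet to in-memory pandas objects is an assumption (often 5-10x, sometimes more for string-heavy data).

```python
def fits_in_driver(parquet_size_mb, driver_mem_mb, expansion_factor=10):
    """Rough estimate of whether a parquet table can survive toPandas().

    expansion_factor is an assumed blow-up from compressed parquet to
    in-memory pandas objects; 10x is a pessimistic guess, not a measured value.
    """
    estimated_mb = parquet_size_mb * expansion_factor
    return estimated_mb, estimated_mb < driver_mem_mb

# Numbers from the post: ~381.6 MB of parquet, 4 GB of driver memory.
est, ok = fits_in_driver(381.6, 4 * 1024)
print(est, ok)  # ~3816 MB estimated vs 4096 MB available: borderline at best
```

With JVM overhead on top of the estimate, a borderline result like this is consistent with the OutOfMemoryError above, which is why bumping spark.driver.memory well past the estimate helped.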
Labels:
Apache Spark
01-25-2017
01:26 PM
http://blog.clairvoyantsoft.com/2016/11/installing-sparkr-on-a-hadoop-cluster/ worked fine for me as well. Just a few extra things I had to do:
1. In the testing section, when I typed sparkR it errored out. It seems you'll have to create links for that to work. In my case I had a CDH parcel installation, so I created the two links below, and it worked fine thereafter:

# cp /usr/bin/sparkR /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/bin/
# rm /usr/bin/sparkR
# cd /etc/alternatives/
# ln -s /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/bin/sparkR sparkR
# cd /usr/bin
# ln -s /etc/alternatives/sparkR sparkR
# sparkR
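The commands above mirror CDH's alternatives mechanism: /usr/bin/sparkR -> /etc/alternatives/sparkR -> the parcel binary. Here is a minimal sketch of that link chain in a scratch directory, so nothing on a real system is touched (all paths here are illustrative, not the real CDH layout):

```shell
#!/bin/sh
# Sketch of the alternatives-style link chain in a throwaway directory.
set -e
root=$(mktemp -d)
mkdir -p "$root/usr/bin" "$root/etc/alternatives" "$root/parcel/bin"

# Stand-in for the real parcel binary.
printf '#!/bin/sh\necho sparkR-from-parcel\n' > "$root/parcel/bin/sparkR"
chmod +x "$root/parcel/bin/sparkR"

# /etc/alternatives/sparkR -> parcel binary
ln -s "$root/parcel/bin/sparkR" "$root/etc/alternatives/sparkR"
# /usr/bin/sparkR -> /etc/alternatives/sparkR
ln -s "$root/etc/alternatives/sparkR" "$root/usr/bin/sparkR"

# Following the chain resolves to the parcel binary.
"$root/usr/bin/sparkR"
rm -rf "$root"
```

The indirection through /etc/alternatives is what lets a parcel upgrade swap the target without touching /usr/bin.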
07-06-2016
03:33 PM
1 Kudo
Late in answering, but the issue happens because PostgreSQL (or any other database) has not been set up, and the CDH server tries to look for the default MySQL database. To resolve it I followed the steps below:
1. Installed the PostgreSQL database following the steps in this link: http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_embed_pstgrs.html#cmig_topic_5_7
2. The above link assumes that you do not need to configure the database, but if you're following Path B you still need to configure the database per the external PostgreSQL database instructions. To configure it I followed the steps in this link: http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_extrnl_pstgrs.html#cmig_topic_5_6_1
3. Once it's configured, run this command to initialize the parameter files:
$ /usr/share/cmf/schema/scm_prepare_database.sh postgresql scm scm scm
After this, start the server using the command:
sudo service cloudera-scm-server start
You should then be able to open the Cloudera Manager server at http://<host>:7180 .
Regards, Rahul
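For reference, scm_prepare_database.sh writes the connection settings it is given into /etc/cloudera-scm-server/db.properties. With the arguments above (database type postgresql, database/user/password all scm) the file ends up looking roughly like this; the exact keys can vary by Cloudera Manager version, so treat this as an illustration:

```
# /etc/cloudera-scm-server/db.properties (illustrative; keys may vary by version)
com.cloudera.cmf.db.type=postgresql
com.cloudera.cmf.db.host=localhost
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.password=scm
```

If the server still fails to start, this file is the first place to check that the type and credentials match the database you actually created.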