Spark job unable to execute in yarn-cluster mode


(Attachment: 58420-2.png)

I am using Spark 1.6.0, Python 2.6.6, and Hadoop 2.7.1.2.4.0.0-169.

I have a PySpark script as follows:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row

conf = SparkConf().setAppName("Log Analysis")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read every log file under the directory as (path, content) pairs
loadFiles = sc.wholeTextFiles("hdfs:///locations")

# split each file into blank-line-separated entries, then clean them up
fileWiseData = loadFiles.flatMap(lambda inpFile: inpFile[1].split("\n\n"))
replaceNewLine = fileWiseData.map(lambda lines: lines.replace("\n", ""))
filterLines = replaceNewLine.map(lambda lines: lines.replace("/", " "))
errorEntries = filterLines.filter(lambda errorLines: "Error" in errorLines)

errEntry = errorEntries.map(lambda line: gettingData(line))  # formatting the data

# wrap each parsed record in a Row and build a DataFrame
ErrorFiltered = Row('ExecutionTimeStamp', 'ExecutionDate', 'ExecutionTime', 'ExecutionEpoch', 'ErrorNum', 'Message')
errorData = errEntry.map(lambda r: ErrorFiltered(*r))

errorDataDf = sqlContext.createDataFrame(errorData)

followed by the remaining transformations.
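(The gettingData helper is not shown in the post. Purely as a hypothetical illustration of the kind of per-line parser it would have to be, something like the sketch below; the regex, field layout, and timestamp format are assumptions, not the actual code.)

import re
import time

def gettingData(line):
    # Hypothetical parser: assumes each cleaned entry starts with a
    # "YYYY-MM-DD HH:MM:SS" timestamp followed by an error number and a message.
    # The real log layout is not shown in the post, so adjust the regex to match it.
    match = re.match(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}).*?Error[:\s]*(\d+)[:\s]*(.*)", line)
    execDate, execTime, errorNum, message = match.groups()
    timestamp = execDate + " " + execTime
    epoch = int(time.mktime(time.strptime(timestamp, "%Y-%m-%d %H:%M:%S")))
    # order matches the Row('ExecutionTimeStamp', 'ExecutionDate', 'ExecutionTime',
    #                      'ExecutionEpoch', 'ErrorNum', 'Message') defined above
    return (timestamp, execDate, execTime, epoch, errorNum, message)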

When I execute the script after splitting my 1 GB log file into 20 MB splits, the script works fine.

  spark-submit --jars /home/hpuser/LogAnaysisPOC/packages/spark-csv_2.10-1.5.0.jar,/home/hpuser/LogAnaysisPOC/packages/commons-csv-1.1.jar --master yarn-cluster --driver-memory 6g --executor-memory 6g --conf spark.yarn.driver.memoryOverhead=4096 --conf spark.yarn.executor.memoryOverhead=4096 /home/user/LogAnaysisPOC/scripts/essbase/Essbaselog.py
1) If I try to execute with the full 1 GB as input at once, it fails at errorDataDf = sqlContext.createDataFrame(errorData).
2) I need to join the parsed data with a metadata DataFrame, which shuffles around 43 MB, and then write the result with dfinal.repartition(1).write.format("com.databricks.spark.csv").save("/user/user/loganalysis"). Again, this works fine for the split data and fails when the data is processed all at once.
The job execution fails with the error: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
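For context, the join-and-write step looks roughly like this. This is a minimal sketch pieced together from the script above and the SQL visible in the traceback further down; metaDataDf and its load path are assumptions, not the actual code.

# assumed: metaDataDf is the ~43 MB metadata DataFrame (load path is a placeholder)
metaDataDf = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load("hdfs:///path/to/error_metadata.csv")

errorDataDf.registerTempTable("errorDataTemp")
metaDataDf.registerTempTable("metaDataTemp")

# range join: each error row is matched to the metadata row whose
# [ErrorStart, ErrorEnd] interval contains its ErrorNum
dfinal = sqlContext.sql("""
    SELECT metaDataTemp.id, errorDataTemp.ErrorNum, errorDataTemp.ExecutionTimeStamp,
           errorDataTemp.ExecutionDate, errorDataTemp.ExecutionTime, errorDataTemp.Message,
           date_format(current_date(), 'd/M/y')
    FROM metaDataTemp INNER JOIN errorDataTemp ON 1=1
    WHERE errorDataTemp.ErrorNum BETWEEN metaDataTemp.ErrorStart AND metaDataTemp.ErrorEnd
""")

dfinal.repartition(1).write.format("com.databricks.spark.csv").save("/user/user/loganalysis")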

Cluster metrics, RM UI screenshots, and application logs (log32.txt) are attached.

yarn.scheduler.capacity.root.queues=default,hive1,hive2
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.capacity=50
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.maximum-am-resource-percent=0.5
yarn.scheduler.capacity.queue-mappings-override.enable=false
yarn.scheduler.capacity.root.default.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.default.ordering-policy=fifo
yarn.scheduler.capacity.root.hive1.acl_administer_queue=*
yarn.scheduler.capacity.root.hive1.acl_submit_applications=*
yarn.scheduler.capacity.root.hive1.capacity=25
yarn.scheduler.capacity.root.hive1.maximum-capacity=100
yarn.scheduler.capacity.root.hive1.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.hive1.ordering-policy=fifo
yarn.scheduler.capacity.root.hive1.state=RUNNING
yarn.scheduler.capacity.root.hive1.user-limit-factor=1
yarn.scheduler.capacity.root.hive2.acl_administer_queue=*
yarn.scheduler.capacity.root.hive2.acl_submit_applications=*
yarn.scheduler.capacity.root.hive2.capacity=25
yarn.scheduler.capacity.root.hive2.maximum-capacity=100
yarn.scheduler.capacity.root.hive2.minimum-user-limit-percent=25
yarn.scheduler.capacity.root.hive2.ordering-policy=fifo
yarn.scheduler.capacity.root.hive2.state=RUNNING
yarn.scheduler.capacity.root.hive2.user-limit-factor=1
yarn.scheduler.capacity.root.user-limit-factor=1
-----------

(Attachment: 58419-1.png)

Thanks


Re: Spark job unable to execute in yarn-cluster mode

Did you take a look at the log you shared? It says "cannot resolve" some columns.

LogType:stdout
Log Upload Time:Fri Jan 26 04:38:43 -0500 2018
LogLength:1195
Log Contents:
Traceback (most recent call last):
  File "Essbaselog.py", line 57, in <module>
    dfinal=sqlContext.sql("Select metaDataTemp.id,errorDataTemp.ErrorNum,errorDataTemp.ExecutionTimeStamp,errorDataTemp.ExecutionDate,errorDataTemp.ExecutionTime,errorDataTemp.Message,date_format(current_date(), 'd/M/y') from metaDataTemp Inner Join  errorDataTemp on 1=1 where errorDataTemp.ErrorNum BETWEEN metaDataTemp.ErrorStart and metaDataTemp.ErrorEnd")
  File "/data/hadoop/yarn/local/usercache/user/appcache/application_1516887566537_0020/container_e22_1516887566537_0020_01_000001/pyspark.zip/pyspark/sql/context.py", line 583, in sql
  File "/data/hadoop/yarn/local/usercache/user/appcache/application_1516887566537_0020/container_e22_1516887566537_0020_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/data/hadoop/yarn/local/usercache/user/appcache/application_1516887566537_0020/container_e22_1516887566537_0020_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve 'metaDataTemp.ErrorStart' given input columns ExecutionTime, ExecutionTimeStamp, Message, ErrorNum, id, ExecutionDate, ExecutionEpoch;"
End of LogType:stdout
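To confirm which columns each registered table actually exposes before the join, a quick check like the following would show whether ErrorStart and ErrorEnd are present (a minimal sketch, using the DataFrame names from the script above):

# print the schema Spark sees for each side of the join; the query can only
# resolve metaDataTemp.ErrorStart/ErrorEnd if they appear here
metaDataDf.printSchema()
errorDataDf.printSchema()

# equivalent check through the SQL layer (Spark 1.6)
print(sqlContext.table("metaDataTemp").columns)
print(sqlContext.table("errorDataTemp").columns)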

Re: Spark job unable to execute in yarn-cluster mode


@Sivaprasanna Thanks for looking it over. That was my mistake; I had uploaded the wrong log file. I have now updated the description with the correct one.
