Member since 07-25-2018
9 Posts · 0 Kudos Received · 0 Solutions
08-14-2018
06:38 AM
@Felix Albani Thank you for answering! Yes, I was monitoring the progress in the Spark web UI. The number of jobs did not grow from iteration to iteration, but the time spent on these jobs did. To be honest, I worked around my problem: I simply read the whole dataset at once and successfully joined it with the other dataframe. However, I would still like to know why "for" loops are so inefficient in Spark.
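A minimal sketch of such a whole-dataset approach, assuming load() is given the full bkg_lst path list, a single illustrative output path, and overwrite mode (the SQL body stays elided as in the loop version shown further down):

# Read all CSV parts in one pass instead of looping file by file
# (PySpark's DataFrameReader.load accepts a list of paths).
all_bkg = sqlContext.read.option("delimiter", ";") \
    .format('com.databricks.spark.csv') \
    .options(header='false', inferschema='true') \
    .load(bkg_lst)

all_bkg.createOrReplaceTempView("bkg")

# The same (not heavy) SQL select as in the loop version, run once.
res_df = sqlContext.sql("""select b.ucp_id from ...""")

# One intersect with the cached dataframe and one write
# instead of 215 small ones; the output path here is illustrative.
res_df.select('UCP_ID', 'TRAVELER_DEST_TERMINAL') \
    .intersect(df_rec.select('UCP_ID', 'ITEM_KEY')) \
    .select('UCP_ID') \
    .coalesce(1) \
    .write.save(path='/tmp/true_recs/all', format='csv', mode='overwrite', sep=',')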
08-13-2018
06:37 PM
count = 0
ln = len(bkg_lst)
start = 0
for i in bkg_lst[:]:  # bkg_lst contains paths to the CSV files
    # read one file per iteration
    part_bkg = sqlContext.read.option("delimiter", ";") \
        .format('com.databricks.spark.csv') \
        .options(header='false', inferschema='true') \
        .load(i)
    if len(part_bkg.head(1)) != 0:
        # filter out unnecessary columns
        part_bkg = part_bkg.select('BKG_ID', 'PNR_ID', 'PAX_ID', 'SEG_NBR', 'PNR_CRTD_DT',
                                   'OPRTNG_FLT_DT_UTC', 'TRAVELER_ORIG_TERMINAL',
                                   'TRAVELER_DEST_TERMINAL', 'DEL_FLG', 'SRC_SYS_CRTD_TMSTMP')
        part_bkg.createOrReplaceTempView("bkg")
        # here goes the necessary SQL select, not heavy for sure
        res_df = sqlContext.sql("""select b.ucp_id from ...""")
        if len(res_df.head(1)) != 0:
            # intersect with the cached dataframe and write this part's result
            res_df.select('UCP_ID', 'TRAVELER_DEST_TERMINAL') \
                .intersect(df_rec.select('UCP_ID', 'ITEM_KEY')) \
                .select('UCP_ID') \
                .coalesce(1) \
                .write.save(path='/tmp/true_recs/' + str(count), format='csv', mode='append', sep=',')
    count += 1
    print(count, " of ", ln)
08-13-2018
06:37 PM
Hello! I am new to Spark (PySpark, to be concrete). In my task I have a dataframe of 300 million rows cached in memory. On the other hand, I have 215 files, each of which I can read into a dataframe of 600k rows. I want to iteratively read the files one by one, find the intersection by some field with the cached dataframe, and write the result to another file. The first 10-12 iterations complete pretty fast and in roughly equal time, 20-25 sec; I monitor them in the Spark UI. However, execution time on the next few iterations increases dramatically, to 1-2 minutes and more, specifically at the write operation. Finally, the whole job gets stuck, so I have to kill it. Could you please give me an idea why this happens? Which resources do I need to release (and how) to start each iteration from scratch?
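For illustration only, a minimal sketch of what releasing per-iteration resources could look like, assuming the loop structure shown above; this is an assumption, not a confirmed fix for the slowdown:

for i in bkg_lst:
    part_bkg = sqlContext.read.option("delimiter", ";") \
        .format('com.databricks.spark.csv') \
        .options(header='false', inferschema='true').load(i)
    # ... same select / temp view / SQL / intersect / write as in the loop above ...

    # clean up what this iteration created (assumption, not a verified remedy)
    part_bkg.unpersist()              # harmless if part_bkg was never cached
    sqlContext.dropTempTable("bkg")   # drop the temp view registered this iteration
    # sqlContext.clearCache() would drop ALL cached tables, including the
    # 300-million-row dataframe, so the targeted calls above are safer.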
Labels:
- Apache Spark
08-10-2018
06:55 AM
So, if there is really no way to restart a particular user's sparkContext, why have the developers of Zeppelin not added this possibility? I know it is possible using Jupyter Notebook.
08-09-2018
09:13 AM
Restarting the interpreter will affect all users.
08-08-2018
03:21 PM
Thank you for answering. I did this, but in that case the interpreter is restarted for all users, which drops their progress; I checked.
08-08-2018
02:05 PM
Hello! My situation is as follows. We have several users running their PySpark tasks via Zeppelin. The Spark interpreter is set to "per user, isolated" mode, so each user gets their own sparkContext when they launch their first pyspark paragraph. The problem occurs when a certain user's task gets stuck and there is a need to kill their sparkContext and rerun the Zeppelin notebook. As admin, I can kill the application in the YARN UI. However, the notebook continues to "run" the nonexistent task in the paragraph. So far, the only way I have found is to restart Zeppelin from Ambari, but I do not want to interrupt the work of the other users. So, is there any way to restart the notebook and sparkContext for a particular user? Thank you in advance.
Labels:
- Apache Spark
- Apache Zeppelin
07-25-2018
02:23 PM
Sorry, it was totally my mistake. I had modified the YARN scheduler configuration in a previous tutorial, and this affected Spark: I created sub-queues in the scheduler but did not add Spark to any of them.
07-25-2018
12:22 PM
Hello! I am following the HDP sandbox tutorial on Zeppelin: https://hortonworks.com/tutorial/hands-on-tour-of-apache-spark-in-5-minutes/ I downloaded and started the notebook and saved all the interpreters; they were all blue. I did not change any Zeppelin or Spark configs. However, the first code paragraph fails with a NullPointerException:
java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:348)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:337)
at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:142)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:790)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:493)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The failing paragraph contains only:
%spark2.spark
spark.version
What could be wrong?
Labels: