Member since 07-25-2018
9 Posts · 0 Kudos Received · 0 Solutions
08-14-2018
06:38 AM
@Felix Albani Thank you for answering! Yes, I was monitoring the progress in the Spark web UI. The number of jobs did not grow from iteration to iteration, but the time spent on these jobs did. To be honest, I worked around my problem: I simply read the whole dataset at once and successfully joined it with the other dataframe. However, I would still like to know why "for" loops are so inefficient in Spark.
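A minimal sketch of such a whole-dataset approach, assuming load() is given the full bkg_lst path list, a single illustrative output path, and overwrite mode (the SQL body stays elided as in the loop version shown further down):

# Read all CSV parts in one pass instead of looping file by file
# (PySpark's DataFrameReader.load accepts a list of paths).
all_bkg = sqlContext.read.option("delimiter", ";") \
    .format('com.databricks.spark.csv') \
    .options(header='false', inferschema='true') \
    .load(bkg_lst)

all_bkg.createOrReplaceTempView("bkg")

# The same (not heavy) SQL select as in the loop version, run once.
res_df = sqlContext.sql("""select b.ucp_id from ...""")

# One intersect with the cached dataframe and one write
# instead of 215 small ones; the output path here is illustrative.
res_df.select('UCP_ID', 'TRAVELER_DEST_TERMINAL') \
    .intersect(df_rec.select('UCP_ID', 'ITEM_KEY')) \
    .select('UCP_ID') \
    .coalesce(1) \
    .write.save(path='/tmp/true_recs/all', format='csv', mode='overwrite', sep=',')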
08-13-2018
06:37 PM
count = 0
ln = len(bkg_lst)
start = 0
for i in bkg_lst[:]:  # bkg_lst contains paths to the CSV files
    # read one file per iteration
    part_bkg = sqlContext.read.option("delimiter", ";") \
        .format('com.databricks.spark.csv') \
        .options(header='false', inferschema='true') \
        .load(i)
    if len(part_bkg.head(1)) != 0:
        # filter out unnecessary columns
        part_bkg = part_bkg.select('BKG_ID', 'PNR_ID', 'PAX_ID', 'SEG_NBR', 'PNR_CRTD_DT',
                                   'OPRTNG_FLT_DT_UTC', 'TRAVELER_ORIG_TERMINAL',
                                   'TRAVELER_DEST_TERMINAL', 'DEL_FLG', 'SRC_SYS_CRTD_TMSTMP')
        part_bkg.createOrReplaceTempView("bkg")
        # here goes the necessary SQL select, not heavy for sure
        res_df = sqlContext.sql("""select b.ucp_id from ...""")
        if len(res_df.head(1)) != 0:
            # intersect with the cached dataframe and write this part's result
            res_df.select('UCP_ID', 'TRAVELER_DEST_TERMINAL') \
                .intersect(df_rec.select('UCP_ID', 'ITEM_KEY')) \
                .select('UCP_ID') \
                .coalesce(1) \
                .write.save(path='/tmp/true_recs/' + str(count), format='csv', mode='append', sep=',')
    count += 1
    print(count, " of ", ln)
08-13-2018
06:37 PM
Hello! I am new to Spark (PySpark, to be concrete). In my task I have a dataframe of 300 million rows cached in memory. On the other hand, I have 215 files, each of which I can read into a dataframe of 600k rows. I want to iteratively read the files one by one, find the intersection by some field with the cached dataframe, and write the result to another file. The first 10-12 iterations complete pretty fast and in roughly equal time, 20-25 sec; I monitor them in the Spark UI. However, execution time on the next few iterations increases dramatically, to 1-2 minutes and more, specifically at the write operation. Finally, the whole job gets stuck, so I have to kill it. Could you please give me an idea why this happens? Which resources do I need to release (and how) to start each iteration from scratch?
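For illustration only, a minimal sketch of what releasing per-iteration resources could look like, assuming the loop structure shown above; this is an assumption, not a confirmed fix for the slowdown:

for i in bkg_lst:
    part_bkg = sqlContext.read.option("delimiter", ";") \
        .format('com.databricks.spark.csv') \
        .options(header='false', inferschema='true').load(i)
    # ... same select / temp view / SQL / intersect / write as in the loop above ...

    # clean up what this iteration created (assumption, not a verified remedy)
    part_bkg.unpersist()              # harmless if part_bkg was never cached
    sqlContext.dropTempTable("bkg")   # drop the temp view registered this iteration
    # sqlContext.clearCache() would drop ALL cached tables, including the
    # 300-million-row dataframe, so the targeted calls above are safer.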
Labels:
- Apache Spark
08-10-2018
06:55 AM
So, if there is really no way to restart a particular user's sparkContext, why have the developers of Zeppelin not added this possibility? I know it is possible using Jupyter Notebook.
08-09-2018
09:13 AM
Restarting the interpreter will affect all users.
08-08-2018
03:21 PM
Thank you for answering. I did this, but in that case the interpreter is restarted for all users, which drops their progress; I checked.
08-08-2018
02:05 PM
Hello! My situation is as follows. We have several users running their PySpark tasks via Zeppelin. The Spark interpreter is set to "per user, isolated" mode, so each user gets their own sparkContext when they launch their first pyspark paragraph. The problem occurs when a certain user's task gets stuck and there is a need to kill their sparkContext and rerun the Zeppelin notebook. As admin, I can kill the application in the YARN UI. However, the notebook continues to "run" the nonexistent task in the paragraph. So far, the only way I have found is to restart Zeppelin from Ambari, but I do not want to interrupt the work of the other users. So, is there any way to restart the notebook and sparkContext for a particular user? Thank you in advance.
Labels:
- Apache Spark
- Apache Zeppelin
07-25-2018
02:23 PM
Sorry, it was totally my mistake. I had modified the YARN scheduler configuration in a previous tutorial, and this affected Spark: I created sub-queues in the scheduler but did not add Spark to any of them.
07-25-2018
12:22 PM
Hello! I am following the HDP sandbox tutorial on Zeppelin: https://hortonworks.com/tutorial/hands-on-tour-of-apache-spark-in-5-minutes/ I downloaded and started the notebook and saved all the interpreters; they were all blue. I did not change any Zeppelin or Spark configs. However, the first code paragraph fails with a NullPointerException:
java.lang.NullPointerException
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:38)
at org.apache.zeppelin.spark.Utils.invokeMethod(Utils.java:33)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext_2(SparkInterpreter.java:348)
at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:337)
at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:142)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:790)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:493)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The failing paragraph contains only:
%spark2.spark
spark.version
What could be wrong?
Labels: