Member since: 05-10-2016
Posts: 97
Kudos Received: 19
Solutions: 13
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2653 | 06-13-2017 09:20 AM |
| | 8250 | 02-02-2017 06:34 AM |
| | 3385 | 12-26-2016 12:36 PM |
| | 2200 | 12-26-2016 12:34 PM |
| | 48499 | 12-22-2016 05:32 AM |
12-14-2016
07:10 PM
When Spark determines it needs to use YARN's localizer, it will always upload the jar to HDFS; it does not check whether the file has changed before uploading. When using the Spark distribution included with CDH, the Spark assembly jar is already present on all nodes and the configuration marks it as local. When the jar is marked local, Spark does not upload it and YARN's localizer is not used.
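For reference, this is a rough sketch of how a node-local assembly jar can be declared in spark-defaults.conf; the parcel path below is an assumption and varies by CDH version:

```
spark.yarn.jar  local:/opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar
```

The `local:` scheme tells Spark the file already exists on every node, so nothing is uploaded to HDFS.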
12-13-2016
11:01 AM
Hi Ranan, because this is an older thread that is already marked as solved, let's keep this conversation on the other thread you opened: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Debug-Spark-program-in-Eclipse-Data-in-AWS/m-p/48472#U48472
11-29-2016
04:39 PM
I've found the solution here: https://community.cloudera.com/t5/Beta-Releases-Apache-Kudu/Spark-2-beta-load-or-save-Hive-managed-table/m-p/47406#M374
11-23-2016
07:38 AM
It will be of great help to me. Thanks a lot, Kamalakanta 🙂
I have some more doubts regarding running Spark SQL queries in parallel.
11-04-2016
06:17 AM
I've seen issues on some hardware where local[*] doesn't use the number of cores you'd expect. The Java method used to detect the number of cores available to the process isn't always consistent. Instead, try specifying the number explicitly, such as local[6], and try again.
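For illustration, a minimal sketch of pinning the core count explicitly (the app name and the value 6 are arbitrary examples):

```python
from pyspark import SparkConf, SparkContext

# Request exactly 6 local worker threads instead of relying on local[*]
conf = SparkConf().setMaster("local[6]").setAppName("explicit-cores")
sc = SparkContext(conf=conf)
```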
10-17-2016
07:20 AM
You are currently unable to restart a streaming context after it has been stopped. You can instead create a new streaming context, or you can restart the entire application. You can also enable checkpointing and start the context from the checkpoint data to recover from any unclean stops: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
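As a rough sketch of that checkpoint-based recovery pattern (the checkpoint directory, app name, and batch interval are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def create_context():
    sc = SparkContext("local[2]", "recoverable-app")
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint("hdfs:///tmp/app_checkpoint")  # placeholder path
    # ... define the DStream transformations here ...
    return ssc

# Rebuilds the context from checkpoint data if it exists,
# otherwise calls create_context() to build a fresh one
ssc = StreamingContext.getOrCreate("hdfs:///tmp/app_checkpoint", create_context)
ssc.start()
ssc.awaitTermination()
```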
10-12-2016
06:21 PM
You may need to check that your RDD is not empty; depending on your processing, empty batches within Spark Streaming can cause some issues. You can guard with !rdd.isEmpty().
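A minimal sketch of that guard in PySpark (dstream and the output path are assumed to exist in your job):

```python
def process(rdd):
    # Skip empty batches so downstream actions don't run on no data
    if not rdd.isEmpty():
        rdd.saveAsTextFile("hdfs:///tmp/output")  # placeholder action/path

dstream.foreachRDD(process)
```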
10-12-2016
06:00 PM
Great, I'm glad the UDF worked. As for the NumPy issue, I'm not familiar enough with using NumPy within Spark to give any insights, but the workaround seems straightforward. If you are looking for a more elegant solution, you may want to create a new thread and include the error. You may also want to take a look at Spark's MLlib statistics functions [1], though they operate across rows instead of within a single column. 1. http://spark.apache.org/docs/latest/mllib-statistics.html
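For context, a brief sketch of those MLlib summary statistics (the sample data is made up; sc is an existing SparkContext):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

# Each Vector is a row; colStats summarizes across rows, per column
rows = sc.parallelize([Vectors.dense([1.0, 10.0]),
                       Vectors.dense([2.0, 20.0]),
                       Vectors.dense([3.0, 30.0])])
summary = Statistics.colStats(rows)
print(summary.mean())      # per-column means
print(summary.variance())  # per-column variances
```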
09-28-2016
06:43 AM
Hi @hubbarja, I have partially achieved what I want. Following is a code sample:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# updateFunction, isDuplicate, and deduplicateRecords are defined elsewhere in the job
def functionToCreateContext():
    sc = SparkContext("local[*]", "dedup")
    ssc = StreamingContext(sc, 60)
    messagesample = ssc.textFileStream("input")
    ssc.checkpoint("dedup_data_checkpoint")
    message_id_hash = messagesample.map(lambda line: line.split("^")) \
                                   .reduceByKey(lambda x, y: (x, y))
    state = message_id_hash.updateStateByKey(updateFunction)
    state.join(messagesample).filter(isDuplicate).map(deduplicateRecords) \
         .saveAsTextFiles('output.txt')
    return ssc
```

It is working fine for me. The only problem is that it creates files for each and every timestamp; I am trying to fix that.
09-23-2016
06:42 AM
A lost task often means the task hit an OOM, or YARN killed the task because it was using more memory than it had requested. Check the task logs and the application master logs; you can pull the logs from YARN with:
yarn logs -applicationId <application ID>
If YARN killed the task, it will say so in the application master log. If this is the case, you can increase the overhead Spark requests beyond the executor memory with spark.yarn.executor.memoryOverhead; it defaults to requesting 10% of the executor memory.
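As a hedged sketch, one way to raise that overhead from PySpark (the 1024 MB value and app name are arbitrary examples):

```python
from pyspark import SparkConf, SparkContext

# Ask YARN for 1024 MB of overhead on top of each executor's heap
conf = (SparkConf()
        .setAppName("overhead-example")
        .set("spark.yarn.executor.memoryOverhead", "1024"))
sc = SparkContext(conf=conf)
```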