Member since: 05-10-2016
Posts: 97
Kudos Received: 19
Solutions: 13
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 3074 | 06-13-2017 09:20 AM |
| | 9216 | 02-02-2017 06:34 AM |
| | 4078 | 12-26-2016 12:36 PM |
| | 2962 | 12-26-2016 12:34 PM |
| | 51441 | 12-22-2016 05:32 AM |
11-02-2016
07:06 PM
Spark processes data in parallel per "partition", which is a block of data. If you have a single Spark partition, only one task will write, so the write will be sequential. To increase parallelism, you can use coalesce or repartition (with the shuffle option), and some transformation functions also accept a number-of-partitions argument. One note: if your Spark partitions are not split by the Hive partition column, each Spark partition can write to many Hive partitions, which may produce many small files.
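As a hedged sketch of that advice (the table and column names below are placeholders, and the partition count is an arbitrary example), repartitioning by the Hive partition column before writing raises write parallelism without scattering each Spark partition across many Hive partitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# "source_table", "target_table", and "event_date" are placeholder names.
df = spark.table("source_table")

# Repartitioning by the Hive partition column keeps each Spark partition
# aligned with a single Hive partition, so more tasks write in parallel
# without producing many small files per Hive partition.
(df.repartition(8, "event_date")          # 8 partitions is an arbitrary example
   .write
   .mode("append")
   .partitionBy("event_date")
   .saveAsTable("target_table"))
```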
11-02-2016
07:45 AM
You will need to have Spark authenticate via Kerberos. This can be done by specifying the correct properties on the command line: https://www.cloudera.com/documentation/enterprise/5-7-x/topics/sg_spark_auth.html
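As a hedged example (the principal, keytab path, and application file are placeholders; the linked Cloudera page covers the details), either obtain a ticket with kinit before submitting, or pass a principal and keytab so Spark can renew tickets itself on YARN:

```bash
# Option 1: authenticate first, then submit (placeholder principal).
kinit user@EXAMPLE.COM
spark-submit --master yarn my_app.py

# Option 2: let Spark manage the ticket on YARN (placeholder keytab path).
spark-submit --master yarn \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  my_app.py
```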
10-22-2016
07:26 PM
2 Kudos
There is nothing native within Spark to handle running queries in parallel. Instead, take a look at Java concurrency, in particular Futures [1], which let you start queries in parallel and check their status later. 1. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Future.html
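The post points at Java Futures; the same idea sketched from PySpark uses Python's concurrent.futures (the table names below are made up, and each submitted function simply runs its own Spark job):

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder queries; each one becomes an independent Spark job.
queries = [
    "SELECT count(*) FROM table_a",
    "SELECT count(*) FROM table_b",
]

def run_query(sql):
    # collect() blocks the submitting thread until the job finishes.
    return spark.sql(sql).collect()

with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    futures = [pool.submit(run_query, q) for q in queries]   # start in parallel
    results = [f.result() for f in futures]                   # check/collect later
```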
10-17-2016
07:20 AM
You are currently unable to restart a streaming context after it has been stopped. Instead, create a new streaming context or restart the entire application. You can also enable checkpointing and start the context from the checkpoint data to recover from any unclean stops: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
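A hedged sketch of that checkpoint-based recovery (the checkpoint directory, batch interval, and source are placeholders); StreamingContext.getOrCreate rebuilds the context from checkpoint data after an unclean stop, or calls the setup function on a fresh start:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs:///tmp/streaming-checkpoint"   # placeholder path

def create_context():
    sc = SparkContext(appName="checkpoint-example")
    ssc = StreamingContext(sc, 10)                    # 10s batches, arbitrary
    ssc.checkpoint(checkpoint_dir)
    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
    lines.count().pprint()                            # placeholder processing
    return ssc

# Recovers from the checkpoint if one exists, otherwise builds a new context.
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
ssc.start()
ssc.awaitTermination()
```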
10-12-2016
06:21 PM
You may need to check that your RDD is not empty; depending on your processing, empty batches within Spark Streaming can cause issues. Guard each batch with a check like !rdd.isEmpty before processing it.
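A minimal PySpark sketch of that guard inside foreachRDD (the socket source, host, and port are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="empty-batch-guard")
ssc = StreamingContext(sc, 10)                        # 10s batches, arbitrary
lines = ssc.socketTextStream("localhost", 9999)       # placeholder source

def process(time, rdd):
    # Skip empty micro-batches so downstream logic never sees empty data.
    if not rdd.isEmpty():
        print(time, rdd.count())

lines.foreachRDD(process)
ssc.start()
ssc.awaitTermination()
```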
10-12-2016
06:00 PM
Great, I'm glad the UDF worked. As for the numpy issue, I'm not familiar enough with using numpy within Spark to give any insights, but the workaround seems trivial enough. If you are looking for a more elegant solution, you may want to create a new thread and include the error. You may also want to take a look at Spark's MLlib statistics functions [1], though they operate across rows rather than within a single column. 1. http://spark.apache.org/docs/latest/mllib-statistics.html
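For reference, a hedged sketch of those MLlib summary statistics; colStats works column-wise over an RDD of row vectors, and the numbers below are made up:

```python
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="colstats-example")

# Each element is one row vector; colStats summarizes each column.
rows = sc.parallelize([np.array([1.0, 10.0]),
                       np.array([2.0, 20.0]),
                       np.array([3.0, 30.0])])

summary = Statistics.colStats(rows)
print(summary.mean(), summary.max(), summary.variance())
```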
10-11-2016
07:18 PM
1 Kudo
Hi Brian, You shouldn't need to use explode; that will create a new row for each value in the array. The reason max isn't working for your dataframe is that it tries to find the max of that column across every row in your dataframe, not the max within the array. Instead you will need to define a UDF and call it within withColumn:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def maxList(values):
    return float(max(values))

maxUdf = udf(maxList, FloatType())
df = df.withColumn('WF_Peak', maxUdf('wfdataseries'))

As for using pandas and converting back to a Spark DF, yes, you will have a limitation on memory. toPandas calls collect on the dataframe and brings the entire dataset into memory on the driver, so you will be moving data across the network and holding it locally in memory; it should only be called if the DF is small enough to store locally.
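As a hedged illustration of that caution (reusing the column name from the example above; the limit is arbitrary), shrink the DataFrame before collecting it to the driver and convert back afterwards:

```python
# Only safe when the result fits in driver memory; `spark` and `df` are
# assumed to already exist as in the example above.
small_df = df.select('WF_Peak').limit(10000)   # shrink before collecting
pdf = small_df.toPandas()                      # pulls the rows onto the driver
df2 = spark.createDataFrame(pdf)               # convert back to a Spark DataFrame
```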
09-28-2016
06:35 AM
Are you trying to join your stateful RDD to another RDD, or just trying to look up a specific record? Depending on how large the dataset is and how many lookups you need, you could call collect to bring the RDD to the driver and do quick lookups there, or call filter on the RDD to keep only the records you want before collecting to the driver. If you are joining to another dataset, you will first need to map over both the history RDD and the other RDD to create key/value pairs before joining.
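A hedged sketch of both options (the keys and values below are made up):

```python
from pyspark import SparkContext

sc = SparkContext(appName="lookup-vs-join")

# Placeholder data standing in for the stateful/history RDD and the other RDD.
history_rdd = sc.parallelize([("user1", 10), ("user2", 25)])
other_rdd   = sc.parallelize([("user1", "NY"), ("user3", "CA")])

# Option 1: a handful of lookups -- filter first, then collect the small
# result to the driver.
user1_state = history_rdd.filter(lambda kv: kv[0] == "user1").collect()

# Option 2: joining two datasets -- both RDDs are key/value pairs here, so
# join directly (map over them first if they are not already keyed).
joined = history_rdd.join(other_rdd).collect()   # [("user1", (10, "NY"))]
print(user1_state, joined)
```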
09-26-2016
09:08 AM
1 Kudo
Unfortunately, mapWithState is only available in Java and Scala right now; it hasn't been ported to Python yet. Follow https://issues.apache.org/jira/browse/SPARK-16854 if you are interested in knowing when this will be worked on, and +1 the JIRA to show you are interested in using it from Python.
09-26-2016
07:06 AM
1 Kudo
It sounds like you should be able to use updateStateByKey for your scenario; in newer versions of Spark there are other stateful operators (such as mapWithState) that may give you better performance. When using stateful operations like updateStateByKey, you define a checkpoint directory that saves the state for you. If a failure occurs, Spark recovers the state from the checkpoint directory and reruns any records needed to catch up.
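A hedged sketch of a stateful word count with updateStateByKey (the socket source, checkpoint path, and batch interval are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stateful-count")
ssc = StreamingContext(sc, 10)                       # 10s batches, arbitrary
ssc.checkpoint("hdfs:///tmp/state-checkpoint")       # placeholder directory

lines = ssc.socketTextStream("localhost", 9999)      # placeholder source

def update_count(new_values, running_count):
    # new_values: this key's values in the current batch;
    # running_count: state carried over from previous batches (None at first).
    return sum(new_values) + (running_count or 0)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .updateStateByKey(update_count))

counts.pprint()
ssc.start()
ssc.awaitTermination()
```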