Member since: 05-10-2016
Posts: 97
Kudos Received: 19
Solutions: 13
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 3074 | 06-13-2017 09:20 AM |
| | 9216 | 02-02-2017 06:34 AM |
| | 4078 | 12-26-2016 12:36 PM |
| | 2962 | 12-26-2016 12:34 PM |
| | 51441 | 12-22-2016 05:32 AM |
11-02-2016
07:06 PM
Spark processes data in parallel per "partition", which is a block of data. If you have a single Spark partition, only one task will write, so the write will be sequential. To increase parallelism, you can use coalesce or repartition (with the shuffle option), and some transformation functions also accept a number-of-partitions argument. One note: if your Spark partitions are not split by the Hive partition column, each Spark partition can write to many Hive partitions, which may produce many small files.
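As a hedged sketch of that advice (the table and column names below are placeholders, and the partition count is an arbitrary example), repartitioning by the Hive partition column before writing raises write parallelism without scattering each Spark partition across many Hive partitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# "source_table", "target_table", and "event_date" are placeholder names.
df = spark.table("source_table")

# Repartitioning by the Hive partition column keeps each Spark partition
# aligned with a single Hive partition, so more tasks write in parallel
# without producing many small files per Hive partition.
(df.repartition(8, "event_date")          # 8 partitions is an arbitrary example
   .write
   .mode("append")
   .partitionBy("event_date")
   .saveAsTable("target_table"))
```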
11-02-2016
07:45 AM
You will need to have Spark authenticate via Kerberos. This can be done by specifying the correct properties on the command line: https://www.cloudera.com/documentation/enterprise/5-7-x/topics/sg_spark_auth.html
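As a hedged example (the principal, keytab path, and application file are placeholders; the linked Cloudera page covers the details), either obtain a ticket with kinit before submitting, or pass a principal and keytab so Spark can renew tickets itself on YARN:

```bash
# Option 1: authenticate first, then submit (placeholder principal).
kinit user@EXAMPLE.COM
spark-submit --master yarn my_app.py

# Option 2: let Spark manage the ticket on YARN (placeholder keytab path).
spark-submit --master yarn \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  my_app.py
```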
10-22-2016
07:26 PM
2 Kudos
There is nothing native within Spark to handle running queries in parallel. Instead, take a look at Java concurrency, in particular Futures [1], which let you start queries in parallel and check their status later. 1. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Future.html
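The post points at Java Futures; the same idea sketched from PySpark uses Python's concurrent.futures (the table names below are made up, and each submitted function simply runs its own Spark job):

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder queries; each one becomes an independent Spark job.
queries = [
    "SELECT count(*) FROM table_a",
    "SELECT count(*) FROM table_b",
]

def run_query(sql):
    # collect() blocks the submitting thread until the job finishes.
    return spark.sql(sql).collect()

with ThreadPoolExecutor(max_workers=len(queries)) as pool:
    futures = [pool.submit(run_query, q) for q in queries]   # start in parallel
    results = [f.result() for f in futures]                   # check/collect later
```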
10-17-2016
07:20 AM
You are currently unable to restart a streaming context after it has been stopped. Instead, create a new streaming context or restart the entire application. You can also enable checkpointing and start the context from the checkpoint data to recover from any unclean stops: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
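A hedged sketch of that checkpoint-based recovery (the checkpoint directory, batch interval, and source are placeholders); StreamingContext.getOrCreate rebuilds the context from checkpoint data after an unclean stop, or calls the setup function on a fresh start:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs:///tmp/streaming-checkpoint"   # placeholder path

def create_context():
    sc = SparkContext(appName="checkpoint-example")
    ssc = StreamingContext(sc, 10)                    # 10s batches, arbitrary
    ssc.checkpoint(checkpoint_dir)
    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
    lines.count().pprint()                            # placeholder processing
    return ssc

# Recovers from the checkpoint if one exists, otherwise builds a new context.
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
ssc.start()
ssc.awaitTermination()
```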
10-12-2016
06:21 PM
You may need to check that your RDD is not empty; depending on your processing, empty batches within Spark Streaming can cause issues. Guard each batch with a check like !rdd.isEmpty before processing it.
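A minimal PySpark sketch of that guard inside foreachRDD (the socket source, host, and port are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="empty-batch-guard")
ssc = StreamingContext(sc, 10)                        # 10s batches, arbitrary
lines = ssc.socketTextStream("localhost", 9999)       # placeholder source

def process(time, rdd):
    # Skip empty micro-batches so downstream logic never sees empty data.
    if not rdd.isEmpty():
        print(time, rdd.count())

lines.foreachRDD(process)
ssc.start()
ssc.awaitTermination()
```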
10-12-2016
06:00 PM
Great, I'm glad the UDF worked. As for the numpy issue, I'm not familiar enough with using numpy within Spark to give any insights, but the workaround seems trivial enough. If you are looking for a more elegant solution, you may want to create a new thread and include the error. You may also want to take a look at Spark's MLlib statistics functions [1], though they operate across rows rather than within a single column. 1. http://spark.apache.org/docs/latest/mllib-statistics.html
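For reference, a hedged sketch of those MLlib summary statistics; colStats works column-wise over an RDD of row vectors, and the numbers below are made up:

```python
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="colstats-example")

# Each element is one row vector; colStats summarizes each column.
rows = sc.parallelize([np.array([1.0, 10.0]),
                       np.array([2.0, 20.0]),
                       np.array([3.0, 30.0])])

summary = Statistics.colStats(rows)
print(summary.mean(), summary.max(), summary.variance())
```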
10-11-2016
07:18 PM
1 Kudo
Hi Brian, You shouldn't need to use explode; that will create a new row for each value in the array. The reason max isn't working for your dataframe is that it tries to find the max of that column across every row in your dataframe, not the max within the array. Instead you will need to define a UDF and call it within withColumn:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def maxList(values):
    return float(max(values))

maxUdf = udf(maxList, FloatType())
df = df.withColumn('WF_Peak', maxUdf('wfdataseries'))

As for using pandas and converting back to a Spark DF, yes, you will have a limitation on memory. toPandas calls collect on the dataframe and brings the entire dataset into memory on the driver, so you will be moving data across the network and holding it locally in memory; it should only be called if the DF is small enough to store locally.
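As a hedged illustration of that caution (reusing the column name from the example above; the limit is arbitrary), shrink the DataFrame before collecting it to the driver and convert back afterwards:

```python
# Only safe when the result fits in driver memory; `spark` and `df` are
# assumed to already exist as in the example above.
small_df = df.select('WF_Peak').limit(10000)   # shrink before collecting
pdf = small_df.toPandas()                      # pulls the rows onto the driver
df2 = spark.createDataFrame(pdf)               # convert back to a Spark DataFrame
```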
09-28-2016
06:35 AM
Are you trying to join your stateful RDD to another RDD, or just trying to look up a specific record? Depending on how large the dataset is and how many lookups you need, you could call collect to bring the RDD to the driver and do quick lookups there, or call filter on the RDD to keep only the records you want before collecting to the driver. If you are joining to another dataset, you will first need to map over both the history RDD and the other RDD to create key/value pairs before joining.
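A hedged sketch of both options (the keys and values below are made up):

```python
from pyspark import SparkContext

sc = SparkContext(appName="lookup-vs-join")

# Placeholder data standing in for the stateful/history RDD and the other RDD.
history_rdd = sc.parallelize([("user1", 10), ("user2", 25)])
other_rdd   = sc.parallelize([("user1", "NY"), ("user3", "CA")])

# Option 1: a handful of lookups -- filter first, then collect the small
# result to the driver.
user1_state = history_rdd.filter(lambda kv: kv[0] == "user1").collect()

# Option 2: joining two datasets -- both RDDs are key/value pairs here, so
# join directly (map over them first if they are not already keyed).
joined = history_rdd.join(other_rdd).collect()   # [("user1", (10, "NY"))]
print(user1_state, joined)
```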
09-26-2016
09:08 AM
1 Kudo
Unfortunately, mapWithState is only available in Java and Scala right now; it hasn't been ported to Python yet. Follow https://issues.apache.org/jira/browse/SPARK-16854 if you are interested in knowing when this will be worked on, and +1 the JIRA to show you are interested in using it from Python.
09-26-2016
07:06 AM
1 Kudo
It sounds like you should be able to use updateStateByKey for your scenario; in newer versions of Spark there are other stateful operators (such as mapWithState) that may give you better performance. When using stateful operations like updateStateByKey, you define a checkpoint directory that saves the state for you. If a failure occurs, Spark recovers the state from the checkpoint directory and reruns any records needed to catch up.
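A hedged sketch of a stateful word count with updateStateByKey (the socket source, checkpoint path, and batch interval are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stateful-count")
ssc = StreamingContext(sc, 10)                       # 10s batches, arbitrary
ssc.checkpoint("hdfs:///tmp/state-checkpoint")       # placeholder directory

lines = ssc.socketTextStream("localhost", 9999)      # placeholder source

def update_count(new_values, running_count):
    # new_values: this key's values in the current batch;
    # running_count: state carried over from previous batches (None at first).
    return sum(new_values) + (running_count or 0)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .updateStateByKey(update_count))

counts.pprint()
ssc.start()
ssc.awaitTermination()
```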