Member since 06-29-2016 | 30 Posts | 3 Kudos Received | 0 Solutions
08-08-2016
08:58 PM
Hi Binu, thanks for the answer. Since Spark still passes data between nodes for DataFrames, does Kryo still make sense as an optimization?
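For reference, a minimal sketch of how Kryo is typically enabled for the RDD-based parts of a job (the MyRecord class and the app name are placeholders, not from this thread):

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, name: String)   // hypothetical payload class

val conf = new SparkConf()
  .setAppName("kryo-sketch")                  // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front keeps Kryo from writing full class names with every record.
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)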
08-07-2016
10:48 PM
When using DataFrames (Dataset&lt;Row&gt;), there's no option for an Encoder. Does that mean DataFrames (since they build on top of an RDD) use Java serialization? Does using Kryo make sense as an optimization here? If not, what's the difference between Java/Kryo serialization, Tungsten, and Encoders?
Thank you!
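To make the question concrete, a rough Spark 2.0-style sketch of the two cases (the Person class and the file path are placeholders):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Person(name: String, age: Long)    // hypothetical record type

val spark = SparkSession.builder().appName("encoder-sketch").getOrCreate()
import spark.implicits._

// Untyped: a DataFrame is just Dataset[Row]; rows are kept in Tungsten's binary row format.
val df: Dataset[Row] = spark.read.json("people.json")   // placeholder path

// Typed: .as[Person] picks up an implicit Encoder[Person] derived from the case class.
val ds: Dataset[Person] = df.as[Person]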
Labels:
- Apache Spark
07-31-2016
11:14 PM
I used explain() and here is what I found: without explicitly calling functions.broadcast(), there was no mention of broadcasting. When I called broadcast(), there was a broadcast hint in the logical plans, but the final physical plan did not mention broadcasting. However, I do not see anything like a broadcast RDD, or any RDD at all (is this because I'm using DataFrames?).
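A minimal sketch of this kind of check, with stand-in data (sqlContext assumed to be an existing SQLContext on Spark 1.6):

import org.apache.spark.sql.functions.broadcast

// Stand-in data, just to compare plans.
val df1 = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("key", "v1")
val df2 = sqlContext.createDataFrame(Seq((1, "x"))).toDF("key", "v2")

// With the explicit hint the physical plan should contain a BroadcastHashJoin operator;
// without it, a SortMergeJoin may show up instead when the size estimate is too large.
val joined = df1.join(broadcast(df2), "key")
joined.explain(true)   // prints both the logical and the physical plan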
07-31-2016
10:25 PM
I'm doing a groupBy using DataFrames. I considered converting to an RDD, doing reduceByKey, and then converting back to a DataFrame, but DataFrames already apply under-the-hood optimizations, so I shouldn't need to worry about the local-aggregation benefit of reduceByKey.
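A small sketch of the two routes, with stand-in data (sqlContext assumed to be an existing SQLContext):

// Stand-in data.
val df = sqlContext.createDataFrame(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("key", "value")

// DataFrame route: count() per key is planned as a partial aggregation on each
// partition followed by a final aggregation after the shuffle.
val counts = df.groupBy("key").count()

// RDD route: reduceByKey also combines locally before the shuffle, but gives up the
// Catalyst/Tungsten optimizations and needs a conversion back to a DataFrame afterwards.
val rddCounts = df.rdd
  .map(row => (row.getAs[String]("key"), 1L))
  .reduceByKey(_ + _)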
07-31-2016
09:17 PM
Hi Joe, thanks for the suggestions. I have the broadcast join threshold set to 300 MB (but the Spark SQL tab in the UI doesn't say broadcast join anywhere; is this because the resulting physical plan finds a more optimal join than a broadcast?). As for shuffle partitions, I currently have it set to 611, because the UI shows a shuffle read/write size of about 7.8 GB and I aimed for roughly 128 MB per partition. Will update with results.
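The two settings being discussed, as they can be set in code (values are the ones from this thread; sqlContext assumed to be an existing SQLContext):

// In Spark 1.6 the broadcast threshold is a byte count (default 10485760 = 10 MB).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (300L * 1024 * 1024).toString)

// Number of partitions used for DataFrame shuffles (joins, aggregations); default is 200.
sqlContext.setConf("spark.sql.shuffle.partitions", "611")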
07-31-2016
05:09 AM
Hi Benjamin, thanks for replying. I'm confused by what you said about 16*2-3 being a good value for executors. From what I know, executor-cores should be set to at most the number of cores in a single node (so executor-cores &lt;= 16 here). Or do you mean setting executor-cores to 2-3?
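For concreteness, the per-executor settings from earlier in this thread written out in code; executor-cores is cores per executor (so it must fit within one node's 16 cores), and the values below are the ones already posted, not a recommendation:

import org.apache.spark.SparkConf

// Values taken from the spark-defaults.conf posted earlier in this thread.
val conf = new SparkConf()
  .setAppName("sizing-sketch")              // hypothetical app name
  .set("spark.executor.instances", "24")
  .set("spark.executor.cores", "3")         // cores per executor, not per node
  .set("spark.executor.memory", "10g")
  .set("spark.driver.memory", "5g")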
07-30-2016
08:54 PM
My data consists of two CSV files, one of 1.7 TB and one of about 200 MB. Update: I tried a block size of 64 MB and again there was no difference. Earlier I had tried a block size of 256 MB and there was no difference in time (see screenshot); the block size is reported as 256 MB, and task deserialization and GC time remain the same as before increasing the block size. The relative input size for each executor is about the same (the largest and smallest input sizes are about 12-15 GB apart). Is this normal, or is this considered data skew?
Thank you for replying.
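A sketch of how the split count and the per-partition spread can be checked (assuming the spark-csv package, com.databricks.spark.csv, and an existing sqlContext; the path is a placeholder):

// Read the large CSV (placeholder path).
val df = sqlContext.read
  .format("com.databricks.spark.csv")       // spark-csv package, assumed available on Spark 1.6
  .option("header", "true")
  .load("hdfs:///path/to/large.csv")        // placeholder path

// One task per input split: ~1.7 TB at a 128 MB block size is roughly 14000 partitions.
println(df.rdd.partitions.length)

// Rough skew check (note: this scans the data): count rows per partition and look at the spread.
val perPartition = df.rdd
  .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
  .collect()
perPartition.sortBy(-_._2).take(10).foreach(println)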
07-30-2016
06:57 AM
I am processing ~2 TB of HDFS data using DataFrames. The size of a task is equal to the block size specified by HDFS, which happens to be 128 MB, leading to about 15000 tasks. I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
I'm performing groupBy, count, and an outer-join with another DataFrame of ~200 MB size (~80 MB cached but I don't need to cache it), then saving to disk.
Right now it takes about 55 minutes, and I've been trying to tune it.
I read on the Spark Tuning guide that:
In general, we recommend 2-3 tasks per CPU core in your cluster.
This means that I should have about 30-50 tasks instead of 15000, and each task would be much bigger in size. Is my understanding correct, and is this suggested? I've read from different sources to decrease or increase parallelism (via spark.default.parallelism or by changing the block size), or even to keep the default. Thank you for your help!
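For reference, the arithmetic behind the quoted guideline with this cluster's numbers, plus the usual knobs for changing partition counts without touching the block size (df is a placeholder for the loaded DataFrame):

// Cluster numbers from this post: 5 worker nodes x 16 cores.
val totalCores = 5 * 16                              // 80 cores
val taskRange = (2 * totalCores, 3 * totalCores)     // "2-3 tasks per CPU core" across the cluster -> (160, 240)

// Read-side parallelism follows the input splits; it can be lowered afterwards with
// coalesce or raised with repartition:
//   val fewer = df.coalesce(240)
//   val more  = df.repartition(240)

// Shuffle-side parallelism for DataFrame joins/aggregations is a separate setting:
//   sqlContext.setConf("spark.sql.shuffle.partitions", "240")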
Labels:
- Apache NiFi
- Apache Spark
07-23-2016
03:33 PM
Hello,
Right now I'm using DataFrames to perform a df1.groupBy(key).count() on one DataFrame and join it with another, df2.
The first, df1, is very large (many gigabytes) compared to df2 (250 MB).
Right now I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each.
It is taking me about 1 hour and 40 minutes to perform the groupBy, count, and join, which seems very slow to me.
Currently I have set the following in my spark-defaults.conf:
spark.executor.instances 24
spark.executor.memory 10g
spark.executor.cores 3
spark.driver.memory 5g
spark.sql.autoBroadcastJoinThreshold 200Mb
I have a couple of questions regarding tuning for performance as a beginner.
Right now I'm running Spark 1.6.0. Would moving to the Spark 2.0 Dataset (or even DataFrame) API be much better? What if I used RDDs instead? I know that reduceByKey is better than groupByKey, and DataFrames don't have that method.
I think I can do a broadcast join and have set a threshold. Do I need to set it above my second DataFrame's size? Do I need to explicitly call broadcast(df2)?
What's the point of driver memory?
Can anyone point out anything wrong with my tuning numbers, or any additional parameters worth checking out?
Thank you a lot!
Sincerely,
Jestin
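For concreteness, a rough sketch of the job described above (placeholder paths and join key; the spark-csv package, com.databricks.spark.csv, assumed for CSV input on Spark 1.6):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

val sqlContext = new SQLContext(sc)          // sc: existing SparkContext

// Placeholder paths.
val df1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("hdfs:///path/to/big.csv")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("hdfs:///path/to/small.csv")

val counts = df1.groupBy("key").count()      // "key" is a placeholder column name

// Explicit hint so the small side is broadcast regardless of how its size estimate
// compares to spark.sql.autoBroadcastJoinThreshold.
val result = counts.join(broadcast(df2), Seq("key"), "outer")

result.write.format("com.databricks.spark.csv").save("hdfs:///path/to/output")   // placeholder path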
Labels:
- Apache Spark