Member since 06-29-2016 | 30 Posts | 3 Kudos Received | 0 Solutions
08-08-2016
08:58 PM
Hi Binu, thanks for the answer. Since Spark still passes data between nodes for DataFrames, does Kryo still make sense as an optimization?
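For reference, a minimal sketch of how Kryo is typically enabled for the RDD-based parts of a job (the MyRecord class and the app name are placeholders, not from this thread):

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, name: String)   // hypothetical payload class

val conf = new SparkConf()
  .setAppName("kryo-sketch")                  // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front keeps Kryo from writing full class names with every record.
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)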
08-07-2016
10:48 PM
When using DataFrames (Dataset&lt;Row&gt;), there's no option for an Encoder. Does that mean DataFrames (since they build on top of an RDD) use Java serialization? Does using Kryo make sense as an optimization here? If not, what's the difference between Java/Kryo serialization, Tungsten, and Encoders?
Thank you!
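To make the question concrete, a rough Spark 2.0-style sketch of the two cases (the Person class and the file path are placeholders):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Person(name: String, age: Long)    // hypothetical record type

val spark = SparkSession.builder().appName("encoder-sketch").getOrCreate()
import spark.implicits._

// Untyped: a DataFrame is just Dataset[Row]; rows are kept in Tungsten's binary row format.
val df: Dataset[Row] = spark.read.json("people.json")   // placeholder path

// Typed: .as[Person] picks up an implicit Encoder[Person] derived from the case class.
val ds: Dataset[Person] = df.as[Person]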
Labels:
- Apache Spark
07-31-2016
11:14 PM
I used explain() and here is what I found: without explicitly calling functions.broadcast(), there was no mention of broadcasting. When I called broadcast(), there was a broadcast hint in the logical plans, but the final physical plan did not mention broadcasting. However, I do not see anything like a broadcast RDD, or any RDD at all (is this because I'm using DataFrames?).
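A minimal sketch of this kind of check, with stand-in data (sqlContext assumed to be an existing SQLContext on Spark 1.6):

import org.apache.spark.sql.functions.broadcast

// Stand-in data, just to compare plans.
val df1 = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("key", "v1")
val df2 = sqlContext.createDataFrame(Seq((1, "x"))).toDF("key", "v2")

// With the explicit hint the physical plan should contain a BroadcastHashJoin operator;
// without it, a SortMergeJoin may show up instead when the size estimate is too large.
val joined = df1.join(broadcast(df2), "key")
joined.explain(true)   // prints both the logical and the physical plan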
07-31-2016
10:25 PM
I'm doing a groupBy using DataFrames. I considered converting to an RDD, doing reduceByKey, and then converting back to a DataFrame, but DataFrames already apply under-the-hood optimizations, so I shouldn't need to worry about the local-aggregation benefit of reduceByKey.
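A small sketch of the two routes, with stand-in data (sqlContext assumed to be an existing SQLContext):

// Stand-in data.
val df = sqlContext.createDataFrame(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("key", "value")

// DataFrame route: count() per key is planned as a partial aggregation on each
// partition followed by a final aggregation after the shuffle.
val counts = df.groupBy("key").count()

// RDD route: reduceByKey also combines locally before the shuffle, but gives up the
// Catalyst/Tungsten optimizations and needs a conversion back to a DataFrame afterwards.
val rddCounts = df.rdd
  .map(row => (row.getAs[String]("key"), 1L))
  .reduceByKey(_ + _)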
07-31-2016
09:17 PM
Hi Joe, thanks for the suggestions. I have the broadcast join threshold set to 300 MB (but the Spark SQL tab in the UI doesn't say broadcast join anywhere; is this because the resulting physical plan finds a more optimal join than a broadcast?). As for shuffle partitions, I currently have it set to 611, because the UI shows a shuffle read/write size of about 7.8 GB and I aimed for roughly 128 MB per partition. Will update with results.
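The two settings being discussed, as they can be set in code (values are the ones from this thread; sqlContext assumed to be an existing SQLContext):

// In Spark 1.6 the broadcast threshold is a byte count (default 10485760 = 10 MB).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (300L * 1024 * 1024).toString)

// Number of partitions used for DataFrame shuffles (joins, aggregations); default is 200.
sqlContext.setConf("spark.sql.shuffle.partitions", "611")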
07-31-2016
05:09 AM
Hi Benjamin, thanks for replying. I'm confused by what you said about 16*2-3 being a good value for executors. From what I know, executor-cores should be set to at most the number of cores in a single node (so executor-cores &lt;= 16 here). Or do you mean setting executor-cores to 2-3?
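For concreteness, the per-executor settings from earlier in this thread written out in code; executor-cores is cores per executor (so it must fit within one node's 16 cores), and the values below are the ones already posted, not a recommendation:

import org.apache.spark.SparkConf

// Values taken from the spark-defaults.conf posted earlier in this thread.
val conf = new SparkConf()
  .setAppName("sizing-sketch")              // hypothetical app name
  .set("spark.executor.instances", "24")
  .set("spark.executor.cores", "3")         // cores per executor, not per node
  .set("spark.executor.memory", "10g")
  .set("spark.driver.memory", "5g")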
07-30-2016
08:54 PM
My data consists of two CSV files, one of 1.7 TB and one of about 200 MB. Update: I tried a block size of 64 MB and again there was no difference. Earlier I had tried a block size of 256 MB and there was no difference in time (see screenshot); the block size is reported as 256 MB, and task deserialization and GC time remain the same as before increasing the block size. The relative input size for each executor is about the same (the largest and smallest input sizes are about 12-15 GB apart). Is this normal, or is this considered data skew?
Thank you for replying.
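A sketch of how the split count and the per-partition spread can be checked (assuming the spark-csv package, com.databricks.spark.csv, and an existing sqlContext; the path is a placeholder):

// Read the large CSV (placeholder path).
val df = sqlContext.read
  .format("com.databricks.spark.csv")       // spark-csv package, assumed available on Spark 1.6
  .option("header", "true")
  .load("hdfs:///path/to/large.csv")        // placeholder path

// One task per input split: ~1.7 TB at a 128 MB block size is roughly 14000 partitions.
println(df.rdd.partitions.length)

// Rough skew check (note: this scans the data): count rows per partition and look at the spread.
val perPartition = df.rdd
  .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
  .collect()
perPartition.sortBy(-_._2).take(10).foreach(println)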
07-30-2016
06:57 AM
I am processing ~2 TB of HDFS data using DataFrames. The size of a task is equal to the block size specified by HDFS, which happens to be 128 MB, leading to about 15000 tasks. I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
I'm performing groupBy, count, and an outer-join with another DataFrame of ~200 MB size (~80 MB cached but I don't need to cache it), then saving to disk.
Right now it takes about 55 minutes, and I've been trying to tune it.
I read on the Spark Tuning guide that:
In general, we recommend 2-3 tasks per CPU core in your cluster.
This means that I should have about 30-50 tasks instead of 15000, and each task would be much bigger in size. Is my understanding correct, and is this suggested? I've read from different sources to decrease or increase parallelism (via spark.default.parallelism or by changing the block size), or even to keep the default. Thank you for your help!
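For reference, the arithmetic behind the quoted guideline with this cluster's numbers, plus the usual knobs for changing partition counts without touching the block size (df is a placeholder for the loaded DataFrame):

// Cluster numbers from this post: 5 worker nodes x 16 cores.
val totalCores = 5 * 16                              // 80 cores
val taskRange = (2 * totalCores, 3 * totalCores)     // "2-3 tasks per CPU core" across the cluster -> (160, 240)

// Read-side parallelism follows the input splits; it can be lowered afterwards with
// coalesce or raised with repartition:
//   val fewer = df.coalesce(240)
//   val more  = df.repartition(240)

// Shuffle-side parallelism for DataFrame joins/aggregations is a separate setting:
//   sqlContext.setConf("spark.sql.shuffle.partitions", "240")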
Labels:
- Apache NiFi
- Apache Spark
07-23-2016
03:33 PM
Hello,
Right now I'm using DataFrames to perform a df1.groupBy(key).count() on one DataFrame and join it with another, df2.
The first, df1, is very large (many gigabytes) compared to df2 (250 MB).
Right now I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each.
It is taking me about 1 hour and 40 minutes to perform the groupBy, count, and join, which seems very slow to me.
Currently I have set the following in my spark-defaults.conf:
spark.executor.instances 24
spark.executor.memory 10g
spark.executor.cores 3
spark.driver.memory 5g
spark.sql.autoBroadcastJoinThreshold 200Mb
I have a couple of questions regarding tuning for performance as a beginner.
Right now I'm running Spark 1.6.0. Would moving to the Spark 2.0 Dataset (or even DataFrame) API be much better? What if I used RDDs instead? I know that reduceByKey is better than groupByKey, and DataFrames don't have that method.
I think I can do a broadcast join and have set a threshold. Do I need to set it above my second DataFrame's size? Do I need to explicitly call broadcast(df2)?
What's the point of driver memory?
Can anyone point out anything wrong with my tuning numbers, or any additional parameters worth checking out?
Thank you a lot!
Sincerely,
Jestin
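For concreteness, a rough sketch of the job described above (placeholder paths and join key; the spark-csv package, com.databricks.spark.csv, assumed for CSV input on Spark 1.6):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

val sqlContext = new SQLContext(sc)          // sc: existing SparkContext

// Placeholder paths.
val df1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("hdfs:///path/to/big.csv")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("hdfs:///path/to/small.csv")

val counts = df1.groupBy("key").count()      // "key" is a placeholder column name

// Explicit hint so the small side is broadcast regardless of how its size estimate
// compares to spark.sql.autoBroadcastJoinThreshold.
val result = counts.join(broadcast(df2), Seq("key"), "outer")

result.write.format("com.databricks.spark.csv").save("hdfs:///path/to/output")   // placeholder path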
Labels:
- Apache Spark