About fil

Harsh J · ‎09-09-2015

Start here, and drill further down into the DFSClient and DFSInputStream, etc. classes: https://github.com/cloudera/hadoop-common/blob/cdh5.4.5-release/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java#L294-L303

bjorn.jonsson · ‎08-10-2015

Hi, As described in the sort-based shuffle design doc (https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf), each map task should generate one shuffle data file and one index file. Regarding your second question, the property that sets the buffer for shuffle data is "spark.shuffle.memoryFraction". This is discussed in more detail in the following Cloudera blog post: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ Regards, Bjorn
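To make the file count concrete, here is a minimal sketch in Python; it is plain arithmetic, not Spark's own code. The function names and job sizes are hypothetical. The sort-based count follows the one data file plus one index file per map task described above, and the hash-based comparison assumes the unconsolidated case of one file per map task per reduce partition.

```python
def sort_shuffle_files(num_map_tasks: int) -> int:
    """Sort-based shuffle: one data file plus one index file per map task."""
    return 2 * num_map_tasks


def hash_shuffle_files(num_map_tasks: int, num_reduce_partitions: int) -> int:
    """Assumed comparison point: unconsolidated hash-based shuffle,
    one file per (map task, reduce partition) pair."""
    return num_map_tasks * num_reduce_partitions


if __name__ == "__main__":
    maps, reducers = 1000, 200  # hypothetical job size
    print("sort-based:", sort_shuffle_files(maps))
    print("hash-based:", hash_shuffle_files(maps, reducers))
```

Under the sort-based scheme the count grows with the number of map tasks only, so it does not depend on how many reduce partitions the job has.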

srowen · ‎07-27-2015

The first case is: read - shuffle - persist - count. The second case is: read (from persisted copy) - count. You are right that coalesce does not always shuffle, but it may in this case. It depends on whether you started with more or fewer partitions. You should look at the Spark UI to see whether a shuffle occurred.

alex.behm · ‎07-17-2015

Thanks for the update. I can reproduce the issue, but only when the target partition is empty. As soon as I add some data, compute incremental stats works as expected. So I'm still thinking you are hitting an edge case with an empty partition?

cjervis · ‎07-16-2015

I am happy to see that you found your answer. Thanks for sharing it. 🙂

fil · ‎07-15-2015

Actually, the problem was very aggressive caching, which overfilled the spark.yarn.executor.memoryOverhead buffer and, as a consequence, caused an OOM error. I just increased it and everything works now.
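For anyone hitting the same error, here is a minimal sketch of raising the overhead at submit time. It only builds a spark-submit argument list in Python. The helper name, the 1024 value, the yarn-client master and the my_job.py script are all hypothetical, and the value is assumed to be in megabytes, as in Spark 1.x on YARN.

```python
from typing import List


def submit_args(app: str, overhead_mb: int) -> List[str]:
    """Build a spark-submit command line that sets the YARN executor memory overhead."""
    if overhead_mb <= 0:
        raise ValueError("overhead_mb must be positive")
    return [
        "spark-submit",
        "--master", "yarn-client",  # assumption: Spark 1.x style master name
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead_mb}",  # assumed unit: MB
        app,
    ]


if __name__ == "__main__":
    # Hypothetical application script and overhead value.
    print(" ".join(submit_args("my_job.py", 1024)))
```

The right value depends on the job; the sketch only shows where the property goes, not what number to pick.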

fil · ‎07-08-2015

thanks, Alex

HenryR · ‎07-02-2015

None of Impala's supported file formats are able to store data in sorted order on disk. Therefore the ORDER BY clause in the INSERT does not have any effect. The data is written out in a potentially unsorted order regardless. Best, Henry

fil · ‎06-30-2015

thanks a lot!

jkestelyn · ‎06-23-2015

The QuickStart VM includes example data. If you're looking for a VM that is exclusive to Spark, I don't think you'll find that.

Last Visited	‎02-26-2020 04:56 PM

Member Since	‎09-17-2014 01:36 AM
Posts	88
Kudos received	3

Cloudera Community

Re: What does mean AverageThreadTokens in impala's...

Re: Spark's faill durring persist()

Re: Hadoop read IO size

Re: Number of intermediate files with Sort shuffle...

Re: Benefit of DISK_ONLY persists

Re: Impala won't update stats on Hive Avro table

Re: Force encoding type for given column

Re: Performance Reduced after Removing ORDER BY cl...

Re: Restrict users for FairScheduler pool

Re: VM for Spark/Scala development