I have just read the post:
@ClouderaEng blog: working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle
It is a very nice write-up. I had a similar problem and was wondering what solution Ilya Ganelin would have taken, or whether Justin Kestelyn could give a brief overview of the approach with Spark. Thanks in advance!
I'll provide the link -
Of course, my first thought is that the article is two years old and things have changed quite a bit since then. @mbigelow is right, though: some additional background on your needs would be helpful.