
Why does write.mode("append") cause spark to create hundreds of tasks?


Contributor

Hi all,

I'm performing a write operation to a Postgres database in Spark. The dataframe has 44k rows and is in 4 partitions, but the Spark job takes 20+ minutes to complete. Looking at the logs (attached), I see the map stage is the bottleneck, with more than 600 tasks created. Does anyone have any insight into why this might be and how I could resolve the performance issue? I've also included a screenshot of my cluster metrics.
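For reference, the write looks roughly like this (the JDBC URL, table name, and credentials below are placeholders, not the real values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-append").getOrCreate()

# Dataframe built from a Hive table; ~44k rows in 4 partitions (or so I expected)
df = spark.sql("SELECT * FROM my_hive_table")

# Append the dataframe to Postgres over JDBC
df.write \
    .mode("append") \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
    .option("dbtable", "public.target_table") \
    .option("user", "dbuser") \
    .option("password", "dbpass") \
    .save()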

Thanks,

Mike

77623-log.png

77625-clustermetrics.png

4 Replies

Re: Why does write.mode("append") cause spark to create hundreds of tasks?

Contributor

I think I've found the issue: the Hive table I query to create the dataframe has the same number of underlying HDFS blocks, so Spark creates one task per block. Merging these together has improved performance, although the job still takes 5 minutes to complete.
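In case it helps anyone else, the change was essentially the sketch below (the target count of 4 just matches my original partitioning; I merged the partitions in Spark, though compacting the small files in Hive itself would have a similar effect):

# One partition per underlying HDFS block means one write task per block,
# so collapse the partitions before the JDBC write.
df = spark.sql("SELECT * FROM my_hive_table")
print(df.rdd.getNumPartitions())   # 600+ before merging

# coalesce avoids a full shuffle; use repartition(4) to force evenly sized partitions
df.coalesce(4).write \
    .mode("append") \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
    .option("dbtable", "public.target_table") \
    .option("user", "dbuser") \
    .option("password", "dbpass") \
    .save()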


Re: Why does write.mode("append") cause spark to create hundreds of tasks?

Super Guru

Try a map over partitions and have each partition write several hundred or thousand rows to Postgres at a time.


Re: Why does write.mode("append") cause spark to create hundreds of tasks?

Contributor
@sunile.manjee

Are you referring to mapPartitions() (http://apachesparkbook.blogspot.com/2015/11/mappartition-example.html)? Could you provide an example of how I might apply this?

Thanks,

Mike


Re: Why does write.mode("append") cause spark to create hundreds of tasks?

Super Guru

Yes, that is it. Basically, inside the iterator you would build one large multi-row insert statement, for example:

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

Your column names can come from the dataframe and the values come from the dataframe itself, so nothing is hard-coded and you can reuse this code for virtually any database that accepts ANSI SQL inserts.
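A rough sketch of the pattern in PySpark, assuming the psycopg2 driver is available on the executors; the connection details reuse the placeholder values from earlier in the thread, and the columns follow the films example above:

import psycopg2

def write_partition(rows):
    # One connection per partition, one multi-row INSERT per batch
    conn = psycopg2.connect(host="dbhost", dbname="mydb",
                            user="dbuser", password="dbpass")
    cur = conn.cursor()
    sql = "INSERT INTO films (code, title, did, date_prod, kind) VALUES "
    batch = []
    for row in rows:
        # mogrify safely quotes each tuple of values
        batch.append(cur.mogrify("(%s, %s, %s, %s, %s)",
                                 (row.code, row.title, row.did,
                                  row.date_prod, row.kind)).decode())
        if len(batch) >= 1000:          # flush every 1000 rows
            cur.execute(sql + ",".join(batch))
            batch = []
    if batch:                           # flush whatever is left
        cur.execute(sql + ",".join(batch))
    conn.commit()
    cur.close()
    conn.close()

# foreachPartition is the action form of the mapPartitions pattern
df.rdd.foreachPartition(write_partition)

The point is that each partition opens a single connection and issues a handful of large inserts instead of one round trip per row.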
