Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to save each partition of a Dataframe/Dataset in parallel with partitionBy or InsertInto Hive

How to save each partition of a Dataframe/Dataset in parallel with partitionBy or InsertInto Hive

New Contributor

Hello

I currently use spark 2.0.1 and i try to save my dataset into a "partitioned table Hive" with insertInto() or on S3 storage with partitionBy("col") with job in concurrency (parallel). But with this 2 methods each partition of my dataset is save sequentially one by one . It's very very SLOW.

I already know that I must use insertInto() or partitionBy() one at time.

I assume that in spark.2.0.1 Dataframe are Resilient Data Set .

My current code :

df.write.mode(SaveMode.Append).partitionBy("col").save("s3://bucket/diroutput")

Or

df.write.mode(SaveMode.Append).insertInto("TableHivealreadypartitioned")

So I try some stuff with df.foreachPartition like this :

df.foreachPartition{datasetpartition => datasetpartition.foreach(row => row.sometransformation)}

Unfortunately i still do not find a way to write/save in parallel each partition of my dataset.

Someone already done this?

Can you tell me how to proceed?

Is it a wrong direction?

thanks for your help

5 REPLIES 5

Re: How to save each partition of a Dataframe/Dataset in parallel with partitionBy or InsertInto Hi

Expert Contributor

Spark will process data in parrallel per "partition" which is a block of data.  If you have a single spark partition, it will only use one task to write which will be sequential.  If you would like to increase parrallelism, you can use coalesce or repartition with the shuffle option or sometimes there is an option to specify number of partitions within your transformation functions.

 

A note though, if your spark partitions are not split by the hive partition column, each spark partition can write to many hive partitions which may cause many small files.

Re: How to save each partition of a Dataframe/Dataset in parallel with partitionBy or InsertInto Hi

New Contributor

Hello hubbarja . Thanks for your feedback.

 

You will find below an extract logs .In the first example it is the "InserInto(tablehivealreadypartitionned)" in hive.

We can see that all "partitions" Spark are written one by one.

In the second example it is the "partitionBy().save()" that write directly to S3.

We can see also that all "partitions" spark are written one by one.

The dataframe we handle only has one "partition" and the size of it is about 200MB uncompressed (in memory).

The Job can Take 120s 170s to save the Data with the option local[4] .

 

INFO] 2016-11-03 00:10:33,255 org.apache.spark.SparkContext logInfo - Created broadcast 2330 from broadcast at TorExitLookup.scala:43
[INFO] 2016-11-03 00:10:35,302 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer logInfo - Sorting complete. Writing out partition files one at a time.
[INFO] 2016-11-03 00:10:35,363 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/db/.hive-staging_hive_2016-11-03_00-10-29_426_1749488585639143697-1/-ext-10000/tsbucket=2016-11-02 09%3A00%3A00/part-00001
[INFO] 2016-11-03 00:10:35,380 org.apache.spark.mapred.SparkHadoopMapRedUtil logInfo - No need to commit output of task because needsTaskCommit=false: attempt_201611030010_0948_m_000001_0
[INFO] 2016-11-03 00:10:35,380 org.apache.spark.executor.Executor logInfo - Finished task 1.0 in stage 948.0 (TID 1385). 2652 bytes result sent to driver
[INFO] 2016-11-03 00:10:35,381 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 1.0 in stage 948.0 (TID 1385) in 5718 ms on localhost (1/2)
[INFO] 2016-11-03 00:11:23,033 org.apache.spark.storage.BlockManagerInfo logInfo - Removed broadcast_2330_piece0 on 10.0.193.149:34016 in memory (size: 6.9 KB, free: 414.4 MB)
[INFO] 2016-11-03 00:11:58,194 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer logInfo - Sorting complete. Writing out partition files one at a time.
[INFO] 2016-11-03 00:12:00,210 org.apache.spark.storage.BlockManagerInfo logInfo - Removed broadcast_2329_piece0 on 10.0.193.149:34016 in memory (size: 6.9 KB, free: 414.4 MB)
[INFO] 2016-11-03 00:12:05,295 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/db/.hive-staging_hive_2016-11-03_00-10-29_426_1749488585639143697-1/-ext-10000/tsbucket=2016-11-02 09%3A00%3A00/part-00000
[INFO] 2016-11-03 00:12:05,831 org.apache.spark.mapred.SparkHadoopMapRedUtil logInfo - No need to commit output of task because needsTaskCommit=false: attempt_201611030010_0948_m_000000_0
[INFO] 2016-11-03 00:12:05,835 org.apache.spark.executor.Executor logInfo - Finished task 0.0 in stage 948.0 (TID 1384). 2652 bytes result sent to driver
[INFO] 2016-11-03 00:12:05,835 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 0.0 in stage 948.0 (TID 1384) in 96173 ms on localhost (2/2)
[INFO] 2016-11-03 00:12:05,835 org.apache.spark.scheduler.DAGScheduler logInfo - ResultStage 948 (insertInto at ImportHive.scala:24) finished in 96,173 s
[INFO] 2016-11-03 00:12:05,835 org.apache.spark.scheduler.TaskSchedulerImpl logInfo - Removed TaskSet 948.0, whose tasks have all completed, from pool
[INFO] 2016-11-03 00:12:05,836 org.apache.spark.scheduler.DAGScheduler logInfo - Job 948 finished: insertInto at ImportHive.scala:24, took 96,188035 s

 

 

 

[INFO] 2016-11-03 00:12:17,171 org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer logInfo - Sorting complete. Writing out partition files one at a time.
[INFO] 2016-11-03 00:12:17,296 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/rescue/tsbucket=2016-11-02 09%3A00%3A00/part-r-00001-f433a41e-1b59-49af-b232-cf701e0c6df9.zlib.orc
[INFO] 2016-11-03 00:12:17,388 org.apache.spark.mapred.SparkHadoopMapRedUtil logInfo - No need to commit output of task because needsTaskCommit=false: attempt_201611030012_0949_m_000001_0
[INFO] 2016-11-03 00:12:17,388 org.apache.spark.executor.Executor logInfo - Finished task 1.0 in stage 949.0 (TID 1387). 2652 bytes result sent to driver
[INFO] 2016-11-03 00:12:17,389 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 1.0 in stage 949.0 (TID 1387) in 6892 ms on localhost (1/2)
[INFO] 2016-11-03 00:12:57,467 org.apache.spark.storage.BlockManagerInfo logInfo - Removed broadcast_2333_piece0 on 10.0.193.149:34016 in memory (size: 6.9 KB, free: 414.4 MB)
[INFO] 2016-11-03 00:13:36,195 org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer logInfo - Sorting complete. Writing out partition files one at a time.
[INFO] 2016-11-03 00:13:43,689 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/rescue/tsbucket=2016-11-02 09%3A00%3A00/part-r-00000-f433a41e-1b59-49af-b232-cf701e0c6df9.zlib.orc
[INFO] 2016-11-03 00:13:44,258 org.apache.spark.mapred.SparkHadoopMapRedUtil logInfo - No need to commit output of task because needsTaskCommit=false: attempt_201611030012_0949_m_000000_0
[INFO] 2016-11-03 00:13:44,259 org.apache.spark.executor.Executor logInfo - Finished task 0.0 in stage 949.0 (TID 1386). 2652 bytes result sent to driver
[INFO] 2016-11-03 00:13:44,259 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 0.0 in stage 949.0 (TID 1386) in 93762 ms on localhost (2/2)
[INFO] 2016-11-03 00:13:44,259 org.apache.spark.scheduler.DAGScheduler logInfo - ResultStage 949 (save at ImportHive.scala:30) finished in 93,762 s
[INFO] 2016-11-03 00:13:44,259 org.apache.spark.scheduler.TaskSchedulerImpl logInfo - Removed TaskSet 949.0, whose tasks have all completed, from pool
[INFO] 2016-11-03 00:13:44,259 org.apache.spark.scheduler.DAGScheduler logInfo - Job 949 finished: save at ImportHive.scala:30, took 93,772483 s
[INFO] 2016-11-03 00:13:44,260 org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter cleanupJob - Nothing to clean up since no temporary files were written.
[INFO] 2016-11-03 00:13:44,260 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/rescue/_SUCCESS
[INFO] 2016-11-03 00:13:44,275 org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer logInfo - Job job_201611030012_0000 committed.

 

Re: How to save each partition of a Dataframe/Dataset in parallel with partitionBy or InsertInto Hi

Expert Contributor

There are two tasks for this spark job, both should be able to run in parrallel.  It looks like there is an issue while running this locally.  How are you launching the job?  If you are setting master on command line, could there be somewhere in the code that is overriding to a single thread?

 

When launching this within YARN and multiple executors, you should see the two tasks run in parallel, but it should be possible locally as well.

Re: How to save each partition of a Dataframe/Dataset in parallel with partitionBy or InsertInto Hi

New Contributor

Hello 

hear is my code for my SparkSession in spark2.0.1: 

 

object InstanceSpark {
val spark = SparkSession.builder
.master("local[*]")
.appName("MyApp")
.enableHiveSupport().getOrCreate()
spark
}

 

Highlighted

Re: How to save each partition of a Dataframe/Dataset in parallel with partitionBy or InsertInto Hi

Expert Contributor

I've seen issues with some hardware where using local[*] doesn't use number of cores like expected.  The java method to get the number of cores available for the process isn't always consistent.  Instead, try specifying the number explicitly like local[6] and try again.