About damdr

damdr · ‎11-04-2016

Hello hear is my code for my SparkSession in spark2.0.1: object InstanceSpark { val spark = SparkSession.builder .master("local[*]") .appName("MyApp") .enableHiveSupport().getOrCreate() spark }

damdr · ‎11-03-2016

Hello hubbarja . Thanks for your feedback. You will find below an extract logs .In the first example it is the "InserInto(tablehivealreadypartitionned)" in hive. We can see that all "partitions" Spark are written one by one. In the second example it is the "partitionBy().save()" that write directly to S3. We can see also that all "partitions" spark are written one by one. The dataframe we handle only has one "partition" and the size of it is about 200MB uncompressed (in memory). The Job can Take 120s 170s to save the Data with the option local[4] . INFO] 2016-11-03 00:10:33,255 org.apache.spark.SparkContext logInfo - Created broadcast 2330 from broadcast at TorExitLookup.scala:43 [INFO] 2016-11-03 00:10:35,302 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer logInfo - Sorting complete. Writing out partition files one at a time. [INFO] 2016-11-03 00:10:35,363 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/db/.hive-staging_hive_2016-11-03_00-10-29_426_1749488585639143697-1/-ext-10000/tsbucket=2016-11-02 09%3A00%3A00/part-00001 [INFO] 2016-11-03 00:10:35,380 org.apache.spark.mapred.SparkHadoopMapRedUtil logInfo - No need to commit output of task because needsTaskCommit=false: attempt_201611030010_0948_m_000001_0 [INFO] 2016-11-03 00:10:35,380 org.apache.spark.executor.Executor logInfo - Finished task 1.0 in stage 948.0 (TID 1385). 2652 bytes result sent to driver [INFO] 2016-11-03 00:10:35,381 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 1.0 in stage 948.0 (TID 1385) in 5718 ms on localhost (1/2) [INFO] 2016-11-03 00:11:23,033 org.apache.spark.storage.BlockManagerInfo logInfo - Removed broadcast_2330_piece0 on 10.0.193.149:34016 in memory (size: 6.9 KB, free: 414.4 MB) [INFO] 2016-11-03 00:11:58,194 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer logInfo - Sorting complete. Writing out partition files one at a time. [INFO] 2016-11-03 00:12:00,210 org.apache.spark.storage.BlockManagerInfo logInfo - Removed broadcast_2329_piece0 on 10.0.193.149:34016 in memory (size: 6.9 KB, free: 414.4 MB) [INFO] 2016-11-03 00:12:05,295 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/db/.hive-staging_hive_2016-11-03_00-10-29_426_1749488585639143697-1/-ext-10000/tsbucket=2016-11-02 09%3A00%3A00/part-00000 [INFO] 2016-11-03 00:12:05,831 org.apache.spark.mapred.SparkHadoopMapRedUtil logInfo - No need to commit output of task because needsTaskCommit=false: attempt_201611030010_0948_m_000000_0 [INFO] 2016-11-03 00:12:05,835 org.apache.spark.executor.Executor logInfo - Finished task 0.0 in stage 948.0 (TID 1384). 2652 bytes result sent to driver [INFO] 2016-11-03 00:12:05,835 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 0.0 in stage 948.0 (TID 1384) in 96173 ms on localhost (2/2) [INFO] 2016-11-03 00:12:05,835 org.apache.spark.scheduler.DAGScheduler logInfo - ResultStage 948 (insertInto at ImportHive.scala:24) finished in 96,173 s [INFO] 2016-11-03 00:12:05,835 org.apache.spark.scheduler.TaskSchedulerImpl logInfo - Removed TaskSet 948.0, whose tasks have all completed, from pool [INFO] 2016-11-03 00:12:05,836 org.apache.spark.scheduler.DAGScheduler logInfo - Job 948 finished: insertInto at ImportHive.scala:24, took 96,188035 s [INFO] 2016-11-03 00:12:17,171 org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer logInfo - Sorting complete. Writing out partition files one at a time. [INFO] 2016-11-03 00:12:17,296 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/rescue/tsbucket=2016-11-02 09%3A00%3A00/part-r-00001-f433a41e-1b59-49af-b232-cf701e0c6df9.zlib.orc [INFO] 2016-11-03 00:12:17,388 org.apache.spark.mapred.SparkHadoopMapRedUtil logInfo - No need to commit output of task because needsTaskCommit=false: attempt_201611030012_0949_m_000001_0 [INFO] 2016-11-03 00:12:17,388 org.apache.spark.executor.Executor logInfo - Finished task 1.0 in stage 949.0 (TID 1387). 2652 bytes result sent to driver [INFO] 2016-11-03 00:12:17,389 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 1.0 in stage 949.0 (TID 1387) in 6892 ms on localhost (1/2) [INFO] 2016-11-03 00:12:57,467 org.apache.spark.storage.BlockManagerInfo logInfo - Removed broadcast_2333_piece0 on 10.0.193.149:34016 in memory (size: 6.9 KB, free: 414.4 MB) [INFO] 2016-11-03 00:13:36,195 org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer logInfo - Sorting complete. Writing out partition files one at a time. [INFO] 2016-11-03 00:13:43,689 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/rescue/tsbucket=2016-11-02 09%3A00%3A00/part-r-00000-f433a41e-1b59-49af-b232-cf701e0c6df9.zlib.orc [INFO] 2016-11-03 00:13:44,258 org.apache.spark.mapred.SparkHadoopMapRedUtil logInfo - No need to commit output of task because needsTaskCommit=false: attempt_201611030012_0949_m_000000_0 [INFO] 2016-11-03 00:13:44,259 org.apache.spark.executor.Executor logInfo - Finished task 0.0 in stage 949.0 (TID 1386). 2652 bytes result sent to driver [INFO] 2016-11-03 00:13:44,259 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 0.0 in stage 949.0 (TID 1386) in 93762 ms on localhost (2/2) [INFO] 2016-11-03 00:13:44,259 org.apache.spark.scheduler.DAGScheduler logInfo - ResultStage 949 (save at ImportHive.scala:30) finished in 93,762 s [INFO] 2016-11-03 00:13:44,259 org.apache.spark.scheduler.TaskSchedulerImpl logInfo - Removed TaskSet 949.0, whose tasks have all completed, from pool [INFO] 2016-11-03 00:13:44,259 org.apache.spark.scheduler.DAGScheduler logInfo - Job 949 finished: save at ImportHive.scala:30, took 93,772483 s [INFO] 2016-11-03 00:13:44,260 org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter cleanupJob - Nothing to clean up since no temporary files were written. [INFO] 2016-11-03 00:13:44,260 com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream close - close closed:false s3://BUCKETS3/rescue/_SUCCESS [INFO] 2016-11-03 00:13:44,275 org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer logInfo - Job job_201611030012_0000 committed.

damdr · ‎10-31-2016

Hello I currently use spark 2.0.1 and i try to save my dataset into a "partitioned table Hive" with insertInto() or on S3 storage with partitionBy("col") with job in concurrency (parallel). But with this 2 methods each partition of my dataset is save sequentially one by one . It's very very SLOW. I already know that I must use insertInto() or partitionBy() one at time. I assume that in spark.2.0.1 Dataframe are Resilient Data Set . My current code : df.write.mode(SaveMode.Append).partitionBy("col").save("s3://bucket/diroutput") Or df.write.mode(SaveMode.Append).insertInto("TableHivealreadypartitioned") So I try some stuff with df.foreachPartition like this : df.foreachPartition{datasetpartition => datasetpartition.foreach(row => row.sometransformation)} Unfortunately i still do not find a way to write/save in parallel each partition of my dataset. Someone already done this? Can you tell me how to proceed? Is it a wrong direction? thanks for your help

Online	Offline
Last Visited	‎11-04-2016 04:05 AM

Member Since	‎10-31-2016 02:23 PM
Last Visited	‎11-04-2016 04:05 AM
Posts	3

Cloudera Community

Re: How to save each partition of a Dataframe/Data...

Re: How to save each partition of a Dataframe/Data...

How to save each partition of a Dataframe/Dataset ...