Member since: 04-18-2016
Posts: 4
Kudos Received: 3
Solutions: 0
04-20-2016
01:55 PM
By the way, here is the explain plan:

hive> SET MAPRED.REDUCE.TASKS=10;
hive> explain insert into accesslog.accesslog_new_test PARTITION (DT) select * from accesslog.accesslog DISTRIBUTE BY DT;
OK
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)

Stage-3
  Stats-Aggr Operator
Stage-0
  Move Operator
    partition:{}
    table:{"serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"accesslog.accesslog_new_test","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"}
Stage-2
  Dependency Collection{}
Stage-1
  Reducer 3
    File Output Operator [FS_6]
      compressed:true
      Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
      table:{"serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"accesslog.accesslog_new_test","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"}
      Select Operator [SEL_5]
      |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
      |  Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
      |<-Reducer 2 [SIMPLE_EDGE]
         Reduce Output Operator [RS_4]
            Map-reduce partition columns:_col1 (type: string)
            sort order:
            Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
            value expressions:_col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: map<string,string>), _col12 (type: string)
            Select Operator [SEL_3]
            |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
            |  Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
            |<-Map 1 [SIMPLE_EDGE]
               Reduce Output Operator [RS_2]
                  Map-reduce partition columns:_col12 (type: string)
                  sort order:
                  Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
                  value expressions:_col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: map<string,string>), _col12 (type: string)
                  Select Operator [SEL_1]
                     outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
                     Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
                     TableScan [TS_0]
                        alias:accesslog
                        Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
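One thing worth noting about the session setting above: `MAPRED.REDUCE.TASKS` is the old MR1 property name, and Hive derives the reducer count differently unless told otherwise. As a hedged sketch (these are the stock Hive configuration properties, not anything taken from this cluster):

```sql
-- Let Hive derive the count from input size (bytes handled per reducer):
SET hive.exec.reducers.bytes.per.reducer=268435456;
-- Or cap / fix the number of reducers directly:
SET hive.exec.reducers.max=1009;
SET mapreduce.job.reduces=10;   -- MR2 name for the old mapred.reduce.tasks
```

Running a bare `SET mapreduce.job.reduces;` in the hive shell prints the value the session actually sees, which is a quick way to confirm the setting took effect.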
04-20-2016
01:54 PM
I ran the following yesterday afternoon, and it took about the same time as the original copy. Here are the results, which are about the same as the original test (which did not set mapred.reduce.tasks and did not use the DISTRIBUTE BY clause): 5.5 hours to copy the table with about 750 million rows in our test system. Are my expectations off base that this should be something Hadoop can do much faster?

SET MAPRED.REDUCE.TASKS=10;
insert into accesslog.accesslog_new_test PARTITION (DT) select * from accesslog.accesslog DISTRIBUTE BY DT;

Status: Running (Executing on YARN cluster with App id application_1460584845937_0014)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED    356        356        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 19286.39 s
--------------------------------------------------------------------------------
Loading data to table accesslog.accesslog_new_test partition (dt=null)
Time taken for load dynamic partitions : 15703
Loading partition {dt=140125}
Loading partition {dt=140124}
Loading partition {dt=140202}
Loading partition {dt=130210}
...
Loading partition {dt=140123}
Loading partition {dt=160214}
Loading partition {dt=140105}
Time taken for adding to write entity : 27
Partition accesslog.accesslog_new_test{dt=130129} stats: [numFiles=1, numRows=2686851, totalSize=46373620, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=130130} stats: [numFiles=1, numRows=14329847, totalSize=240074232, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=130131} stats: [numFiles=1, numRows=14059931, totalSize=235402253, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=130201} stats: [numFiles=1, numRows=12976777, totalSize=223349207, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=130202} stats: [numFiles=1, numRows=4335922, totalSize=71477106, rawDataSize=0]
...
Partition accesslog.accesslog_new_test{dt=160229} stats: [numFiles=1, numRows=11316976, totalSize=217067108, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=160301} stats: [numFiles=1, numRows=8730842, totalSize=162107356, rawDataSize=0]
OK
Time taken: 19320.464 seconds
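The `numFiles=1` in every partition's stats is a clue here: `DISTRIBUTE BY DT` routes all rows for a given date to a single reducer, so each partition is written by one task and the largest dates become the bottleneck even with 356 reducers available. A hedged variation worth trying (the `RAND()` bucket is a common workaround for this pattern, not something from this thread) lets several reducers share each date:

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO accesslog.accesslog_new_test PARTITION (DT)
SELECT * FROM accesslog.accesslog
-- The random bucket spreads each date over ~8 reducers (and ~8 output files):
DISTRIBUTE BY DT, CAST(RAND() * 8 AS INT);
```

The trade-off is more (smaller) ORC files per partition, so the bucket count should stay modest.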
04-18-2016
07:09 PM
Thanks for the responses, guys. I'm still working through this on our test cluster with a smaller set of data: 750+ million rows. But so far I've not had any luck. Apologies for the formatting issues, but here's an example along with the explain plan. I wanted to give it enough time to hit the reducer step before I declared victory or defeat on the 'set mapreduce.job.reduces=10;' setting. I'll try more options when I get the time.

hive> set mapreduce.job.reduces=10;
hive> explain insert into accesslog_new PARTITION (DT) select * from accesslog;
OK
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-3
Stats-Aggr Operator
Stage-0
Move Operator
partition:{}
table:{"serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"accesslog.accesslog_new","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"}
Stage-2
Dependency Collection{}
Stage-1
Reducer 2
File Output Operator [FS_4]
compressed:true
Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
table:{"serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"accesslog.accesslog_new","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"}
Select Operator [SEL_3]
| outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
| Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_2]
Map-reduce partition columns:_col1 (type: string)
sort order:
Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
value expressions:_col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: map<string,string>), _col12 (type: string)
Select Operator [SEL_1]
outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
TableScan [TS_0]
alias:accesslog
Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
Time taken: 6.483 seconds, Fetched: 35 row(s)
hive> insert into accesslog_new PARTITION (DT) select * from accesslog;
Query ID = svchadoop_20160418123532_e3be4118-b866-4b88-ae52-42d1e574a9fe
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1460584845937_0011)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 RUNNING 1 0 1 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/02 [=============>>-------------] 50% ELAPSED TIME: 5094.71 s
--------------------------------------------------------------------------------
04-18-2016
04:04 PM
3 Kudos
I'm attempting to copy 30 billion rows from one Hive table into another. Both tables have the same definition and are partitioned on date (DT); currently there are 1173 partitions. I'm using the following query:

insert into accesslog_new PARTITION (DT) select * from accesslog;

This query has been running for almost 3 days straight on a cluster with 18 data nodes. My issue is that the MapReduce job only creates one reducer step, which I'm guessing is drastically slowing things down. (By the way, we are using MR2.) Is there a way to force the number of reducers to be much larger? And how do you figure out what an appropriate number of reducers would be for that volume of data?
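On the "appropriate number of reducers" question, a rough back-of-envelope follows Hive's usual heuristic: total input bytes divided by the bytes each reducer should handle (`hive.exec.reducers.bytes.per.reducer`), capped at `hive.exec.reducers.max`. The defaults below (256 MB per reducer, max 1009) are assumptions, not values read from this cluster:

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=256 * 1024**2, max_reducers=1009):
    """Approximate Hive's reducer-count heuristic: ceil(input / per-reducer), capped."""
    return min(max_reducers, max(1, math.ceil(input_bytes / bytes_per_reducer)))

# Data size reported by the explain plans later in this thread: ~139.7 GB
print(estimate_reducers(139_673_312_560))  # → 521
```

By that rule of thumb, a single reducer for hundreds of gigabytes is far off the mark, which matches the symptom described above.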
Labels:
- Apache Hadoop
- Apache Hive