Member since: 04-18-2016
Posts: 4
Kudos Received: 3
Solutions: 0
04-20-2016
01:55 PM
By the way, here is the explain plan:

hive> SET MAPRED.REDUCE.TASKS=10;
hive> explain insert into accesslog.accesslog_new_test PARTITION (DT) select * from accesslog.accesslog DISTRIBUTE BY DT;
OK
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)

Stage-3
  Stats-Aggr Operator
Stage-0
  Move Operator
    partition:{}
    table:{"serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"accesslog.accesslog_new_test","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"}
Stage-2
  Dependency Collection{}
Stage-1
  Reducer 3
    File Output Operator [FS_6]
      compressed:true
      Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
      table:{"serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"accesslog.accesslog_new_test","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"}
      Select Operator [SEL_5]
      |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
      |  Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
      |<-Reducer 2 [SIMPLE_EDGE]
         Reduce Output Operator [RS_4]
            Map-reduce partition columns:_col1 (type: string)
            sort order:
            Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
            value expressions:_col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: map<string,string>), _col12 (type: string)
            Select Operator [SEL_3]
            |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
            |  Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
            |<-Map 1 [SIMPLE_EDGE]
               Reduce Output Operator [RS_2]
                  Map-reduce partition columns:_col12 (type: string)
                  sort order:
                  Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
                  value expressions:_col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: map<string,string>), _col12 (type: string)
                  Select Operator [SEL_1]
                     outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
                     Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
                     TableScan [TS_0]
                        alias:accesslog
                        Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
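One thing worth noting about the session setting above: `MAPRED.REDUCE.TASKS` is the old MR1 property name, and Hive derives the reducer count differently unless told otherwise. As a hedged sketch (these are the stock Hive configuration properties, not anything taken from this cluster):

```sql
-- Let Hive derive the count from input size (bytes handled per reducer):
SET hive.exec.reducers.bytes.per.reducer=268435456;
-- Or cap / fix the number of reducers directly:
SET hive.exec.reducers.max=1009;
SET mapreduce.job.reduces=10;   -- MR2 name for the old mapred.reduce.tasks
```

Running a bare `SET mapreduce.job.reduces;` in the hive shell prints the value the session actually sees, which is a quick way to confirm the setting took effect.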
04-20-2016
01:54 PM
I ran the following yesterday afternoon, and it took about the same time as the original copy. Here are the results, which are about the same as the original test (which did not set mapred.reduce.tasks and did not use the DISTRIBUTE BY clause): 5.5 hours to copy the table with about 750 million rows in our test system. Are my expectations off base that this should be something Hadoop can do much faster?

SET MAPRED.REDUCE.TASKS=10;
insert into accesslog.accesslog_new_test PARTITION (DT) select * from accesslog.accesslog DISTRIBUTE BY DT;

Status: Running (Executing on YARN cluster with App id application_1460584845937_0014)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED    356        356        0        0       0       0
Reducer 3 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 19286.39 s
--------------------------------------------------------------------------------
Loading data to table accesslog.accesslog_new_test partition (dt=null)
Time taken for load dynamic partitions : 15703
Loading partition {dt=140125}
Loading partition {dt=140124}
Loading partition {dt=140202}
Loading partition {dt=130210}
...
Loading partition {dt=140123}
Loading partition {dt=160214}
Loading partition {dt=140105}
Time taken for adding to write entity : 27
Partition accesslog.accesslog_new_test{dt=130129} stats: [numFiles=1, numRows=2686851, totalSize=46373620, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=130130} stats: [numFiles=1, numRows=14329847, totalSize=240074232, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=130131} stats: [numFiles=1, numRows=14059931, totalSize=235402253, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=130201} stats: [numFiles=1, numRows=12976777, totalSize=223349207, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=130202} stats: [numFiles=1, numRows=4335922, totalSize=71477106, rawDataSize=0]
...
Partition accesslog.accesslog_new_test{dt=160229} stats: [numFiles=1, numRows=11316976, totalSize=217067108, rawDataSize=0]
Partition accesslog.accesslog_new_test{dt=160301} stats: [numFiles=1, numRows=8730842, totalSize=162107356, rawDataSize=0]
OK
Time taken: 19320.464 seconds
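The `numFiles=1` in every partition's stats is a clue here: `DISTRIBUTE BY DT` routes all rows for a given date to a single reducer, so each partition is written by one task and the largest dates become the bottleneck even with 356 reducers available. A hedged variation worth trying (the `RAND()` bucket is a common workaround for this pattern, not something from this thread) lets several reducers share each date:

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO accesslog.accesslog_new_test PARTITION (DT)
SELECT * FROM accesslog.accesslog
-- The random bucket spreads each date over ~8 reducers (and ~8 output files):
DISTRIBUTE BY DT, CAST(RAND() * 8 AS INT);
```

The trade-off is more (smaller) ORC files per partition, so the bucket count should stay modest.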
04-18-2016
07:09 PM
Thanks for the responses, guys. I'm still working through this on our test cluster with a smaller set of data: 750+ million rows. But so far I've not had any luck. Apologies for the formatting issues, but here's an example along with the explain plan. I wanted to give it enough time to hit the reducer step before I declared victory or defeat on the 'set mapreduce.job.reduces=10;' setting. I'll try more options when I get the time.

hive> set mapreduce.job.reduces=10;
hive> explain insert into accesslog_new PARTITION (DT) select * from accesslog;
OK
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-3
Stats-Aggr Operator
Stage-0
Move Operator
partition:{}
table:{"serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"accesslog.accesslog_new","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"}
Stage-2
Dependency Collection{}
Stage-1
Reducer 2
File Output Operator [FS_4]
compressed:true
Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
table:{"serde:":"org.apache.hadoop.hive.ql.io.orc.OrcSerde","name:":"accesslog.accesslog_new","input format:":"org.apache.hadoop.hive.ql.io.orc.OrcInputFormat","output format:":"org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"}
Select Operator [SEL_3]
| outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
| Statistics:Num rows: 759094090 Data size: 139673312560 Basic stats: COMPLETE Column stats: PARTIAL
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_2]
Map-reduce partition columns:_col1 (type: string)
sort order:
Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
value expressions:_col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: map<string,string>), _col12 (type: string)
Select Operator [SEL_1]
outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12"]
Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
TableScan [TS_0]
alias:accesslog
Statistics:Num rows: 759094090 Data size: 11901499527 Basic stats: COMPLETE Column stats: PARTIAL
Time taken: 6.483 seconds, Fetched: 35 row(s)
hive> insert into accesslog_new PARTITION (DT) select * from accesslog;
Query ID = svchadoop_20160418123532_e3be4118-b866-4b88-ae52-42d1e574a9fe
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1460584845937_0011)
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 RUNNING 1 0 1 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/02 [=============>>-------------] 50% ELAPSED TIME: 5094.71 s
--------------------------------------------------------------------------------
04-18-2016
04:04 PM
3 Kudos
I'm attempting to copy 30 billion rows from one Hive table into another. Both tables have the same definition and are partitioned on date (DT); currently there are 1173 partitions. I'm using the following query:

insert into accesslog_new PARTITION (DT) select * from accesslog;

This query has been running for almost 3 days straight on a cluster with 18 data nodes. My issue is that the MapReduce job only creates one reducer step, which I'm guessing is drastically slowing things down. (By the way, we are using MR2.) Is there a way to force the number of reducers to be much larger? And how do you figure out what an appropriate number of reducers would be for that volume of data?
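On the "appropriate number of reducers" question, a rough back-of-envelope follows Hive's usual heuristic: total input bytes divided by the bytes each reducer should handle (`hive.exec.reducers.bytes.per.reducer`), capped at `hive.exec.reducers.max`. The defaults below (256 MB per reducer, max 1009) are assumptions, not values read from this cluster:

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=256 * 1024**2, max_reducers=1009):
    """Approximate Hive's reducer-count heuristic: ceil(input / per-reducer), capped."""
    return min(max_reducers, max(1, math.ceil(input_bytes / bytes_per_reducer)))

# Data size reported by the explain plans later in this thread: ~139.7 GB
print(estimate_reducers(139_673_312_560))  # → 521
```

By that rule of thumb, a single reducer for hundreds of gigabytes is far off the mark, which matches the symptom described above.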
Labels:
- Apache Hadoop
- Apache Hive