Created on 11-03-2015 12:43 AM - edited 09-16-2022 02:47 AM
Hi
I set up CDH 5.4.8 to AWS.
hive> set hive.auto.convert.join;
hive.auto.convert.join=true
Executing query:
select AVG(commute_time) AS avg_commute_time from transportation JOIN person p ON persons = id WHERE commute_time < 10 GROUP BY nationality;
Total jobs = 1
Execution log at: /tmp/demo/demo_20151103083131_9c7f8815-a9fd-42d2-b7a3-9028fcdd81fb.log
2015-11-03 08:31:49 Starting to launch local task to process map join; maximum memory = 1029701632
2015-11-03 08:31:51 Processing rows: 200000 Hashtable size: 199999 Memory usage: 118909728 percentage: 0.115
2015-11-03 08:31:51 Processing rows: 300000 Hashtable size: 299999 Memory usage: 160068072 percentage: 0.155
2015-11-03 08:31:51 Dump the side-table for tag: 1 with group count: 300000 into file: file:/tmp/demo/facf066b-778f-447e-86b5-a9152a74c41b/hive_2015-11-03_08-31-44_617_3475565247267897601-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile21--.hashtable
2015-11-03 08:31:51 Uploaded 1 File to: file:/tmp/demo/facf066b-778f-447e-86b5-a9152a74c41b/hive_2015-11-03_08-31-44_617_3475565247267897601-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile21--.hashtable (20655402 bytes)
2015-11-03 08:31:51 End of local task; Time Taken: 2.361 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1446191535269_0136, Tracking URL = http://[server]:8088/proxy/application_1446191535269_0136/
Kill Command = /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/lib/hadoop/bin/hadoop job -kill job_1446191535269_0136
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2015-11-03 08:32:00,237 Stage-2 map = 0%, reduce = 0%
2015-11-03 08:32:11,603 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 10.73 sec
2015-11-03 08:32:18,828 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 12.44 sec
MapReduce Total cumulative CPU time: 12 seconds 440 msec
Ended Job = job_1446191535269_0136
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 12.44 sec HDFS Read: 64673839 HDFS Write: 84 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 440 msec
OK
9.22561984510033
6.97536844275076
3.4043091344593
8.97108984313809
9.47274416580304
Time taken: 36.344 seconds, Fetched: 5 row(s)
When I execute the same query via beeline:
0: jdbc:hive2://localhost:10000> select AVG(commute_time) AS avg_commute_time from transportation JOIN person p ON persons = id WHERE commute_time < 10 GROUP BY nationality;
ERROR : Execution failed with exit status: 3
ERROR : Obtaining error information
ERROR :
Task failed!
Task ID:
Stage-5
Logs:
ERROR : /var/log/hive/hadoop-cmf-hive-HIVESERVER2-[server].log.out
Getting log thread is interrupted, since query is done!
Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask (state=08S01,code=3)
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:279)
at org.apache.hive.beeline.Commands.execute(Commands.java:814)
at org.apache.hive.beeline.Commands.sql(Commands.java:670)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
0: jdbc:hive2://localhost:10000>
But when I turn hive.auto.convert.join to false it works:
0: jdbc:hive2://localhost:10000> set hive.auto.convert.join=false;
0: jdbc:hive2://localhost:10000> select AVG(commute_time) AS avg_commute_time from transportation JOIN person p ON persons = id WHERE commute_time < 10 GROUP BY nationality;
INFO : Number of reduce tasks not specified. Estimated from input data size: 2
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:2
INFO : Submitting tokens for job: job_1446191535269_0137
INFO : The url to track the job: http://[ip]:8088/proxy/application_1446191535269_0137/
INFO : Starting Job = job_1446191535269_0137, Tracking URL = http://[ip]:8088/proxy/application_1446191535269_0137/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/lib/hadoop/bin/hadoop job -kill job_1446191535269_0137
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 2
INFO : 2015-11-03 08:37:40,662 Stage-1 map = 0%, reduce = 0%
INFO : 2015-11-03 08:37:48,910 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 4.66 sec
INFO : 2015-11-03 08:37:49,941 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.75 sec
INFO : 2015-11-03 08:37:58,192 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 16.76 sec
INFO : MapReduce Total cumulative CPU time: 16 seconds 760 msec
INFO : Ended Job = job_1446191535269_0137
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:1
INFO : Submitting tokens for job: job_1446191535269_0138
INFO : The url to track the job: http://[ip]:8088/proxy/application_1446191535269_0138/
INFO : Starting Job = job_1446191535269_0138, Tracking URL = http://[ip]:8088/proxy/application_1446191535269_0138/
INFO : Kill Command = /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/lib/hadoop/bin/hadoop job -kill job_1446191535269_0138
INFO : Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
INFO : 2015-11-03 08:38:05,525 Stage-2 map = 0%, reduce = 0%
INFO : 2015-11-03 08:38:12,738 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.18 sec
INFO : 2015-11-03 08:38:18,932 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.87 sec
INFO : MapReduce Total cumulative CPU time: 2 seconds 870 msec
INFO : Ended Job = job_1446191535269_0138
Getting log thread is interrupted, since query is done!
+--------------------------------+--+
| avg_commute_time |
+--------------------------------+--+
| 9.22561984510033 |
| 6.97536844275076 |
| 3.4043091344593 |
| 8.97108984313809 |
| 9.47274416580304 |
+--------------------------------+--+
5 rows selected (46.722 seconds)
0: jdbc:hive2://localhost:10000>
How to explain that? I'd like to set hive.auto.convert.join=true because users who make queries via HUE do not know what is map side joins.
Best regards, Margus (margusja) Roo