Created 08-07-2018 09:32 AM
When executing insert into empty table from large table with millions records( 20GB size). The execution is different in hive CLI and beeline.
Hive CLI: It creates two TEZ jobs in Yarn, maybe mapper and reducer and completes in approx 413sec.
Beeline: it creates first TEZ job in Yarn and other are MAPREDUCE jobs which is more than 150 jobs and it takes almost 2 hours.
is it the expected behavior of hiveserver2 beeline for TEZ job, since internally it creates MAPREDUCE job?
Environment details:
Hive version: 2.1.1
Tez version: 0.8.5
hive-cli.txtbeeline-jdbc-hs2.txtbeeline-jdbc-hs2.txt
hive common settings:
hive.execution.engine=tez
hive.mv.files.thread=0
beeline setting:
tez.am.resource.memory.mb=20000
mapreduce.map.memory.mb=20000
hive.vectorized.execution.reduce.enabled=false;
Hive CLI log and Beeline logs uploaded.
Thanks in advance.
Created 08-21-2018 10:08 AM
Update:
It is found that .hiverc was used in hive user for HIVE CLI so the difference was found.
hive.exec.scratchdir=/user/hive/scratch
hive.exec.stagingdir=/user/hive/staging
The issue is hdfs /user/hive directory is encrypted with Ranger and hdfs /tmp/hive directory is non-encrypted and can read/write by all user in hadoop group.
hive-site.xml
<property>
<name>hive.security.authorization.sqlstd.confwhitelist.append</name>
<value>hive\.exec\.scratchdir|hive\.exec\.stagingdir</value>
<description>append conf property in white list followed by pipeline</description>
</property>
Restart the metastore and hiveserver.
I tested with beeline with session level change . The execution is fast like HIVE CLI .
hive.exec.scratchdir=/user/hive/scratch
hive.exec.stagingdir=/user/hive/staging
I tested with HIVE CLI with session level change. The execution is slow with MAP reduce Job for moving data.
hive.exec.scratchdir=/tmp/hive/scratch
hive.exec.stagingdir=/tmp/hive/staging
So the root cause is data are encrypted in /user/hive and not encrypted in /tmp/hive.
Solution is to make ssession level change to use same encryption zone.
So below INFO log will be printed if the encryption zones are different.
metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16- 29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.
Thanks,
Manjil
Created 08-09-2018 11:28 PM
Hi @manjil subba!
Just asking, but, did you apply the same parameter for both (hiveCLI/beeline)?
tez.am.resource.memory.mb=20000 mapreduce.map.memory.mb=20000 hive.vectorized.execution.reduce.enabled=false;
And answering your question, for the job afaik, they should have the same behaviour.
The only thing would be that beeline access the HS2/Thrift and hivecli does not.
Hope this helps!
Created 08-10-2018 01:37 AM
Thanks for the response.
The parameter mentioned is only used for Beeline, Since the job was failing in Mapreduce copy job with error virtual memory used was 18G and allocated was 16.2G in yarn.
Just to explain more on the difference observed in log of hive CLI and beeline, the hdfs temp directory used are different.
Is there any configuration we need to modify to make same?
Hive CLI: hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001
Beeline : hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16 -29-12_750_8973639287951385407-1/-ext-10001
Hive Cli log:
2018-08-07T18:22:56,601 INFO [main] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/a501276d-2015-435b-85c5-4d40534ac162/hive_2018-08-07_18-22-53_167_2618699013418541798-1/hive/_tez_scratch_dir/d5cc1718-38b1-49ba-a97e-ab9f78415b62/map.xml
2018-08-07T18:22:56,669 INFO [main] fs.FSStatsPublisher: created : hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001
2018-08-07T18:22:56,686 INFO [main] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-a501276d-2015-435b-85c5-4d40534ac162, applicationId=application_1533623337748_0376, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID,
Beeline log:
2018-08-07T16:29:13,903 INFO [HiveServer2-Background-Pool: Thread-1549] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/0887b266-675a-4fb2-8c85-3a27ebb 3b9fc/hive_2018-08-07_16-29-12_750_8973639287951385407-3/hive/_tez_scratch_dir/6f4620d8-310c-4aff-bbe8-6f69ea9d1341/map.xml
2018-08-07T16:29:13,934 INFO [HiveServer2-Background-Pool: Thread-1549] fs.FSStatsPublisher: created : hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16 -29-12_750_8973639287951385407-1/-ext-10001
2018-08-07T16:29:13,938 INFO [HiveServer2-Background-Pool: Thread-1549] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-e2dfe4df-37f0-4d95- 946d-30557075f807, applicationId=application_1533623337748_0148, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID, callerId=hive_20180807162912_519c1503-c151-4da7-b5a2-bd067e9c42b9 }
Thanks.
Manjil
Created 08-13-2018 02:08 PM
Hello @manjil subba!
Sorry for the long delay.
I've asked for those parameters, cause they seem a little bit higher than usual (that's probably the reason, why beeline is taking much longer than hiveCli). And also the vectorization set to false should impact the reduce process as well.
You can adjust the TEZ performance by looking at this link below:
https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html
And what you can do to check are both settings (hivecli/beeline) is:
beeline -u 'jdbc:hive2://<HS2>:10000/default' -e "set;" > /tmp/beeline.properties
hive -e "set;" > /tmp/hivecli.properties diff /tmp/beeline.properties /tmp/hivecli.properties
BTW, in your beeline logs I didn't note the following message:
Closing Tez Session
Maybe we can ensure that all parameters are equal (beeline x hivecli) and after that, enable the debug level for beeline, to check what's going on.
Hope this helps! 🙂
Created on 08-20-2018 06:34 AM - edited 08-17-2019 08:15 PM
Dear @Vinicius Higa Murakami,
Sorry for late response.
I just got the properties differences in both hivecli and beeline from client machine.
The differences are the hive.exec.scratchdir and hive.exec.stagingdir.
I have uploaded the snapshot.
I have tried this method to get the hive-site.xml for hive CLI but no output result for grep.
<code>hive --hiveconf hive.root.logger=DEBUG,console -e '' 2>&1 | grep hive-site.xml
Please suggest how to make the config hive-site.xml same for both executions.
Thanks and regards,
Manjil
Created 08-21-2018 10:08 AM
Update:
It is found that .hiverc was used in hive user for HIVE CLI so the difference was found.
hive.exec.scratchdir=/user/hive/scratch
hive.exec.stagingdir=/user/hive/staging
The issue is hdfs /user/hive directory is encrypted with Ranger and hdfs /tmp/hive directory is non-encrypted and can read/write by all user in hadoop group.
hive-site.xml
<property>
<name>hive.security.authorization.sqlstd.confwhitelist.append</name>
<value>hive\.exec\.scratchdir|hive\.exec\.stagingdir</value>
<description>append conf property in white list followed by pipeline</description>
</property>
Restart the metastore and hiveserver.
I tested with beeline with session level change . The execution is fast like HIVE CLI .
hive.exec.scratchdir=/user/hive/scratch
hive.exec.stagingdir=/user/hive/staging
I tested with HIVE CLI with session level change. The execution is slow with MAP reduce Job for moving data.
hive.exec.scratchdir=/tmp/hive/scratch
hive.exec.stagingdir=/tmp/hive/staging
So the root cause is data are encrypted in /user/hive and not encrypted in /tmp/hive.
Solution is to make ssession level change to use same encryption zone.
So below INFO log will be printed if the encryption zones are different.
metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16- 29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.
Thanks,
Manjil