question Re: Hive CLI and Beeline jdbc:hive2 behave differently in execution engine tez for insert million records? in Archives of Support Questions (Read Only)

Hive CLI and Beeline jdbc:hive2 behave differently in execution engine tez for insert million records?

manjilhk — Tue, 07 Aug 2018 16:32:14 GMT

When executing insert into empty table from large table with millions records( 20GB size). The execution is different in hive CLI and beeline.

Hive CLI: It creates two TEZ jobs in Yarn, maybe mapper and reducer and completes in approx 413sec.

Beeline: it creates first TEZ job in Yarn and other are MAPREDUCE jobs which is more than 150 jobs and it takes almost 2 hours.

is it the expected behavior of hiveserver2 beeline for TEZ job, since internally it creates MAPREDUCE job?

Environment details:

Hive version: 2.1.1

Tez version: 0.8.5

hive-cli.txt beeline-jdbc-hs2.txt beeline-jdbc-hs2.txt

hive common settings:

hive.execution.engine=tez

hive.mv.files.thread=0

beeline setting:

tez.am.resource.memory.mb=20000

mapreduce.map.memory.mb=20000

hive.vectorized.execution.reduce.enabled=false;

Hive CLI log and Beeline logs uploaded.

Thanks in advance.

Re: Hive CLI and Beeline jdbc:hive2 behave differently in execution engine tez for insert million records?

vmurakami — Fri, 10 Aug 2018 06:28:05 GMT

Hi @manjil subba!
Just asking, but, did you apply the same parameter for both (hiveCLI/beeline)?

tez.am.resource.memory.mb=20000
mapreduce.map.memory.mb=20000
hive.vectorized.execution.reduce.enabled=false;

And answering your question, for the job afaik, they should have the same behaviour.
The only thing would be that beeline access the HS2/Thrift and hivecli does not.

Hope this helps!

Re: Hive CLI and Beeline jdbc:hive2 behave differently in execution engine tez for insert million records?

manjilhk — Fri, 10 Aug 2018 08:37:47 GMT

Hi @Vinicius Higa Murakami ,

Thanks for the response.

The parameter mentioned is only used for Beeline, Since the job was failing in Mapreduce copy job with error virtual memory used was 18G and allocated was 16.2G in yarn.

Just to explain more on the difference observed in log of hive CLI and beeline, the hdfs temp directory used are different.

Is there any configuration we need to modify to make same?

Hive CLI: hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001

Beeline : hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16 -29-12_750_8973639287951385407-1/-ext-10001

Hive Cli log:

2018-08-07T18:22:56,601 INFO [main] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/a501276d-2015-435b-85c5-4d40534ac162/hive_2018-08-07_18-22-53_167_2618699013418541798-1/hive/_tez_scratch_dir/d5cc1718-38b1-49ba-a97e-ab9f78415b62/map.xml

2018-08-07T18:22:56,669 INFO [main] fs.FSStatsPublisher: created : hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001

2018-08-07T18:22:56,686 INFO [main] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-a501276d-2015-435b-85c5-4d40534ac162, applicationId=application_1533623337748_0376, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID,

Beeline log:

2018-08-07T16:29:13,903 INFO [HiveServer2-Background-Pool: Thread-1549] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/0887b266-675a-4fb2-8c85-3a27ebb 3b9fc/hive_2018-08-07_16-29-12_750_8973639287951385407-3/hive/_tez_scratch_dir/6f4620d8-310c-4aff-bbe8-6f69ea9d1341/map.xml

2018-08-07T16:29:13,934 INFO [HiveServer2-Background-Pool: Thread-1549] fs.FSStatsPublisher: created : hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16 -29-12_750_8973639287951385407-1/-ext-10001

2018-08-07T16:29:13,938 INFO [HiveServer2-Background-Pool: Thread-1549] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-e2dfe4df-37f0-4d95- 946d-30557075f807, applicationId=application_1533623337748_0148, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID, callerId=hive_20180807162912_519c1503-c151-4da7-b5a2-bd067e9c42b9 }

Thanks.

Manjil

Re: Hive CLI and Beeline jdbc:hive2 behave differently in execution engine tez for insert million records?

vmurakami — Mon, 13 Aug 2018 21:08:34 GMT

Hello @manjil subba!
Sorry for the long delay.
I've asked for those parameters, cause they seem a little bit higher than usual (that's probably the reason, why beeline is taking much longer than hiveCli). And also the vectorization set to false should impact the reduce process as well.

You can adjust the TEZ performance by looking at this link below:

https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html

And what you can do to check are both settings (hivecli/beeline) is:

beeline -u 'jdbc:hive2://<HS2>:10000/default' -e "set;" > /tmp/beeline.properties
hive -e "set;" > /tmp/hivecli.properties
diff /tmp/beeline.properties /tmp/hivecli.properties

BTW, in your beeline logs I didn't note the following message:

Closing Tez Session

Maybe we can ensure that all parameters are equal (beeline x hivecli) and after that, enable the debug level for beeline, to check what's going on.

Hope this helps! 🙂

Re: Hive CLI and Beeline jdbc:hive2 behave differently in execution engine tez for insert million records?

manjilhk — Sun, 18 Aug 2019 03:15:19 GMT

Dear @Vinicius Higa Murakami,

Sorry for late response.

I just got the properties differences in both hivecli and beeline from client machine.

The differences are the hive.exec.scratchdir and hive.exec.stagingdir.

I have uploaded the snapshot.

I have tried this method to get the hive-site.xml for hive CLI but no output result for grep.

<code>hive --hiveconf hive.root.logger=DEBUG,console -e '' 2>&1 | grep hive-site.xml

Please suggest how to make the config hive-site.xml same for both executions.

Thanks and regards,

Manjil

Re: Hive CLI and Beeline jdbc:hive2 behave differently in execution engine tez for insert million records?

manjilhk — Tue, 21 Aug 2018 17:08:42 GMT

Update:

It is found that .hiverc was used in hive user for HIVE CLI so the difference was found.

hive.exec.scratchdir=/user/hive/scratch

hive.exec.stagingdir=/user/hive/staging

The issue is hdfs /user/hive directory is encrypted with Ranger and hdfs /tmp/hive directory is non-encrypted and can read/write by all user in hadoop group.

hive-site.xml

<name>hive.security.authorization.sqlstd.confwhitelist.append</name>

<value>hive\.exec\.scratchdir|hive\.exec\.stagingdir</value>

<description>append conf property in white list followed by pipeline</description>

</property>

Restart the metastore and hiveserver.

I tested with beeline with session level change . The execution is fast like HIVE CLI .

hive.exec.scratchdir=/user/hive/scratch

hive.exec.stagingdir=/user/hive/staging

I tested with HIVE CLI with session level change. The execution is slow with MAP reduce Job for moving data.

hive.exec.scratchdir=/tmp/hive/scratch

hive.exec.stagingdir=/tmp/hive/staging

So the root cause is data are encrypted in /user/hive and not encrypted in /tmp/hive.

Solution is to make ssession level change to use same encryption zone.

So below INFO log will be printed if the encryption zones are different.

metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16- 29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.

Thanks,

Manjil