Support Questions

michaelli · 06-08-2021

Hi guys,

How can I get user-specified configuration data in a hive udf? What is your solution?

I think there are basically four ways to achieve this goal:

1. Use an hdfs file to store the configuration data in xml/json/txt format, and read that hdfs file in the udf. Users change the hdfs file when they want to change specific configuration parameters;

2. Use a local xml/json/txt file packed into the udf jar, and read that local file in the udf. Users change the local file and repack it into the udf jar when they want to change specific configuration parameters;

3. Use a hive table to store the configuration data, and read the table with hive sql statements inside the udf (I know this sounds strange, since normally we don't issue sql queries to hiveserver2 from a udf, but it should also be possible). Users can use hive sql dml to change specific configuration parameters when they need to;

4. Use a hive table to store the configuration data, and read the underlying hdfs file in the udf to get the configuration details (of course, the configuration table then needs to be in a text format such as csv). Users can use hive sql dml to change specific configuration parameters when they need to.
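For options 1 and 2, the reading part could look roughly like the sketch below. It is only an illustration under some assumptions: the configuration is a key=value text file, and the class name `UdfConfig`, the resource name and the sample keys are all made up. Only the JDK is used, so for option 1 you would have to obtain the `InputStream` from the hdfs client yourself and pass it to `load`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper (not part of Hive): reads key=value configuration for a udf.
class UdfConfig {

    // Parse key=value pairs from any stream: a classpath resource, an hdfs file, etc.
    static Properties load(InputStream in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    // Option 2: read a file that was packed into the udf jar.
    // The resource name is whatever you chose when building the jar.
    static Properties fromClasspath(String resource) throws IOException {
        try (InputStream in = UdfConfig.class.getClassLoader().getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("resource not found on classpath: " + resource);
            }
            return load(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real file content, so the sketch runs without a jar or hdfs.
        byte[] sample = "threshold=20\nregion=east\n".getBytes(StandardCharsets.ISO_8859_1);
        Properties p = load(new ByteArrayInputStream(sample));
        System.out.println(p.getProperty("threshold")); // prints 20
    }
}
```

In a real udf you would probably load the configuration once during initialization and keep it in a field, rather than reading the file for every row.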

What is your way of achieving this? Has anyone used method 3 above?

keep striving!

michaelli · 06-09-2021

A self update on solution 3:

I tested this solution in my CDH 6.2 environment by using beeline to connect to hiveserver2 and issuing a sql query that calls the udf (the udf itself contains code that connects to hiveserver2 and issues sql queries). It turns out that this kind of udf usage stops the hiveserver2 service from functioning properly: the udf call hangs for a long time and returns no result (I think it will hang forever, until we restart the hiveserver2 service). Meanwhile, other beeline clients can still connect to hiveserver2, but no sql statement executes successfully; even a simple "show databases" command hangs for a long time with no result returned (again, I think until we restart the hiveserver2 service).

I think this kind of udf usage, with sql queries against hiveserver2 inside the udf, WILL NOT function properly: when we submit the udf call to hiveserver2, the udf is first analyzed by hiveserver2 itself, which means the sql call inside the udf code is also executed by hiveserver2. It connects to hiveserver2 and issues sql queries against itself, which makes the server-side hiveserver2 also a client.

keep striving!

michaelli · 06-09-2021

A final update on solution 3:

1. When you use beeline to connect to a hiveserver2 instance (let's call it hiveserver2-instance1) and submit a statement like "select udfTest(20210101) from testTableA", and the udf itself contains java code that connects to the same instance hiveserver2-instance1 and executes any statement, hiveserver2-instance1 will stop functioning properly;

2. When you use beeline to connect to a hiveserver2 instance (let's call it hiveserver2-instance1) and submit a statement like "select udfTest(20210101) from testTableA", and the udf itself contains java code that connects to another hiveserver2 instance, such as hiveserver2-instance2, and executes any statement, then both hiveserver2-instance1 and hiveserver2-instance2 will function properly;

3. When you use hive --service cli to submit a statement like "select udfTest(20210101) from testTableA", and the udf calls a hiveserver2 instance such as hiveserver2-instance1, then both the hive cli and the hiveserver2 instance will function properly;

4. When you use beeline to connect to a hiveserver2 instance (let's call it hiveserver2-instance1) and submit a statement like "select udfTest(user_code) from testTableA", and the udf itself contains java code that connects to the same instance hiveserver2-instance1 and executes any statement, then hiveserver2-instance1 will function properly.

The root cause is whether the same hiveserver2 instance ends up acting as both the hive sql client and the hive sql server. This is not the case in scenarios 2 and 3, where a different hiveserver2 instance or hive --service cli is used. It is also not the case in scenario 4: there a mr/tez/spark job is generated and scheduled to run in a yarn container, which acts as the sql client and connects back to hiveserver2 to submit sql. But it is the case in scenario 1: when hiveserver2 analyzes and compiles the statement "select udfTest(20210101) from testTableA", it finds that no map task needs to be generated (we pass the constant 20210101, so no table records need to be fetched). So as part of the analyze and compile process, it connects to itself and tries to execute the sql call itself, which makes it both the sql client and the sql server.
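To picture why "client and server in the same process" can hang, here is a small analogy in plain java. It is NOT hiveserver2's real code or threading model, which I have not inspected; it just assumes a "server" with a single worker, where the running task calls back into the same server and waits for the answer. The class and method names are made up.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Analogy only: a one-worker "server" whose task submits a request to itself.
class SelfCallDemo {

    // Returns true if the nested call did not complete within the timeout.
    static boolean selfCallTimesOut(long timeoutMillis) throws Exception {
        ExecutorService server = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> outer = server.submit(() -> {
                // The "udf" asks the same server for data while still occupying its only worker.
                Future<String> inner = server.submit(() -> "config-value");
                try {
                    inner.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return false; // the nested call was served
                } catch (TimeoutException e) {
                    return true;  // the nested call was stuck behind its own caller
                }
            });
            return outer.get();
        } finally {
            server.shutdownNow(); // discard the queued inner request
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(selfCallTimesOut(200)); // prints true
    }
}
```

Without the timeout the outer task would wait forever, which is similar to what I saw in scenario 1. With a second, independent server (scenarios 2 and 3) or a separate worker process (scenario 4) the inner request has somewhere else to run.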

So to sum up, issuing sql calls against hiveserver2 inside a udf is not a good practice.

keep striving!
