Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

"insert as select" with UDF in select statement

Highlighted

"insert as select" with UDF in select statement

New Contributor

Hi community. First of all: thank you all a lot for the great work everyone is doing here, trying to help! The forum brought me far with Hortonworks. However, I am stuck now with using UDFs.

What I have:

I have a number of different UDFs that are used across our analysis platform in hive. The UDFs are written in Java and work perfectly fine in normal select statements. So no problems there.

What I want to achieve:

For the process that needs to be built, it is essential to fill an empty table with processed data. Processed data in detail means, that I have about 120 million records, and I calculate indicator values based on the "surrounding" rows with lag and lead. To calculate the indicator values, the UDF is used (basically, it takes an array from the leading line and processes it with an array from the lagging line). As I said: No problems there. Runds pretty smooth.

My understanding was, that when I use the statement to insert the data into an empty table, the exact output of the select statement will be written. But: This is not the case. The columns, that should contain the indicator values, are null.

Questions:

Is it even possible to use UDFs in "insert as select" statements?

If so, is there anything special I need to take in account?

Of course I have attached the relevant code:

Hive UDF jar (stripped out some info):

package udfs;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;

@UDFType(stateful = true)
public class DianaKennzahl extends UDF {

	public Double evaluate(Text this_array_string, Text previous_array_string, Integer kennzahl_typ,
			Integer samplingInterval) {

		... do some magic ...

		return result;

	}
}

And my hive query looks like this:

insert into test as
select
 *,
 diana_kennzahl(
 lag(
  `current`,
  1
 ) over(
  partition by objectid, position
 order by
  objectid asc,
  position asc,
  time asc
 ),
 `current`,
 1, 
 samplinginterval) as kennzahl1,
 diana_kennzahl(
 lag(
  `current`,
  1
 ) over(
  partition by objectid, position
 order by
  objectid asc,
  position asc,
  time asc
 ),
 `current`,
 2,
 samplinginterval) as kennzahl2,
 diana_kennzahl(
 lag(
  `current`,
  1
 ) over(
  partition by objectid, position
 order by
  objectid asc,
  position asc,
  time asc
 ),
 `current`,
 3,
 samplinginterval) as kennzahl3,
 
 diana_kennzahl(
 lag(
  `current`,
  1
 ) over(
  partition by objectid, position
 order by
  objectid asc,
  position asc,
  time asc
 ),
 `current`,
 4,
 samplinginterval) as kennzahl4 
from
 test_input
order by
 objectid asc,
 position asc,
 time asc
;

Simple select: no problem

Insert as select: last 4 columns (the ones calculated) stay empty

I am thankful for anything that points me in the right direction. Even if it's "You can't do that" :)

Thanks in advance!

1 REPLY 1

Re: "insert as select" with UDF in select statement

New Contributor

@Frederick, Any luck with this issue? I am also stuck with same problem, any help is appreciated.

Don't have an account?
Coming from Hortonworks? Activate your account here