Member since
03-28-2018
1
Post
0
Kudos Received
0
Solutions
03-28-2018
12:30 PM
Hi community. First of all: thank you all a lot for the great work everyone is doing here, trying to help! The forum brought me far with Hortonworks. However, I am stuck now with using UDFs. What I have: I have a number of different UDFs that are used across our analysis platform in hive. The UDFs are written in Java and work perfectly fine in normal select statements. So no problems there. What I want to achieve: For the process that needs to be built, it is essential to fill an empty table with processed data. Processed data in detail means, that I have about 120 million records, and I calculate indicator values based on the "surrounding" rows with lag and lead. To calculate the indicator values, the UDF is used (basically, it takes an array from the leading line and processes it with an array from the lagging line). As I said: No problems there. Runds pretty smooth. My understanding was, that when I use the statement to insert the data into an empty table, the exact output of the select statement will be written. But: This is not the case. The columns, that should contain the indicator values, are null. Questions: Is it even possible to use UDFs in "insert as select" statements? If so, is there anything special I need to take in account? Of course I have attached the relevant code: Hive UDF jar (stripped out some info): package udfs;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.io.Text;
@UDFType(stateful = true)
public class DianaKennzahl extends UDF {
public Double evaluate(Text this_array_string, Text previous_array_string, Integer kennzahl_typ,
Integer samplingInterval) {
... do some magic ...
return result;
}
}
And my hive query looks like this: insert into test as
select
*,
diana_kennzahl(
lag(
`current`,
1
) over(
partition by objectid, position
order by
objectid asc,
position asc,
time asc
),
`current`,
1,
samplinginterval) as kennzahl1,
diana_kennzahl(
lag(
`current`,
1
) over(
partition by objectid, position
order by
objectid asc,
position asc,
time asc
),
`current`,
2,
samplinginterval) as kennzahl2,
diana_kennzahl(
lag(
`current`,
1
) over(
partition by objectid, position
order by
objectid asc,
position asc,
time asc
),
`current`,
3,
samplinginterval) as kennzahl3,
diana_kennzahl(
lag(
`current`,
1
) over(
partition by objectid, position
order by
objectid asc,
position asc,
time asc
),
`current`,
4,
samplinginterval) as kennzahl4
from
test_input
order by
objectid asc,
position asc,
time asc
;
Simple select: no problem Insert as select: last 4 columns (the ones calculated) stay empty I am thankful for anything that points me in the right direction. Even if it's "You can't do that" 🙂 Thanks in advance!
... View more
Labels: