Created on 06-14-2016 04:28 AM - edited 08-19-2019 01:47 AM
Hi,
I just wanted to know is there a way to fill up null values with previous value/record in Hive?
Example:
date employee salary
5/16/16 Dave 25,000
5/17/16 Richard 10,000
5/18/16 NULL NULL
5/17/16 Howard 50,000
5/18/16 NULL NULL
INTO:
date employee salary
5/16/16 Dave 25,000
5/17/16 Richard 10,000
5/18/16 Richard 10,000
5/17/16 Howard 50,000
5/18/16 Howard 50,000
Assuming there's thousand records with several null values. Appreciate your help on this.
To understand further, attached is my primary objective.
Regards,
Bruce
Created 06-14-2016 07:59 PM
@Bruce Perez Are you looking for the SQL that will help you achieve the "filling" that you want to do?
Looking at your simplified picture, I created two tables: dates and date_rate and populated them with your sample data. The query you can use to do the appropriate joins and "fill down" the missing values should look roughly like this. There may be other methods, but in this one, we create a new view of your date_rate data, by assuming the "end_date" is one less than the next date_rate record. Then, we can join the dates table onto that using the range instead.
SELECT d.*, dr.* FROM dates d LEFT OUTER JOIN ( SELECT r1.rate, r1.dt AS start_date, MIN(DATE_SUB(r2.dt, 1)) AS end_date FROM date_rate r1 LEFT OUTER JOIN date_rate r2 ON r1.dt < r2.dt GROUP BY r1.rate, r1.dt ) dr ON d.dt >= dr.start_date AND (d.dt <= dr.end_date OR dr.end_date IS NULL) ORDER BY d.dt
Created 06-14-2016 04:31 AM
If your data is in ORC format this can be done by simple performing a update statement on your table.
INSERT ... VALUES, UPDATE, and DELETE SQL statements are supported in Apache Hive 0.14 and later. The INSERT ... VALUES statement enable users to write data to Apache Hive from values provided in SQL statements. The UPDATE and DELETE statements enable users to modify and delete values already written to Hive. All three statements support auto-commit, which means that each statement is a separate transaction that is automatically committed after the SQL statement is executed.
More information available here.
Created 06-14-2016 04:36 AM
if your table is not in orc format, then create another table just like the one you have today like this:
CREATE TABLE ... STORED AS ORC
ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
SET hive.default.fileformat=Orc
then insert into this table from your existing table. you can use statement INSERT INTO TABLE tablename1
Created 06-14-2016 05:02 AM
Thanks @Sunile Manjee. The point is i'm creating a table based from a joined table (combined_table). Next, I need to fill up the null values in the combined_table since I queried full outer join. With the combined_table, I need to fill up now the null values based from the previous row value through created the derived_table. I'm not using orc format.
Created 06-14-2016 09:22 PM
how about using COALESCE?
Returns the first v that is not NULL, or NULL if all v's are NULL. |
SELECT COALESCE(datefield1, datefield2, datefield3) as first_date_found
FROM
tblDates
WHERE
primary_key = 1
Created 06-15-2016 03:38 AM
@Sunile Manjee: Yeah, I tried also COALESCE. Anyway, Thanks, appreciate your help
Created 06-14-2016 12:47 PM
As a next step you will need to create a table with orc format, fill the table with your joined dat using insert into select ..., then update it using method i have described..
Created 06-14-2016 07:59 PM
@Bruce Perez Are you looking for the SQL that will help you achieve the "filling" that you want to do?
Looking at your simplified picture, I created two tables: dates and date_rate and populated them with your sample data. The query you can use to do the appropriate joins and "fill down" the missing values should look roughly like this. There may be other methods, but in this one, we create a new view of your date_rate data, by assuming the "end_date" is one less than the next date_rate record. Then, we can join the dates table onto that using the range instead.
SELECT d.*, dr.* FROM dates d LEFT OUTER JOIN ( SELECT r1.rate, r1.dt AS start_date, MIN(DATE_SUB(r2.dt, 1)) AS end_date FROM date_rate r1 LEFT OUTER JOIN date_rate r2 ON r1.dt < r2.dt GROUP BY r1.rate, r1.dt ) dr ON d.dt >= dr.start_date AND (d.dt <= dr.end_date OR dr.end_date IS NULL) ORDER BY d.dt
Created 06-15-2016 03:39 AM
@Paul Boal: Thanks, this is the nearest solution to my problem, just need modification on this one.
Created 02-20-2020 03:52 AM
Also LEAD() function can be used with COALESCE()
For example
coalesce(sepan, lead(sepan, 1) over (partition by sourcedata order by timestamps)) as sepan
That will replace NULL values in sepan column with next non null value. The timestamps column could be ordered in asc order depending on the value you wan to copy.