Support Questions

Find answers, ask questions, and share your expertise

Fill 'Null' With Previous Row Values in Hive

avatar
Contributor

Hi,

I just wanted to know is there a way to fill up null values with previous value/record in Hive?

Example:

date employee salary

5/16/16 Dave 25,000

5/17/16 Richard 10,000

5/18/16 NULL NULL

5/17/16 Howard 50,000

5/18/16 NULL NULL

INTO:

date employee salary

5/16/16 Dave 25,000

5/17/16 Richard 10,000

5/18/16 Richard 10,000

5/17/16 Howard 50,000

5/18/16 Howard 50,000

Assuming there's thousand records with several null values. Appreciate your help on this.

To understand further, attached is my primary objective.

4991-1.png

Regards,

Bruce

1 ACCEPTED SOLUTION

avatar
Contributor

@Bruce Perez Are you looking for the SQL that will help you achieve the "filling" that you want to do?

Looking at your simplified picture, I created two tables: dates and date_rate and populated them with your sample data. The query you can use to do the appropriate joins and "fill down" the missing values should look roughly like this. There may be other methods, but in this one, we create a new view of your date_rate data, by assuming the "end_date" is one less than the next date_rate record. Then, we can join the dates table onto that using the range instead.

SELECT
  d.*,
  dr.*
FROM
  dates d LEFT OUTER JOIN
  (
    SELECT
      r1.rate,
      r1.dt AS start_date,
      MIN(DATE_SUB(r2.dt, 1)) AS end_date
    FROM
      date_rate r1 LEFT OUTER JOIN
      date_rate r2 ON r1.dt < r2.dt
    GROUP BY
      r1.rate, r1.dt
    ) dr ON d.dt >= dr.start_date AND (d.dt <= dr.end_date OR dr.end_date IS NULL)
ORDER BY
  d.dt

View solution in original post

9 REPLIES 9

avatar
Master Guru

@Bruce Perez

If your data is in ORC format this can be done by simple performing a update statement on your table.

INSERT ... VALUES, UPDATE, and DELETE SQL statements are supported in Apache Hive 0.14 and later. The INSERT ... VALUES statement enable users to write data to Apache Hive from values provided in SQL statements. The UPDATE and DELETE statements enable users to modify and delete values already written to Hive. All three statements support auto-commit, which means that each statement is a separate transaction that is automatically committed after the SQL statement is executed.

More information available here.

avatar
Master Guru

if your table is not in orc format, then create another table just like the one you have today like this:

  • CREATE TABLE ... STORED AS ORC
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc

then insert into this table from your existing table. you can use statement INSERT INTO TABLE tablename1

avatar
Contributor

Thanks @Sunile Manjee. The point is i'm creating a table based from a joined table (combined_table). Next, I need to fill up the null values in the combined_table since I queried full outer join. With the combined_table, I need to fill up now the null values based from the previous row value through created the derived_table. I'm not using orc format.

avatar
Master Guru

@Bruce Perez

how about using COALESCE?

Returns the first v that is not NULL, or NULL if all v's are NULL.

SELECT COALESCE(datefield1, datefield2, datefield3) as first_date_found FROM tblDates WHERE primary_key = 1

avatar
Contributor

@Sunile Manjee: Yeah, I tried also COALESCE. Anyway, Thanks, appreciate your help

avatar
Master Guru

As a next step you will need to create a table with orc format, fill the table with your joined dat using insert into select ..., then update it using method i have described..

avatar
Contributor

@Bruce Perez Are you looking for the SQL that will help you achieve the "filling" that you want to do?

Looking at your simplified picture, I created two tables: dates and date_rate and populated them with your sample data. The query you can use to do the appropriate joins and "fill down" the missing values should look roughly like this. There may be other methods, but in this one, we create a new view of your date_rate data, by assuming the "end_date" is one less than the next date_rate record. Then, we can join the dates table onto that using the range instead.

SELECT
  d.*,
  dr.*
FROM
  dates d LEFT OUTER JOIN
  (
    SELECT
      r1.rate,
      r1.dt AS start_date,
      MIN(DATE_SUB(r2.dt, 1)) AS end_date
    FROM
      date_rate r1 LEFT OUTER JOIN
      date_rate r2 ON r1.dt < r2.dt
    GROUP BY
      r1.rate, r1.dt
    ) dr ON d.dt >= dr.start_date AND (d.dt <= dr.end_date OR dr.end_date IS NULL)
ORDER BY
  d.dt

avatar
Contributor

@Paul Boal: Thanks, this is the nearest solution to my problem, just need modification on this one.

avatar
New Contributor

Also LEAD() function can be used with COALESCE()

For example 

 

coalesce(sepan, lead(sepan, 1) over (partition by sourcedata order by timestamps)) as sepan

 

That will replace NULL values in sepan column with next non null value. The timestamps column could be ordered in asc order depending on the value you wan to copy.