Support Questions

Find answers, ask questions, and share your expertise

Null check query

avatar
Contributor

Hi everyone,

 

So I'm trying to write a data quality query that is efficient in counting the number of nulls in particular columns. Currently the query that I use is as below(I actually found a related example here on a related topic) :

SELECT CustFirstName, CustLastName

,CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck

,CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck

FROM SchemaName.DbName

WHERE Date = '2023-03-15'

 

The output that I get is:

CustFirstNameCustLastNameCustFirstNameNullCheckCustLastNameNullCheck
{null}{null}11

 

And then hundreds of thousands of single rows of {nulls} that follow(I know, it's a long story).

 

Anyhow, the output that I would LIKE to get, please, is a much simpler layout with just the column(or an alias) and a count of the number of total nulls:

CustFirstNameCustLastName
500,000 (nulls)500,000 (nulls)

 

I've tried different iterations of SELECT COUNT(*) and just could not get the query right. MapR gives a gigantic error message when trying to use it. Thanks in advance!

1 ACCEPTED SOLUTION

avatar

Hive @Supernova ,

There can be multiple solutions probabaly, for one example with a subquery you can use the following:

 

select concat(cast(sum(CustFirstNameNullCheck) as bigint),' (nulls)') as CustFirstName, concat(cast(sum(CustLastNameNullCheck) as bigint),' (nulls)') as CustLastName from (select CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck, CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck from SchemaName.DbName WHERE Date = '2023-03-15') a;

 Hope this helps.

Best regards, Miklos

View solution in original post

6 REPLIES 6

avatar
Community Manager

@Supernova Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @Shmoo and @mszurap  who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar

Hive @Supernova ,

There can be multiple solutions probabaly, for one example with a subquery you can use the following:

 

select concat(cast(sum(CustFirstNameNullCheck) as bigint),' (nulls)') as CustFirstName, concat(cast(sum(CustLastNameNullCheck) as bigint),' (nulls)') as CustLastName from (select CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck, CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck from SchemaName.DbName WHERE Date = '2023-03-15') a;

 Hope this helps.

Best regards, Miklos

avatar
Contributor

Thanks so much, @mszurap! Your query worked amazingly efficient!

avatar

Great, glad to hear that it was helpful.

Actually I was thinking about using the NVL function, but in Hive that does not offer a value for the "else" part, like Impala's NVL2 funcion:

https://impala.apache.org/docs/build/asf-site-html/topics/impala_conditional_functions.html#conditio...

 

with that the query would be much simpler (no need for CASE WHEN ... THEN ... ELSE ... END), just a "NVL2(CustFirstName, 0, 1)

SELECT NVL2('ABC', 'Is Not Null', 'Is Null'); -- Returns 'Is Not Null'

Again, this is for Impala, Hive does not have this function unfortunately.

avatar
Contributor

Thanks for the info! I am educating myself and certainly appreciate it. 

Using NVL2, it looks like it either returns a 0 if a null is found, or the specified expression in the argument.

 

Even if I used Impala, NVL2 wouldn't work for me- as I need(and expect) the specific number count of null records across given columns, right? Just wondering. Thanks again!

avatar

I assume you meant NVL function returns 0 or 1 if a null is found, or the specified expression in the argument - when using Hive.

With Impala and NVL2 you would still need to have the outer query to "sum" up all the 1 values what we have mapped from the column value (from their real value to 0 or to 1). It would just be a bit nicer, but no real change.