Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Null check query

avatar
Explorer

Hi everyone,

 

So I'm trying to write a data quality query that is efficient in counting the number of nulls in particular columns. Currently the query that I use is as below(I actually found a related example here on a related topic) :

SELECT CustFirstName, CustLastName

,CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck

,CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck

FROM SchemaName.DbName

WHERE Date = '2023-03-15'

 

The output that I get is:

CustFirstNameCustLastNameCustFirstNameNullCheckCustLastNameNullCheck
{null}{null}11

 

And then hundreds of thousands of single rows of {nulls} that follow(I know, it's a long story).

 

Anyhow, the output that I would LIKE to get, please, is a much simpler layout with just the column(or an alias) and a count of the number of total nulls:

CustFirstNameCustLastName
500,000 (nulls)500,000 (nulls)

 

I've tried different iterations of SELECT COUNT(*) and just could not get the query right. MapR gives a gigantic error message when trying to use it. Thanks in advance!

1 ACCEPTED SOLUTION

avatar

Hive @Supernova ,

There can be multiple solutions probabaly, for one example with a subquery you can use the following:

 

select concat(cast(sum(CustFirstNameNullCheck) as bigint),' (nulls)') as CustFirstName, concat(cast(sum(CustLastNameNullCheck) as bigint),' (nulls)') as CustLastName from (select CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck, CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck from SchemaName.DbName WHERE Date = '2023-03-15') a;

 Hope this helps.

Best regards, Miklos

View solution in original post

6 REPLIES 6

avatar
Community Manager

@Supernova Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @Shmoo and @mszurap  who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar

Hive @Supernova ,

There can be multiple solutions probabaly, for one example with a subquery you can use the following:

 

select concat(cast(sum(CustFirstNameNullCheck) as bigint),' (nulls)') as CustFirstName, concat(cast(sum(CustLastNameNullCheck) as bigint),' (nulls)') as CustLastName from (select CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck, CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck from SchemaName.DbName WHERE Date = '2023-03-15') a;

 Hope this helps.

Best regards, Miklos

avatar
Explorer

Thanks so much, @mszurap! Your query worked amazingly efficient!

avatar

Great, glad to hear that it was helpful.

Actually I was thinking about using the NVL function, but in Hive that does not offer a value for the "else" part, like Impala's NVL2 funcion:

https://impala.apache.org/docs/build/asf-site-html/topics/impala_conditional_functions.html#conditio...

 

with that the query would be much simpler (no need for CASE WHEN ... THEN ... ELSE ... END), just a "NVL2(CustFirstName, 0, 1)

SELECT NVL2('ABC', 'Is Not Null', 'Is Null'); -- Returns 'Is Not Null'

Again, this is for Impala, Hive does not have this function unfortunately.

avatar
Explorer

Thanks for the info! I am educating myself and certainly appreciate it. 

Using NVL2, it looks like it either returns a 0 if a null is found, or the specified expression in the argument.

 

Even if I used Impala, NVL2 wouldn't work for me- as I need(and expect) the specific number count of null records across given columns, right? Just wondering. Thanks again!

avatar

I assume you meant NVL function returns 0 or 1 if a null is found, or the specified expression in the argument - when using Hive.

With Impala and NVL2 you would still need to have the outer query to "sum" up all the 1 values what we have mapped from the column value (from their real value to 0 or to 1). It would just be a bit nicer, but no real change.