Support Questions

Supernova · ‎03-27-2023

Hi everyone,

So I'm trying to write a data quality query that is efficient in counting the number of nulls in particular columns. Currently the query that I use is as below(I actually found a related example here on a related topic) :

SELECT CustFirstName, CustLastName

,CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck

,CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck

FROM SchemaName.DbName

WHERE Date = '2023-03-15'

The output that I get is:

CustFirstName	CustLastName	CustFirstNameNullCheck	CustLastNameNullCheck
{null}	{null}	1	1

And then hundreds of thousands of single rows of {nulls} that follow(I know, it's a long story).

Anyhow, the output that I would LIKE to get, please, is a much simpler layout with just the column(or an alias) and a count of the number of total nulls:

CustFirstName	CustLastName
500,000 (nulls)	500,000 (nulls)

I've tried different iterations of SELECT COUNT(*) and just could not get the query right. MapR gives a gigantic error message when trying to use it. Thanks in advance!

mszurap · ‎03-28-2023

Hive @Supernova ,

There can be multiple solutions probabaly, for one example with a subquery you can use the following:

select concat(cast(sum(CustFirstNameNullCheck) as bigint),' (nulls)') as CustFirstName, concat(cast(sum(CustLastNameNullCheck) as bigint),' (nulls)') as CustLastName from (select CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck, CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck from SchemaName.DbName WHERE Date = '2023-03-15') a;

Hope this helps.

Best regards, Miklos

View solution in original post

DianaTorres · ‎03-27-2023

@Supernova Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @Shmoo and @mszurap who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.

Regards,

Diana Torres,
Community Moderator

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Community Guidelines
How to use the forum

mszurap · ‎03-28-2023

Hive @Supernova ,

There can be multiple solutions probabaly, for one example with a subquery you can use the following:

select concat(cast(sum(CustFirstNameNullCheck) as bigint),' (nulls)') as CustFirstName, concat(cast(sum(CustLastNameNullCheck) as bigint),' (nulls)') as CustLastName from (select CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck, CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck from SchemaName.DbName WHERE Date = '2023-03-15') a;

Hope this helps.

Best regards, Miklos

Supernova · ‎03-29-2023

Thanks so much, @mszurap! Your query worked amazingly efficient!

mszurap · ‎03-29-2023

Great, glad to hear that it was helpful.

Actually I was thinking about using the NVL function, but in Hive that does not offer a value for the "else" part, like Impala's NVL2 funcion:

https://impala.apache.org/docs/build/asf-site-html/topics/impala_conditional_functions.html#conditio...

with that the query would be much simpler (no need for CASE WHEN ... THEN ... ELSE ... END), just a "NVL2(CustFirstName, 0, 1)

SELECT NVL2('ABC', 'Is Not Null', 'Is Null'); -- Returns 'Is Not Null'

Again, this is for Impala, Hive does not have this function unfortunately.

Supernova · ‎03-29-2023

Thanks for the info! I am educating myself and certainly appreciate it.

Using NVL2, it looks like it either returns a 0 if a null is found, or the specified expression in the argument.

Even if I used Impala, NVL2 wouldn't work for me- as I need(and expect) the specific number count of null records across given columns, right? Just wondering. Thanks again!

mszurap · ‎03-29-2023

I assume you meant NVL function returns 0 or 1 if a null is found, or the specified expression in the argument - when using Hive.

With Impala and NVL2 you would still need to have the outer query to "sum" up all the 1 values what we have mapped from the column value (from their real value to 0 or to 1). It would just be a bit nicer, but no real change.

Support Questions

Null check query