Created 03-27-2023 08:34 AM
Hi everyone,
So I'm trying to write a data quality query that is efficient in counting the number of nulls in particular columns. Currently the query that I use is as below(I actually found a related example here on a related topic) :
SELECT CustFirstName, CustLastName
,CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck
,CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck
FROM SchemaName.DbName
WHERE Date = '2023-03-15'
The output that I get is:
CustFirstName | CustLastName | CustFirstNameNullCheck | CustLastNameNullCheck |
{null} | {null} | 1 | 1 |
And then hundreds of thousands of single rows of {nulls} that follow(I know, it's a long story).
Anyhow, the output that I would LIKE to get, please, is a much simpler layout with just the column(or an alias) and a count of the number of total nulls:
CustFirstName | CustLastName |
500,000 (nulls) | 500,000 (nulls) |
I've tried different iterations of SELECT COUNT(*) and just could not get the query right. MapR gives a gigantic error message when trying to use it. Thanks in advance!
Created 03-28-2023 06:43 AM
Hive @Supernova ,
There can be multiple solutions probabaly, for one example with a subquery you can use the following:
select concat(cast(sum(CustFirstNameNullCheck) as bigint),' (nulls)') as CustFirstName, concat(cast(sum(CustLastNameNullCheck) as bigint),' (nulls)') as CustLastName from (select CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck, CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck from SchemaName.DbName WHERE Date = '2023-03-15') a;
Hope this helps.
Best regards, Miklos
Created 03-27-2023 10:53 AM
@Supernova Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our Hive experts @Shmoo and @mszurap who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,Created 03-28-2023 06:43 AM
Hive @Supernova ,
There can be multiple solutions probabaly, for one example with a subquery you can use the following:
select concat(cast(sum(CustFirstNameNullCheck) as bigint),' (nulls)') as CustFirstName, concat(cast(sum(CustLastNameNullCheck) as bigint),' (nulls)') as CustLastName from (select CASE WHEN CustFirstName is null then 1 else 0 end CustFirstNameNullCheck, CASE WHEN CustLastName is null then 1 else 0 end CustLastNameNullCheck from SchemaName.DbName WHERE Date = '2023-03-15') a;
Hope this helps.
Best regards, Miklos
Created 03-29-2023 08:41 AM
Thanks so much, @mszurap! Your query worked amazingly efficient!
Created 03-29-2023 08:49 AM
Great, glad to hear that it was helpful.
Actually I was thinking about using the NVL function, but in Hive that does not offer a value for the "else" part, like Impala's NVL2 funcion:
with that the query would be much simpler (no need for CASE WHEN ... THEN ... ELSE ... END), just a "NVL2(CustFirstName, 0, 1)
SELECT NVL2('ABC', 'Is Not Null', 'Is Null'); -- Returns 'Is Not Null'
Again, this is for Impala, Hive does not have this function unfortunately.
Created 03-29-2023 09:12 AM
Thanks for the info! I am educating myself and certainly appreciate it.
Using NVL2, it looks like it either returns a 0 if a null is found, or the specified expression in the argument.
Even if I used Impala, NVL2 wouldn't work for me- as I need(and expect) the specific number count of null records across given columns, right? Just wondering. Thanks again!
Created 03-29-2023 09:33 AM
I assume you meant NVL function returns 0 or 1 if a null is found, or the specified expression in the argument - when using Hive.
With Impala and NVL2 you would still need to have the outer query to "sum" up all the 1 values what we have mapped from the column value (from their real value to 0 or to 1). It would just be a bit nicer, but no real change.