Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Difference between WHERE ...OR & WHERE ... IN

avatar
Super Collaborator

HDP-2.5.0.0(2.5.0.0-1245)

Query-1 returns 43 rows which include both the equipment no.s 0552094 and :

select equipmentnumber, dimensionid, max(datemessage) maxdtmsg
from group_datascientist.fact_rtc_se
where (equipmentnumber = 0552094 or equipmentnumber = 1857547) and datemessage < '2016-03-01'
group by equipmentnumber, dimensionid order by equipmentnumber, maxdtmsg;
+-------------------------+--------------------+-------------+--+
| equipmentnumber  | dimensionid  |  maxdtmsg   |
+-------------------------+--------------------+-------------+--+
| 0552094                 | 33393              | 2016-01-11  |
| 0552094                 | 23537              | 2016-01-15  |
| 0552094                 | 26115              | 2016-01-28  |
| 0552094                 | 23680              | 2016-01-29  |
| 0552094                 | 23664              | 2016-01-29  |
| 0552094                 | 73714              | 2016-01-29  |
| 0552094                 | 23530              | 2016-02-02  |
| 0552094                 | 23742              | 2016-02-03  |
| 0552094                 | 840502             | 2016-02-04  |
| 0552094                 | 322547             | 2016-02-18  |
| 0552094                 | 24234              | 2016-02-19  |
| 0552094                 | 24419              | 2016-02-19  |
| 0552094                 | 324917             | 2016-02-24  |
| 0552094                 | 1371670            | 2016-02-25  |
| 0552094                 | 156684             | 2016-02-26  |
| 0552094                 | 86745              | 2016-02-26  |
| 0552094                 | 687957             | 2016-02-28  |
| 0552094                 | 103128             | 2016-02-29  |
| 0552094                 | 36081              | 2016-02-29  |
| 0552094                 | 18943              | 2016-02-29  |
| 1857547                 | 927956             | 2016-01-08  |
| 1857547                 | 749597             | 2016-01-15  |
| 1857547                 | 312955             | 2016-01-15  |
| 1857547                 | 1117802            | 2016-01-20  |
| 1857547                 | 903606             | 2016-01-27  |
| 1857547                 | 196616             | 2016-01-29  |
| 1857547                 | 621571             | 2016-02-05  |
| 1857547                 | 175172             | 2016-02-08  |
| 1857547                 | 663615             | 2016-02-10  |
| 1857547                 | 194722             | 2016-02-11  |
| 1857547                 | 175415             | 2016-02-12  |
| 1857547                 | 241920             | 2016-02-15  |
| 1857547                 | 1292068            | 2016-02-15  |
| 1857547                 | 185040             | 2016-02-16  |
| 1857547                 | 181682             | 2016-02-17  |
| 1857547                 | 1234825            | 2016-02-18  |
| 1857547                 | 1444875            | 2016-02-18  |
| 1857547                 | 1175541            | 2016-02-19  |
| 1857547                 | 179475             | 2016-02-19  |
| 1857547                 | 1363760            | 2016-02-23  |
| 1857547                 | 203597             | 2016-02-24  |
| 1857547                 | 815551             | 2016-02-29  |
| 1857547                 | 18943              | 2016-02-29  |
+-------------------------+--------------------+-------------+--+
43 rows selected

In the query-2, I just used the IN operator for better readability but now only 23 rows are returned and only equip. no. 1857547 rows :

select equipmentnumber, dimensionid, max(datemessage) maxdtmsg
from group_datascientist.fact_rtc_se
where equipmentnumber IN(0552094,1857547) and datemessage < '2016-03-01'
group by equipmentnumber, dimensionid order by equipmentnumber,maxdtmsg;
+-------------------------+--------------------+-------------+--+
| equipmentnumber  | dimensionid  |  maxdtmsg   |
+-------------------------+--------------------+-------------+--+
| 1857547                 | 927956             | 2016-01-08  |
| 1857547                 | 749597             | 2016-01-15  |
| 1857547                 | 312955             | 2016-01-15  |
| 1857547                 | 1117802            | 2016-01-20  |
| 1857547                 | 903606             | 2016-01-27  |
| 1857547                 | 196616             | 2016-01-29  |
| 1857547                 | 621571             | 2016-02-05  |
| 1857547                 | 175172             | 2016-02-08  |
| 1857547                 | 663615             | 2016-02-10  |
| 1857547                 | 194722             | 2016-02-11  |
| 1857547                 | 175415             | 2016-02-12  |
| 1857547                 | 241920             | 2016-02-15  |
| 1857547                 | 1292068            | 2016-02-15  |
| 1857547                 | 185040             | 2016-02-16  |
| 1857547                 | 181682             | 2016-02-17  |
| 1857547                 | 1234825            | 2016-02-18  |
| 1857547                 | 1444875            | 2016-02-18  |
| 1857547                 | 179475             | 2016-02-19  |
| 1857547                 | 1175541            | 2016-02-19  |
| 1857547                 | 1363760            | 2016-02-23  |
| 1857547                 | 203597             | 2016-02-24  |
| 1857547                 | 18943              | 2016-02-29  |
| 1857547                 | 815551             | 2016-02-29  |
+-------------------------+--------------------+-------------+--+
23 rows selected

What am I missing ?

1 ACCEPTED SOLUTION

avatar

Hi @Kaliyug Antagonist. I couldn't help but notice that the value in question had a leading zero.

Check the datatype used for the column called equipmentNumber and adjust the IN clause accordingly. It looks like equipmentNumber is defined as a String and thus you will have to quote the values inside the "IN" clause.

Here are some examples & results:

1. Using INT as the datatype for equipmentNumber:

create table if not exists fact_rtc_se
 (equipmentnumber int, dimensionid string, datemessage date) ;

insert into fact_rtc_se values (0552094, 36081, '2016-02-29') ; 
insert into fact_rtc_se values (0552094, 18943, '2016-02-29') ;
insert into fact_rtc_se values (1857547, 27956, '2016-01-08') ;
insert into fact_rtc_se values (1857547, 749597, '2016-01-15') ; 

select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01'; 

552094	36081	2016-02-29
552094	18943	2016-02-29
1857547	27956	2016-01-08
1857547	749597	2016-01-15

select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01';

1857547	27956	2016-01-08
1857547	749597	2016-01-15

2. Using STRING as the datatype for equipmentNumber:

create table if not exists fact_rtc_se
 (equipmentnumber string, dimensionid string, datemessage date) ;

insert into fact_rtc_se values ('0552094', 36081, '2016-02-29') ; 
insert into fact_rtc_se values ('0552094', 18943, '2016-02-29') ;
insert into fact_rtc_se values ('1857547', 27956, '2016-01-08') ;
insert into fact_rtc_se values ('1857547', 749597, '2016-01-15') ; 

select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01'; 

1857547	27956	2016-01-08
1857547	749597	2016-01-15

select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01'; 

0552094	36081	2016-02-29
0552094	18943	2016-02-29
1857547	27956	2016-01-08
1857547	749597	2016-01-15

I hope this helps.

View solution in original post

3 REPLIES 3

avatar

Hi @Kaliyug Antagonist. I couldn't help but notice that the value in question had a leading zero.

Check the datatype used for the column called equipmentNumber and adjust the IN clause accordingly. It looks like equipmentNumber is defined as a String and thus you will have to quote the values inside the "IN" clause.

Here are some examples & results:

1. Using INT as the datatype for equipmentNumber:

create table if not exists fact_rtc_se
 (equipmentnumber int, dimensionid string, datemessage date) ;

insert into fact_rtc_se values (0552094, 36081, '2016-02-29') ; 
insert into fact_rtc_se values (0552094, 18943, '2016-02-29') ;
insert into fact_rtc_se values (1857547, 27956, '2016-01-08') ;
insert into fact_rtc_se values (1857547, 749597, '2016-01-15') ; 

select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01'; 

552094	36081	2016-02-29
552094	18943	2016-02-29
1857547	27956	2016-01-08
1857547	749597	2016-01-15

select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01';

1857547	27956	2016-01-08
1857547	749597	2016-01-15

2. Using STRING as the datatype for equipmentNumber:

create table if not exists fact_rtc_se
 (equipmentnumber string, dimensionid string, datemessage date) ;

insert into fact_rtc_se values ('0552094', 36081, '2016-02-29') ; 
insert into fact_rtc_se values ('0552094', 18943, '2016-02-29') ;
insert into fact_rtc_se values ('1857547', 27956, '2016-01-08') ;
insert into fact_rtc_se values ('1857547', 749597, '2016-01-15') ; 

select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01'; 

1857547	27956	2016-01-08
1857547	749597	2016-01-15

select equipmentnumber, dimensionid, datemessage from fact_rtc_se
where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01'; 

0552094	36081	2016-02-29
0552094	18943	2016-02-29
1857547	27956	2016-01-08
1857547	749597	2016-01-15

I hope this helps.

avatar
Super Collaborator

bpreachuk you are right, the data type is string. The source RDBMS(both Oracle and SQL Server) handle this identically for both IN and OR operators but in Hive, the difference is surfacing only after a close look at the data. Any idea how to overcome this ? Since Sqoop is just using the source data types, I can't think of a way ...

avatar

Hi @Kaliyug Antagonist. This is a bit of a philosophical question.

The issue above occurs because the IN statement appears to do an implicit conversion of 0552094 to a numeric datatype and the IN statement does not find the row for 0552094. This implicit conversion is not what you want the IN statement to do. By explicitly quoting the numeric value, we do not allow the IN statement to do an Implicit conversion.

IMHO I would recommend that you *never* allow implicit conversions to take place - whether in the RDBMS world (SQL Server, Oracle) or in Hive. By always quoting string/date values (OR using CAST function to ensure the correct datatype) you will get the correct/optimal results and you will never be affected by an implicit conversion.

In the RDBMS world there are good of discussions about avoiding implicit conversions. RDBMSs do a much more thorough job of handling conversions, but even they are far from perfect when doing implicit conversions. An example of this - please see the Green/Yellow/Red chart of allowable conversions in this blog post by Jonathan Keyahias... https://www.sqlskills.com/blogs/jonathan/implicit-conversions-that-cause-index-scans/