Created 01-02-2017 12:36 PM
HDP-2.5.0.0(2.5.0.0-1245)
Query-1 returns 43 rows which include both the equipment no.s 0552094 and :
select equipmentnumber, dimensionid, max(datemessage) maxdtmsg from group_datascientist.fact_rtc_se where (equipmentnumber = 0552094 or equipmentnumber = 1857547) and datemessage < '2016-03-01' group by equipmentnumber, dimensionid order by equipmentnumber, maxdtmsg;
+-------------------------+--------------------+-------------+--+ | equipmentnumber | dimensionid | maxdtmsg | +-------------------------+--------------------+-------------+--+ | 0552094 | 33393 | 2016-01-11 | | 0552094 | 23537 | 2016-01-15 | | 0552094 | 26115 | 2016-01-28 | | 0552094 | 23680 | 2016-01-29 | | 0552094 | 23664 | 2016-01-29 | | 0552094 | 73714 | 2016-01-29 | | 0552094 | 23530 | 2016-02-02 | | 0552094 | 23742 | 2016-02-03 | | 0552094 | 840502 | 2016-02-04 | | 0552094 | 322547 | 2016-02-18 | | 0552094 | 24234 | 2016-02-19 | | 0552094 | 24419 | 2016-02-19 | | 0552094 | 324917 | 2016-02-24 | | 0552094 | 1371670 | 2016-02-25 | | 0552094 | 156684 | 2016-02-26 | | 0552094 | 86745 | 2016-02-26 | | 0552094 | 687957 | 2016-02-28 | | 0552094 | 103128 | 2016-02-29 | | 0552094 | 36081 | 2016-02-29 | | 0552094 | 18943 | 2016-02-29 | | 1857547 | 927956 | 2016-01-08 | | 1857547 | 749597 | 2016-01-15 | | 1857547 | 312955 | 2016-01-15 | | 1857547 | 1117802 | 2016-01-20 | | 1857547 | 903606 | 2016-01-27 | | 1857547 | 196616 | 2016-01-29 | | 1857547 | 621571 | 2016-02-05 | | 1857547 | 175172 | 2016-02-08 | | 1857547 | 663615 | 2016-02-10 | | 1857547 | 194722 | 2016-02-11 | | 1857547 | 175415 | 2016-02-12 | | 1857547 | 241920 | 2016-02-15 | | 1857547 | 1292068 | 2016-02-15 | | 1857547 | 185040 | 2016-02-16 | | 1857547 | 181682 | 2016-02-17 | | 1857547 | 1234825 | 2016-02-18 | | 1857547 | 1444875 | 2016-02-18 | | 1857547 | 1175541 | 2016-02-19 | | 1857547 | 179475 | 2016-02-19 | | 1857547 | 1363760 | 2016-02-23 | | 1857547 | 203597 | 2016-02-24 | | 1857547 | 815551 | 2016-02-29 | | 1857547 | 18943 | 2016-02-29 | +-------------------------+--------------------+-------------+--+ 43 rows selected
In the query-2, I just used the IN operator for better readability but now only 23 rows are returned and only equip. no. 1857547 rows :
select equipmentnumber, dimensionid, max(datemessage) maxdtmsg from group_datascientist.fact_rtc_se where equipmentnumber IN(0552094,1857547) and datemessage < '2016-03-01' group by equipmentnumber, dimensionid order by equipmentnumber,maxdtmsg;
+-------------------------+--------------------+-------------+--+ | equipmentnumber | dimensionid | maxdtmsg | +-------------------------+--------------------+-------------+--+ | 1857547 | 927956 | 2016-01-08 | | 1857547 | 749597 | 2016-01-15 | | 1857547 | 312955 | 2016-01-15 | | 1857547 | 1117802 | 2016-01-20 | | 1857547 | 903606 | 2016-01-27 | | 1857547 | 196616 | 2016-01-29 | | 1857547 | 621571 | 2016-02-05 | | 1857547 | 175172 | 2016-02-08 | | 1857547 | 663615 | 2016-02-10 | | 1857547 | 194722 | 2016-02-11 | | 1857547 | 175415 | 2016-02-12 | | 1857547 | 241920 | 2016-02-15 | | 1857547 | 1292068 | 2016-02-15 | | 1857547 | 185040 | 2016-02-16 | | 1857547 | 181682 | 2016-02-17 | | 1857547 | 1234825 | 2016-02-18 | | 1857547 | 1444875 | 2016-02-18 | | 1857547 | 179475 | 2016-02-19 | | 1857547 | 1175541 | 2016-02-19 | | 1857547 | 1363760 | 2016-02-23 | | 1857547 | 203597 | 2016-02-24 | | 1857547 | 18943 | 2016-02-29 | | 1857547 | 815551 | 2016-02-29 | +-------------------------+--------------------+-------------+--+ 23 rows selected
What am I missing ?
Created 01-02-2017 10:47 PM
Hi @Kaliyug Antagonist. I couldn't help but notice that the value in question had a leading zero.
Check the datatype used for the column called equipmentNumber and adjust the IN clause accordingly. It looks like equipmentNumber is defined as a String and thus you will have to quote the values inside the "IN" clause.
Here are some examples & results:
1. Using INT as the datatype for equipmentNumber:
create table if not exists fact_rtc_se (equipmentnumber int, dimensionid string, datemessage date) ; insert into fact_rtc_se values (0552094, 36081, '2016-02-29') ; insert into fact_rtc_se values (0552094, 18943, '2016-02-29') ; insert into fact_rtc_se values (1857547, 27956, '2016-01-08') ; insert into fact_rtc_se values (1857547, 749597, '2016-01-15') ; select equipmentnumber, dimensionid, datemessage from fact_rtc_se where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01'; 552094 36081 2016-02-29 552094 18943 2016-02-29 1857547 27956 2016-01-08 1857547 749597 2016-01-15 select equipmentnumber, dimensionid, datemessage from fact_rtc_se where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01'; 1857547 27956 2016-01-08 1857547 749597 2016-01-15
2. Using STRING as the datatype for equipmentNumber:
create table if not exists fact_rtc_se (equipmentnumber string, dimensionid string, datemessage date) ; insert into fact_rtc_se values ('0552094', 36081, '2016-02-29') ; insert into fact_rtc_se values ('0552094', 18943, '2016-02-29') ; insert into fact_rtc_se values ('1857547', 27956, '2016-01-08') ; insert into fact_rtc_se values ('1857547', 749597, '2016-01-15') ; select equipmentnumber, dimensionid, datemessage from fact_rtc_se where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01'; 1857547 27956 2016-01-08 1857547 749597 2016-01-15 select equipmentnumber, dimensionid, datemessage from fact_rtc_se where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01'; 0552094 36081 2016-02-29 0552094 18943 2016-02-29 1857547 27956 2016-01-08 1857547 749597 2016-01-15
I hope this helps.
Created 01-02-2017 10:47 PM
Hi @Kaliyug Antagonist. I couldn't help but notice that the value in question had a leading zero.
Check the datatype used for the column called equipmentNumber and adjust the IN clause accordingly. It looks like equipmentNumber is defined as a String and thus you will have to quote the values inside the "IN" clause.
Here are some examples & results:
1. Using INT as the datatype for equipmentNumber:
create table if not exists fact_rtc_se (equipmentnumber int, dimensionid string, datemessage date) ; insert into fact_rtc_se values (0552094, 36081, '2016-02-29') ; insert into fact_rtc_se values (0552094, 18943, '2016-02-29') ; insert into fact_rtc_se values (1857547, 27956, '2016-01-08') ; insert into fact_rtc_se values (1857547, 749597, '2016-01-15') ; select equipmentnumber, dimensionid, datemessage from fact_rtc_se where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01'; 552094 36081 2016-02-29 552094 18943 2016-02-29 1857547 27956 2016-01-08 1857547 749597 2016-01-15 select equipmentnumber, dimensionid, datemessage from fact_rtc_se where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01'; 1857547 27956 2016-01-08 1857547 749597 2016-01-15
2. Using STRING as the datatype for equipmentNumber:
create table if not exists fact_rtc_se (equipmentnumber string, dimensionid string, datemessage date) ; insert into fact_rtc_se values ('0552094', 36081, '2016-02-29') ; insert into fact_rtc_se values ('0552094', 18943, '2016-02-29') ; insert into fact_rtc_se values ('1857547', 27956, '2016-01-08') ; insert into fact_rtc_se values ('1857547', 749597, '2016-01-15') ; select equipmentnumber, dimensionid, datemessage from fact_rtc_se where equipmentnumber in (0552094,1857547) and datemessage < '2016-03-01'; 1857547 27956 2016-01-08 1857547 749597 2016-01-15 select equipmentnumber, dimensionid, datemessage from fact_rtc_se where equipmentnumber in ('0552094','1857547') and datemessage < '2016-03-01'; 0552094 36081 2016-02-29 0552094 18943 2016-02-29 1857547 27956 2016-01-08 1857547 749597 2016-01-15
I hope this helps.
Created 01-03-2017 08:25 AM
bpreachuk you are right, the data type is string. The source RDBMS(both Oracle and SQL Server) handle this identically for both IN and OR operators but in Hive, the difference is surfacing only after a close look at the data. Any idea how to overcome this ? Since Sqoop is just using the source data types, I can't think of a way ...
Created 01-03-2017 12:03 PM
Hi @Kaliyug Antagonist. This is a bit of a philosophical question.
The issue above occurs because the IN statement appears to do an implicit conversion of 0552094 to a numeric datatype and the IN statement does not find the row for 0552094. This implicit conversion is not what you want the IN statement to do. By explicitly quoting the numeric value, we do not allow the IN statement to do an Implicit conversion.
IMHO I would recommend that you *never* allow implicit conversions to take place - whether in the RDBMS world (SQL Server, Oracle) or in Hive. By always quoting string/date values (OR using CAST function to ensure the correct datatype) you will get the correct/optimal results and you will never be affected by an implicit conversion.
In the RDBMS world there are good of discussions about avoiding implicit conversions. RDBMSs do a much more thorough job of handling conversions, but even they are far from perfect when doing implicit conversions. An example of this - please see the Green/Yellow/Red chart of allowable conversions in this blog post by Jonathan Keyahias... https://www.sqlskills.com/blogs/jonathan/implicit-conversions-that-cause-index-scans/