12-29-2017 03:54 AM
I'm new to Cloudera, and I'm quite surprised that Cloudera's Hadoop distribution doesn't support Hive versions later than 1.1.0.
Many changes have been made since that version, affecting performance, support for SQL commands (UNION instead of only UNION ALL), and so on.
Maybe there is something that can be used instead of Hive to store and manipulate data the SQL way?
I'm asking because I can't believe that Cloudera can't include the latest version of Hive, so I assume some other solution is used for this purpose.
Best regards, Daniil.
01-02-2018 02:44 AM
01-02-2018 02:51 AM
01-08-2018 11:29 PM
I've read about the things you wrote earlier, and I hope that all the innovations and optimizations made by the Hive developers are applied in the Hive distributed with Cloudera.
select * from clickstream_csv union select * FROM clickstream_bad LIMIT 100;
Error while compiling statement: FAILED: ParseException line 3:0 missing ALL at 'select' near '<EOF>'
So the UNION statement definitely cannot be used.
And this makes me doubt that the changes made by the Hive developers since version 1.1 have been included.
My current Cloudera distribution is 5.12.1.
With best regards, Daniil.
01-09-2018 08:13 AM
Hive 1.1 (CDH 5.4+) only offers UNION ALL (bag union), in which duplicate rows are not eliminated. Starting with Hive 1.2, the UNION DISTINCT feature was introduced, and if no UNION type is explicitly specified, the default UNION operation is DISTINCT. However, with the introduction of this new UNION DISTINCT capability came some other subtle changes to how the UNION ALL feature worked. We are unable to introduce those changes into CDH 5 for risk of affecting existing workloads. It will be available in CDH 6.
In CDH 5, the only supported UNION is UNION ALL. If it fulfils your business requirements, please include the ALL keyword.
select * from clickstream_csv UNION ALL select * FROM clickstream_bad LIMIT 100;
You may then pass the result through a DISTINCT clause to achieve the same effect.
select distinct(salary) from ( select salary from sample_07 union ALL select salary from sample_08) z;
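Applied to the clickstream query from the earlier post, the same pattern might look like the sketch below. It assumes that clickstream_csv and clickstream_bad have the same columns in the same order, and the alias u is only a placeholder. SELECT DISTINCT * should be accepted from Hive 1.1.0 onward, but this has not been verified on CDH 5.12.1; if it is rejected, list the columns explicitly instead of using *.

```sql
-- Emulate UNION (DISTINCT) on Hive 1.1: combine with UNION ALL, then de-duplicate.
-- Assumes both tables share the same column layout.
-- Hive requires an alias for the subquery; "u" is arbitrary.
SELECT DISTINCT *
FROM (
  SELECT * FROM clickstream_csv
  UNION ALL
  SELECT * FROM clickstream_bad
) u
LIMIT 100;
```

Because the LIMIT sits outside the subquery, it is applied after the duplicates have been removed.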
01-09-2018 10:27 PM
Thanks for your response.
I knew the difference between UNION and UNION ALL, and how to eliminate duplicates by combining UNION ALL with DISTINCT.
The new thing for me is that you are going to use a newer version in CDH 6. That's good. Looking forward to it.
Is there somewhere I can check the CDH release roadmap?
01-10-2018 06:04 AM
We do not have a publicly available roadmap for CDH 6 yet. And while nothing is final until it is final, I think it's safe to say that we will be upgrading to at least Hive 1.2, which includes this requested feature.