Member since
09-25-2015
112
Posts
88
Kudos Received
12
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
9711 | 03-15-2017 03:17 PM | |
6126 | 02-17-2017 01:03 PM | |
1843 | 01-02-2017 10:47 PM | |
2742 | 11-16-2016 07:03 PM | |
1106 | 10-07-2016 05:24 PM |
08-01-2016
11:27 AM
1 Kudo
Hi @Upendra Bhandari. FYI I updated my original answer to add more info about why to use string instead of varchar.
... View more
07-29-2016
04:33 PM
Interesting that there is an upper bound for string datatype. Thanks @Kashif Khan.
... View more
07-29-2016
04:31 PM
8 Kudos
Personal anecdotal evidence - with EXTERNAL (externally managed) tables pointing to text files we saw no performance differences whether string/varchar/char. With ORC files we always used String as the datatype, so can't speak to any potential performance differences (except for vectorization - see below). I would recommend string if at all possible - You are correct that it is very handy to not be limited by a length specifier. Even if the data coming in is only Varchar(30) in length, your ELT/ETL processing will not fail if you send in 31 characters while using a string datatype. Also note that in Hive versions older than 0.14 (AKA on HDP 2.1 and older) varchar/char will not use Vectorization with your ORC files, Whereas String can use vectorization. That can be a big difference in query performance. Use string if possible. 😉 ** Edited to add more info about varchar vs. string ** Here is an example of what happens when you push too much data into a varchar(20) field. create table if not exists test_tbl (id int, stringval string, varchar20_val varchar(20)) ;
# insert 20 characters of text into both fields
insert into test_tbl values (1, '01234567890123456789', '01234567890123456789');
# insert 25 characters of text into string field, 20 chars into varchar(20) field
insert into test_tbl values (2, '0123456789012345678901234', '01234567890123456789');
# insert 25 characters of text into both fields
insert into test_tbl values (3, '0123456789012345678901234', '0123456789012345678901234');
# Note - row 3 varchar field has the last 5 characters silently truncated. This is bad.
select id,
length(stringval) as `length_string`, stringval,
length(varchar20_val) as `length_varchar`, varchar20_val
from test_tbl; The output is as follows: id length_string stringval length_varchar varchar20_val
1 20 01234567890123456789 20 01234567890123456789
2 25 0123456789012345678901234 20 01234567890123456789
3 25 0123456789012345678901234 20 01234567890123456789
Time taken: 1.584 seconds, Fetched: 3 row(s) In row 3 you have lost the last 5 characters of data without warning. IMHO this is the most compelling reason to use String instead of Varchar.
... View more
07-12-2016
02:08 PM
OK. That's good. I have not seen any specific issue with NOT IN. Without seeing the data it stands to be asked again - You have confirmed that (a) count(*) on the inner query returns results, (b) that it is not failing silently, and (c) that there are 1+ email addresses in June do not exist prior to June... Here are 2 things to check to see if it is indeed a Hive issue... (1) Try running the query once with MapReduce as the execution engine and then with Tez as the execution engine and see if you get differing results. set hive.execution.engine=mr; set hive.execution.engine=tez; (2) Change the query to use a temporary table and evaluate your results that way (double-check my syntax but it should be close to correct... create temporary table temp_table
as select customer_email FROM Table1
WHERE TO_DATE(created_at) < '2016-05-31';
select DISTINCT SF.customer_email
FROM Magento.sales_flat_order SF
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND MONTH(TO_DATE(SF.created_at)) = '6'
AND SF.customer_email NOT IN (select customer_email from temp_table);
select DISTINCT SF.customer_email
FROM Magento.sales_flat_order SF
left join temp_table
on SF.customer_email = temp_table.customer_email
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND MONTH(TO_DATE(SF.created_at)) = '6'
AND temp_table.customer_email is NULL
... View more
07-12-2016
01:36 PM
Hi @Simran Kaur. I notice this is a similar question as the last one you posted here: (https://community.hortonworks.com/questions/43885/not-in-operator-in-hive-with-a-subquery.html#answer-43944). I am assuming that back then you checked the results of the inner query and the outer query and that you are getting the results you expect. Your query above should return zero rows - you are looking for email addresses created in June of 2016 that are NOT in the list of email addresses created greater than May of 2016. This would return zero rows. So rather than an error - an empty set is the correct result.
... View more
07-07-2016
01:45 PM
Hi @Simran Kaur. I added an answer to your other version of this question found here (https://community.hortonworks.com/questions/43885/not-in-operator-in-hive-with-a-subquery.html#answer-43944) One other thing I like to look for is leading or trailing blanks in the comparison field (email address). But in this instance it should not cause a problem since you are using the same table (and the same email addresses) in the inner and outer query.
... View more
07-07-2016
01:39 PM
Hi @Simran Kaur. This query 'should' work. 😉 Quick things to double-check... 1. These 2 queries return counts > 0 select count(*)
FROM Table1 SF
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND MONTH(TO_DATE(SF.created_at)) = '6' select count(*) FROM Table1 SFO
WHERE TO_DATE(SFO.created_at) < '2016-05-31' 2. based on the counts returned - is it possible that there are no customers from June 2016 that did not receive an email prior to 2016 June? (i.e. the correct result is in fact zero)?
... View more
07-05-2016
05:32 PM
1 Kudo
Hi @Johnny Fuger. When you have a set of files in an existing directory structure, and you are not able to move the files around, there is a way to create a Hive table that is partitioned. You can manually define the partitions (explicitly). It is important to note that you are controlling each partition. You create the table, then add each partition manually via an ALTER TABLE command. Here is an example where there are 3 days worth of files in three different directories: directory #1 has 1 file (10 records total), the second directory has 2 files(20 records total), and the 3rd has 3 files(30 records total): hadoop fs -mkdir -p /user/test/data/2016-07-01
hadoop fs -mkdir -p /user/test/data/2016-07-02
hadoop fs -mkdir -p /user/test/data/2016-07-03
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-01
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-02/poc_data_file2.txt
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-02
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-03
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-03/poc_data_file2.txt
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-03/poc_data_file3.txt
[root@sandbox hdfs]# hadoop fs -ls -R /user/test/data
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:30 /user/test/data/2016-07-01
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:30 /user/test/data/2016-07-01/poc_data_file.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:32 /user/test/data/2016-07-02
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-02/poc_data_file.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:31 /user/test/data/2016-07-02/poc_data_file2.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:32 /user/test/data/2016-07-03
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file2.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file3.txt
Now create an external table with a partition clause. Note the rowcount is zero initially since we have not defined any partitions yet. create external table file_data_partitioned (id int, textval string, amount double)
partitioned by (dateval string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/test/data';
select count(*) from file_data_partitioned;
hive> select count(*) from file_data_partitioned;
OK
0 Now manually define the 3 partitions on the data using ALTER TABLE commands. You need to specify the correct location for each partition. These partitions could be anywhere in HDFS. -----------------------------------------------
-- Add partitions manually
-----------------------------------------------
alter table file_data_partitioned add partition (dateval = '2016-07-01')
location '/user/test/data/2016-07-01';
alter table file_data_partitioned add partition (dateval = '2016-07-02')
location '/user/test/data/2016-07-02';
alter table file_data_partitioned add partition (dateval = '2016-07-03')
location '/user/test/data/2016-07-03';
---------------------------------------
-- Run statistics
---------------------------------------
analyze table file_data_partitioned compute statistics ; Now we can see & query the data in each partition. hive> select dateval, count(*)
> from file_data_partitioned
> group by dateval;
OK
2016-07-01 10
2016-07-02 20
2016-07-03 30 Important note though - if you choose this method of manual partitioning, you should always do it the same way each time you add data to the table. Otherwise you will get different directory structures in HDFS for the same table - data will be spread out among the cluster, which can get messy. Here's an example of this when you do an INSERT INTO command to create data for Partition 2017-07-31: insert into file_data_partitioned partition (dateval = '2016-07-31')
select id, textval, amount
from file_data_partitioned
where dateval = '2016-07-01';
[root@sandbox hdfs]# hadoop fs -ls -R /user/test/data
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:30 /user/test/data/2016-07-01
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:30 /user/test/data/2016-07-01/poc_data_file.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:32 /user/test/data/2016-07-02
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-02/poc_data_file.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:31 /user/test/data/2016-07-02/poc_data_file2.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:32 /user/test/data/2016-07-03
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file2.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file3.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-05 16:53 /user/test/data/dateval=2016-07-31
-rwxr-xr-x 1 hdfs hdfs 182 2016-07-05 16:53 /user/test/data/dateval=2016-07-31/000000_0
Note the new directory created for 2016-07-31 and see that it has a different structure - the default structure that Hive uses when Hive controls partitioning ( ... /dateval=2016-07-31/ ...) I hope this helps.
... View more
07-01-2016
04:25 PM
Hi @Avinash . Sorry for the delay in responding. I was able to do what you were attempting - please see below for explanation. I created the 2 tables (dg_1 and dg_2) and loaded them with data. By the way they needed to use datatype string - text won't work. create table dg_1 (id int, name string);
create table dg_2(id int, phone string);
insert into dg_1 values (1, 'Bill P');
insert into dg_2 values (1, '952-555-5555'); Now, you were attempting to create a function that processes these 2 tables and drops/inserts data into a 3rd table. Functions aren't really designed for this type of processing. Functions are User-Defined Functions (UDFs) - designed to do simple processing & return a value. Designed to be executed as part of a SQL expression. http://www.hplsql.org/create-function You should accomplish your task with a stored procedure instead. Here is a code sample that I just run inline (for ease of development). CREATE procedure load_dg_3
as
BEGIN
DROP TABLE if exists default.dg_join3 ;
CREATE TABLE default.dg_join3 (name string, phone string) ;
INSERT INTO TABLE default.dg_join3 SELECT a.name, b.phone from default.dg_1 a,default.dg_2 b where a.id = b.id ;
SELECT * FROM default.dg_join3 ;
print 'Completed...';
END;
execute load_dg_3;
> hplsql -f "load_dg_3"
And the output. 16/06/30 22:15:11 INFO jdbc.Utils: Supplied authorities: localhost:10000
16/06/30 22:15:11 INFO jdbc.Utils: Resolved authority: localhost:10000
Open connection: jdbc:hive2://localhost:10000 (233 ms)
Starting SQL statement
SQL statement executed successfully (x.xx sec)
Starting SQL statement
SQL statement executed successfully (x.xx sec)
Starting SQL statement
SQL statement executed successfully (x.xx sec)
Starting query
Query executed successfully (xx ms)
Bill P 952-555-5555
Completed... That should be what you're looking for. Hope this helps.
... View more
06-30-2016
02:17 PM
Hi @Avinash P. With HPL/SQL your functions do not get stored in the Hive metastore. It behaves a little different than conventional Relational databases. There are 3 ways to execute the functions and/or make the functions permanent in Hive. 1. Add the functions and stored procedures to the global .hplsqlrc file 2. Include the functions in your code by using the INCLUDE statement (or add INCLUDE statements to the .hplsqlrc file) 3. Have the function/stored procedure code duplicated inside each of your scripts (only for testing, never the best way). http://www.hplsql.org/udf-sproc
... View more