Member since: 09-25-2015
Posts: 112
Kudos Received: 88
Solutions: 12

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 9545 | 03-15-2017 03:17 PM
 | 5977 | 02-17-2017 01:03 PM
 | 1756 | 01-02-2017 10:47 PM
 | 2642 | 11-16-2016 07:03 PM
 | 1056 | 10-07-2016 05:24 PM
08-01-2016
11:27 AM
1 Kudo
Hi @Upendra Bhandari. FYI I updated my original answer to add more info about why to use string instead of varchar.
07-29-2016
04:33 PM
Interesting that there is an upper bound for the string datatype. Thanks @Kashif Khan.
07-29-2016
04:31 PM
8 Kudos
Personal anecdotal evidence: with EXTERNAL (externally managed) tables pointing to text files, we saw no performance difference between string, varchar, and char. With ORC files we always used string as the datatype, so I can't speak to any potential performance differences there (except for vectorization - see below).

I would recommend string if at all possible. You are correct that it is very handy not to be limited by a length specifier: even if the incoming data is only varchar(30) in length, your ELT/ETL processing will not fail if you send in 31 characters to a string column.

Also note that in Hive versions older than 0.14 (i.e., on HDP 2.1 and older), varchar/char will not use vectorization with your ORC files, whereas string can. That can be a big difference in query performance. Use string if possible. 😉

** Edited to add more info about varchar vs. string **

Here is an example of what happens when you push too much data into a varchar(20) field.

create table if not exists test_tbl (id int, stringval string, varchar20_val varchar(20));

-- insert 20 characters of text into both fields
insert into test_tbl values (1, '01234567890123456789', '01234567890123456789');
-- insert 25 characters of text into the string field, 20 chars into the varchar(20) field
insert into test_tbl values (2, '0123456789012345678901234', '01234567890123456789');
-- insert 25 characters of text into both fields
insert into test_tbl values (3, '0123456789012345678901234', '0123456789012345678901234');

-- Note - row 3's varchar field has its last 5 characters silently truncated. This is bad.
select id,
length(stringval) as `length_string`, stringval,
length(varchar20_val) as `length_varchar`, varchar20_val
from test_tbl;

The output is as follows:

id length_string stringval length_varchar varchar20_val
1 20 01234567890123456789 20 01234567890123456789
2 25 0123456789012345678901234 20 01234567890123456789
3 25 0123456789012345678901234 20 01234567890123456789
Time taken: 1.584 seconds, Fetched: 3 row(s)

In row 3 you have lost the last 5 characters of data without warning. IMHO this is the most compelling reason to use string instead of varchar.
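If you want to confirm whether vectorization actually kicks in for a query, here is a minimal sketch (assuming Hive 0.14+ and an ORC-backed table - vectorized execution historically required ORC; hive.vectorized.execution.enabled is the standard switch):

set hive.vectorized.execution.enabled = true;
-- For stages that vectorize, the EXPLAIN output includes "Execution mode: vectorized"
explain select id, length(stringval) from test_tbl;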
07-12-2016
02:08 PM
OK, that's good. I have not seen any specific issue with NOT IN. Without seeing the data it stands to be asked again - you have confirmed that (a) count(*) on the inner query returns results, (b) the query is not failing silently, and (c) there are one or more email addresses in June that do not exist prior to June?

Here are two things to check to see if it is indeed a Hive issue.

(1) Run the query once with MapReduce as the execution engine and then with Tez, and see if you get differing results:

set hive.execution.engine=mr;
set hive.execution.engine=tez;

(2) Change the query to use a temporary table and evaluate your results that way (double-check my syntax, but it should be close to correct):

create temporary table temp_table
as select customer_email FROM Table1
WHERE TO_DATE(created_at) < '2016-05-31';

select DISTINCT SF.customer_email
FROM Magento.sales_flat_order SF
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND MONTH(TO_DATE(SF.created_at)) = '6'
AND SF.customer_email NOT IN (select customer_email from temp_table);

Or, equivalently, as a LEFT JOIN anti-join:

select DISTINCT SF.customer_email
FROM Magento.sales_flat_order SF
left join temp_table
on SF.customer_email = temp_table.customer_email
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND MONTH(TO_DATE(SF.created_at)) = '6'
AND temp_table.customer_email is NULL;
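One more classic NOT IN gotcha worth ruling out while you are at it (this is general SQL behavior, not something I can confirm in your data): if the subquery returns even a single NULL, NOT IN evaluates to no rows at all. A quick check against the temp table above:

select count(*) from temp_table where customer_email is null;
-- if this is > 0, filter NULLs out of the subquery
-- (... where customer_email is not null) or use the LEFT JOIN form above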
07-12-2016
01:36 PM
Hi @Simran Kaur. I notice this is a similar question to the last one you posted here: (https://community.hortonworks.com/questions/43885/not-in-operator-in-hive-with-a-subquery.html#answer-43944). I am assuming that back then you checked the results of the inner query and the outer query and that you got the results you expected. Your query above should return zero rows - you are looking for email addresses created in June of 2016 that are NOT in the list of email addresses created after May of 2016, and every June address is, by definition, in that list. So rather than an error, an empty set is the correct result.
07-07-2016
01:45 PM
Hi @Simran Kaur. I added an answer to your other version of this question, found here: (https://community.hortonworks.com/questions/43885/not-in-operator-in-hive-with-a-subquery.html#answer-43944). One other thing I like to look for is leading or trailing blanks in the comparison field (the email address). In this instance, though, it should not cause a problem, since you are using the same table (and the same email addresses) in the inner and outer query.
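If you ever do need to rule out stray blanks, a quick sketch (using the table and column names from your query):

select count(*) from Table1 where customer_email <> trim(customer_email);
-- any rows counted here have leading or trailing whitespace in the email field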
07-07-2016
01:39 PM
Hi @Simran Kaur. This query 'should' work. 😉 Quick things to double-check...

1. Do these two queries return counts > 0?

select count(*)
FROM Table1 SF
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND MONTH(TO_DATE(SF.created_at)) = '6';

select count(*) FROM Table1 SFO
WHERE TO_DATE(SFO.created_at) < '2016-05-31';

2. Based on the counts returned - is it possible that there are no June 2016 customers whose email addresses do not already appear prior to June 2016 (i.e., the correct result is in fact zero)?
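If both counts are positive, one direct way to see whether the expected answer really is zero is a LEFT JOIN anti-join, which also sidesteps NOT IN's NULL quirks (a sketch, using the table/column names from your question):

select count(*)
FROM Table1 SF
left join (select customer_email FROM Table1
WHERE TO_DATE(created_at) < '2016-05-31') pre
on SF.customer_email = pre.customer_email
WHERE YEAR(TO_DATE(SF.created_at)) = '2016'
AND MONTH(TO_DATE(SF.created_at)) = '6'
AND pre.customer_email is NULL;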
07-05-2016
05:32 PM
1 Kudo
Hi @Johnny Fuger. When you have a set of files in an existing directory structure and you are not able to move the files around, there is a way to create a partitioned Hive table: you can define the partitions manually (explicitly). Note that you are controlling each partition - you create the table, then add each partition via an ALTER TABLE command.

Here is an example with three days' worth of files in three different directories: the first directory has 1 file (10 records total), the second has 2 files (20 records total), and the third has 3 files (30 records total):

hadoop fs -mkdir -p /user/test/data/2016-07-01
hadoop fs -mkdir -p /user/test/data/2016-07-02
hadoop fs -mkdir -p /user/test/data/2016-07-03
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-01
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-02/poc_data_file2.txt
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-02
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-03
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-03/poc_data_file2.txt
hadoop fs -put /tmp/poc_data_file.txt /user/test/data/2016-07-03/poc_data_file3.txt
[root@sandbox hdfs]# hadoop fs -ls -R /user/test/data
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:30 /user/test/data/2016-07-01
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:30 /user/test/data/2016-07-01/poc_data_file.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:32 /user/test/data/2016-07-02
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-02/poc_data_file.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:31 /user/test/data/2016-07-02/poc_data_file2.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:32 /user/test/data/2016-07-03
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file2.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file3.txt
Now create an external table with a partition clause. Note the row count is zero initially, since we have not defined any partitions yet.

create external table file_data_partitioned (id int, textval string, amount double)
partitioned by (dateval string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/test/data';
hive> select count(*) from file_data_partitioned;
OK
0

Now manually define the 3 partitions using ALTER TABLE commands. You need to specify the correct location for each partition - these partitions could be anywhere in HDFS.

-----------------------------------------------
-- Add partitions manually
-----------------------------------------------
alter table file_data_partitioned add partition (dateval = '2016-07-01')
location '/user/test/data/2016-07-01';
alter table file_data_partitioned add partition (dateval = '2016-07-02')
location '/user/test/data/2016-07-02';
alter table file_data_partitioned add partition (dateval = '2016-07-03')
location '/user/test/data/2016-07-03';
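-----------------------------------------------
-- (Added sketch) Verify the partitions registered;
-- SHOW PARTITIONS is standard Hive. Expected output here:
-- dateval=2016-07-01, dateval=2016-07-02, dateval=2016-07-03
-----------------------------------------------
show partitions file_data_partitioned;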
---------------------------------------
-- Run statistics
---------------------------------------
analyze table file_data_partitioned compute statistics;

Now we can see and query the data in each partition.

hive> select dateval, count(*)
> from file_data_partitioned
> group by dateval;
OK
2016-07-01 10
2016-07-02 20
2016-07-03 30

Important note though - if you choose this method of manual partitioning, you should always do it the same way each time you add data to the table. Otherwise you will get different directory structures in HDFS for the same table - data will be spread out across the cluster, which can get messy. Here's an example of what happens when an INSERT INTO command creates data for partition 2016-07-31:

insert into file_data_partitioned partition (dateval = '2016-07-31')
select id, textval, amount
from file_data_partitioned
where dateval = '2016-07-01';
[root@sandbox hdfs]# hadoop fs -ls -R /user/test/data
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:30 /user/test/data/2016-07-01
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:30 /user/test/data/2016-07-01/poc_data_file.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:32 /user/test/data/2016-07-02
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-02/poc_data_file.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:31 /user/test/data/2016-07-02/poc_data_file2.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-01 22:32 /user/test/data/2016-07-03
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file2.txt
-rw-r--r-- 1 hdfs hdfs 1024 2016-07-01 22:32 /user/test/data/2016-07-03/poc_data_file3.txt
drwxr-xr-x - hdfs hdfs 0 2016-07-05 16:53 /user/test/data/dateval=2016-07-31
-rwxr-xr-x 1 hdfs hdfs 182 2016-07-05 16:53 /user/test/data/dateval=2016-07-31/000000_0
Note the new directory created for 2016-07-31: it has a different structure - the default structure Hive uses when Hive controls the partitioning ( ... /dateval=2016-07-31/ ... ). I hope this helps.
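As a side note (my addition - worth verifying on your Hive version): if your directories already follow Hive's key=value naming convention, you can have Hive discover the partitions for you instead of adding them one by one:

msck repair table file_data_partitioned;
-- only discovers directories named like dateval=2016-07-31; the plain date-named
-- directories above still need explicit ALTER TABLE ... ADD PARTITION ... LOCATION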
07-01-2016
04:25 PM
Hi @Avinash. Sorry for the delay in responding. I was able to do what you were attempting - please see below for the explanation. I created the two tables (dg_1 and dg_2) and loaded them with data. By the way, they needed to use datatype string - text won't work.

create table dg_1 (id int, name string);
create table dg_2 (id int, phone string);
insert into dg_1 values (1, 'Bill P');
insert into dg_2 values (1, '952-555-5555');

Now, you were attempting to create a function that processes these two tables and drops/inserts data into a third table. Functions aren't really designed for this type of processing: they are User-Defined Functions (UDFs), designed to do simple processing, return a value, and be executed as part of a SQL expression. http://www.hplsql.org/create-function

You should accomplish your task with a stored procedure instead. Here is a code sample that I just ran inline (for ease of development):

CREATE procedure load_dg_3
as
BEGIN
DROP TABLE if exists default.dg_join3 ;
CREATE TABLE default.dg_join3 (name string, phone string) ;
INSERT INTO TABLE default.dg_join3 SELECT a.name, b.phone from default.dg_1 a,default.dg_2 b where a.id = b.id ;
SELECT * FROM default.dg_join3 ;
print 'Completed...';
END;
execute load_dg_3;

Run it with the hplsql command-line client:

> hplsql -f "load_dg_3"

And the output:

16/06/30 22:15:11 INFO jdbc.Utils: Supplied authorities: localhost:10000
16/06/30 22:15:11 INFO jdbc.Utils: Resolved authority: localhost:10000
Open connection: jdbc:hive2://localhost:10000 (233 ms)
Starting SQL statement
SQL statement executed successfully (x.xx sec)
Starting SQL statement
SQL statement executed successfully (x.xx sec)
Starting SQL statement
SQL statement executed successfully (x.xx sec)
Starting query
Query executed successfully (xx ms)
Bill P 952-555-5555
Completed...

That should be what you're looking for. Hope this helps.
06-30-2016
02:17 PM
Hi @Avinash P. With HPL/SQL, your functions do not get stored in the Hive metastore - it behaves a little differently than conventional relational databases. There are three ways to execute your functions and/or make them permanent in Hive (see the sketch after this list):

1. Add the functions and stored procedures to the global .hplsqlrc file.
2. Include the functions in your code by using the INCLUDE statement (or add INCLUDE statements to the .hplsqlrc file).
3. Duplicate the function/stored procedure code inside each of your scripts (only for testing, never the best way).

http://www.hplsql.org/udf-sproc
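For example, option 2 might look like this (a sketch - the path and file name are placeholders, not from your environment):

INCLUDE /home/user/hplsql/shared_functions.sql
-- shared_functions.sql would hold your CREATE FUNCTION / CREATE PROCEDURE statements;
-- putting this INCLUDE line in .hplsqlrc makes them available to every hplsql session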