Member since: 01-24-2017
Posts: 6
Kudos Received: 0
Solutions: 0
12-14-2017
02:29 PM
Using bullet (1.), we were able to fix the issue!
12-11-2017
02:31 PM
Did your workaround 1 solve your problem, or was that an unsuccessful attempt? It's interesting because I'm having the same issue, yet I was able to sqoop the same date range a few days ago, but with a different split-by column. Thanks!
09-21-2017
08:05 PM
This is interesting. Thank you for posting the resolution to this comparison issue, as I was able to use a similar approach for my own problem!
I am on HDP 2.6.1 and saw similar unexpected behavior. When selecting a timestamp column, I wanted to grab all rows where the timestamp fell within a certain date. However, when the rows were returned, the timestamp column was "zeroed" to "yyyy-mm-dd 00:00:00", where yyyy-mm-dd was the actual day. I found it bizarre, since I had only applied a function in my WHERE clause, not to the selected column. Here is an example:
-- Unexpected query behavior
SELECT time_stamp_column
FROM my_table
WHERE date(time_stamp_column) = '2017-09-15'
limit 4;
-- Unexpected results (no rows in the table actually have a "00:00:00 0" timestamp)
2017-09-15 00:00:00 0
2017-09-15 00:00:00 0
2017-09-15 00:00:00 0
2017-09-15 00:00:00 0
-- Workaround: query using an explicit timestamp range instead of date()
SELECT time_stamp_column
FROM my_table
WHERE time_stamp_column >= '2017-09-15 00:00:00' AND
time_stamp_column < '2017-09-16 00:00:00'
limit 4;
-- Expected/desired results (this is the actual data; no rows have exactly the 00:00:00 timestamp)
2017-09-15 17:25:28.766248
2017-09-15 17:29:05.427199
2017-09-15 17:29:08.219565
2017-09-15 17:33:20.907088
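As a side note, the upper bound of that range can be derived rather than hard-coded. This is just a minimal sketch of an equivalent filter (date_add() and the explicit cast are standard HiveQL, but I have not re-tested it against the zeroing behavior above):
-- Equivalent range filter with a derived upper bound (sketch)
SELECT time_stamp_column
FROM my_table
WHERE time_stamp_column >= '2017-09-15 00:00:00'
AND time_stamp_column < cast(date_add('2017-09-15', 1) AS timestamp)
LIMIT 4;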
Thanks @knarendran ! Hopefully someone else will find this useful.
09-12-2017
09:24 PM
The version is HDP 2.6.1. Hive tells me "hive.tez.dynamic.semijoin.reduction is undefined", and then "Query returned non-zero code: 1, cause: hive configuration hive.tez.dynamic.semijoin.reduction does not exists." if I try to set it to true. Thanks for the response, Eugene! I was more or less experimenting with the cardinality check, but I will remove that setting to see whether it alone affects performance. I did read that info when I enabled it, but since it said the check could impact performance, I tried it out. Considering the file sizes, would you say there is a maximum bucket size to watch for? The consensus seems to be that there is no tried-and-true rule for bucket size.
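For reference, a property can be inspected from beeline before attempting to set it; running SET with just the key prints its current value. A minimal sketch (this only reports the value, it is not a fix):
-- Prints the current value; on this build it reports that the
-- parameter is undefined, matching the error message above
SET hive.tez.dynamic.semijoin.reduction;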
09-12-2017
07:26 PM
Question: Can I reduce the time it takes for a merge to execute against a ~400 million row table? I'm using the ACID MERGE functionality, similar to the posts hive-acid-merge-by-example & update-hive-tables-easy-way-2, to track Slowly Changing Dimensions. This works great for capturing some time-series data and behaves as expected. This Hortonworks post reports a merge that took less time than mine, with more records in the staging table (and fewer nodes). I'd really like to cut down on the time this task takes, though: it currently takes ~30-50 minutes to merge 200k-500k records into the 400 million row destination table. DDL and SQL (simplified):
-- Table containing new records from a source table
-- update_stamp tracks when the row was updated last in source
CREATE TABLE db.stg_tbl
(
id int,
nbr smallint,
column3 decimal(9,2),
update_stamp timestamp
)
STORED AS ORC;
-- Destination table, tracking the valid_from and valid_to times:
-- a NULL valid_to means the row is the current state, while a
-- timestamp in valid_to means it is a historical record, with a
-- newer row showing the current state of that (id, nbr) combo
CREATE TABLE db.dest_tbl
(
id int,
nbr smallint,
column3 decimal(9,2),
update_stamp timestamp,
VALID_FROM timestamp,
VALID_TO timestamp
)
CLUSTERED BY (id, nbr) INTO 24 BUCKETS
STORED AS ORC
TBLPROPERTIES("transactional"="true");
MERGE INTO db.dest_tbl target
USING(
--base staging
SELECT
stg_a.id as join_id,
stg_a.nbr as join_nbr,
stg_a.*
FROM
db.stg_tbl stg_a
UNION ALL
-- Generate an extra row for changed records; the NULL
-- join_id/join_nbr key means it will be INSERTed
SELECT
NULL,
NULL,
stg_b.*
FROM
db.stg_tbl stg_b
JOIN
db.dest_tbl tgt
ON
stg_b.id = tgt.id AND
stg_b.nbr = tgt.nbr
WHERE
(stg_b.update_stamp <> tgt.update_stamp) AND
tgt.valid_to IS NULL
) sub
ON
sub.join_id = target.id AND
sub.join_nbr = target.nbr
WHEN MATCHED
AND sub.update_stamp <> target.update_stamp AND target.valid_to IS NULL
THEN UPDATE SET valid_to = current_timestamp()
WHEN NOT MATCHED
THEN INSERT VALUES (sub.id, sub.nbr, sub.column3, sub.update_stamp, current_timestamp(), NULL);
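As a sanity check related to the cardinality setting below (a quick diagnostic sketch, not part of the merge itself), duplicate keys in the staging table can be spotted with a simple aggregate, since multiple source rows matching one target row is what the cardinality check guards against:
-- Hypothetical diagnostic: any (id, nbr) pair appearing more than once in staging
-- would cause multiple source rows to match a single target row in the MERGE
SELECT id, nbr, count(*) AS dup_cnt
FROM db.stg_tbl
GROUP BY id, nbr
HAVING count(*) > 1;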
Note that the destination table above is clustered by the two columns (which is necessary) into 24 buckets. I chose that number to land at roughly one to three blocks per bucket/file, aiming to stay within a 1-10 blocks-per-file recommendation I read. Bumping the buckets up to 48, just to double it and see the side effects, does little to nothing. With the Hive parameters below I saw a 45% reduction in runtime, from 2909 seconds to 1547 seconds (not sure whether other jobs running at the time had anything to do with it). My concern is that adding more buckets only creates more small files that don't even fill one block size. Not to mention, the deltas created by each merge contain only the 200k-500k records, which is a small amount of data to spread across the buckets.
SET hive.merge.cardinality.check=false;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.vectorized.execution.enabled=false;
SET hive.vectorized.execution.reduce.enabled=false;
SET hive.enforce.bucketing=true;
SET hive.exec.parallel=true;
SET hive.auto.convert.join=false;
SET hive.enforce.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.optimize.bucketmapjoin=true;
There is a third reducer stage that seems to take the most time, and it runs with a low number of reducers (2, or 6 with the 48-bucket attempt). Any insight into how I could structure this table or tune the query would be helpful.
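For anyone digging into something similar: the small delta files each merge leaves behind can also be a factor, and they can be folded together with a compaction. A minimal sketch (SHOW COMPACTIONS and ALTER TABLE ... COMPACT are standard Hive ACID statements; whether this would help my runtime is an open question):
-- Check queued/running/completed compactions on ACID tables
SHOW COMPACTIONS;
-- Request a major compaction to fold the merge deltas into base files
ALTER TABLE db.dest_tbl COMPACT 'major';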
01-24-2017
07:32 PM
(SOLUTION) Add blank lines at the end of your file. I'm wondering if anyone has run into an issue we experienced while developing some Oozie workflows that contain hive2 actions, each executing a SQL script that we point to in the workflow: the hive2 action executes the query from a file but does not read the last line of that file.
The scenario is that Falcon contains the value, passed in as a parameter to the Oozie workflow. The orchestration works fine and the script is called, in some cases seemingly perfectly. Looking more closely in Ambari while the script is running, though, the last line of the file/script is never read by beeline. In the Oozie logs the SQL script appears with all of its lines (which leaves you confused as to why it failed or gave odd results). Only in the Ambari Tez View, where the lines actually being run from the file are displayed, might you notice that the last line shown, using the example below, is "FROM TABLE_ONE". In my scenario that means the WHERE clause is silently ignored: the job may run fine, but it pulls all data instead of only the partitioned wk_dt.
<script>${some_sql_here}</script> in Oozie workflow
<property name="some_sql_here" value="/location/of/some_sql.sql"/> in Falcon
--Sample SQL Below in file "some_sql.sql"
INSERT INTO DESTINATION_TABLE
SELECT *
FROM TABLE_ONE
WHERE wk_dt = '2017-01-01';
Our solution has been to add two extra blank lines at the end of the file as a buffer, so that all lines of the DML are read. It works fine this way. Let me know if anyone else has experienced this; it has occurred every time I have used a hive2 action without adding the extra blank lines at the end.
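To illustrate the failure mode with the sample file (this is my reading of what the Tez View showed, not a verbatim log capture), the statement beeline effectively executed was still valid SQL, just without the filter, which is why the job succeeded while scanning every partition:
-- What effectively ran when the last line of some_sql.sql was not read
INSERT INTO DESTINATION_TABLE
SELECT *
FROM TABLE_ONE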