Member since: 12-09-2015 · Posts: 13 · Kudos Received: 3 · Solutions: 0
12-26-2015
01:39 PM
Please check whether hive.map.aggr is set to true. Change the table to ORC format and try running only the subquery; the results should come back faster. Also check whether the table can be bucketed on emp_type, and if so, try to implement an SMB join.
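The suggestions above could be sketched roughly as follows (table and column names other than emp_type are placeholders, and the bucket count is an arbitrary assumption, not a recommendation):

```sql
-- Enable map-side aggregation
SET hive.map.aggr=true;
SET hive.enforce.bucketing=true;

-- Recreate the table as ORC, bucketed on emp_type
-- (16 buckets is an illustrative choice; size it to your data)
CREATE TABLE emp_orc (
  emp_id   INT,
  emp_type STRING,
  load_dt  STRING)
CLUSTERED BY (emp_type) SORTED BY (emp_type ASC) INTO 16 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE emp_orc
SELECT emp_id, emp_type, load_dt FROM emp;
```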
12-26-2015
12:57 PM
Hi Gopal, I need to join two big tables and then join the result with another big table. I bucketed all 3 big tables without partitioning them and got good performance. Can you advise how to do this with partitioned tables on Tez, and provide some examples? The document you shared is really good, but I need more detail. Could you point me to detailed documentation on joins under Tez?
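For reference, the layout being asked about could be sketched like this (a sketch only; the table name, partition column, and bucket count are illustrative assumptions, not a verified Tez recipe):

```sql
-- One of the big tables, both partitioned and bucketed/sorted on the join key
CREATE TABLE big_a (
  join_key STRING,
  payload  STRING)
PARTITIONED BY (clnt_id STRING)
CLUSTERED BY (join_key) SORTED BY (join_key ASC) INTO 20 BUCKETS
STORED AS ORC;

-- Settings commonly involved in getting a sort-merge join on Tez
SET hive.execution.engine=tez;
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
```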
12-26-2015
12:48 PM
1 Kudo
Hi,
Sorry for being late on this. Yes, I am on MR only.
If I run a simple SQL query, it creates a common join and I see only 25 mappers, since it creates 25 splits. But if I set the sort-merge-bucket conditions below, I see 1000 mappers (number of splits: 1000):
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
I have also checked the explain plan; it shows "Sorted Merge Bucket Map Join Operator". Here are the steps. Since both tables join on multiple columns, I concatenated those columns into a single column, keycol, and created the bucket/join key on it.
CREATE TABLE employee(
keycol string,
src_plfm_id int,
emp_id int,
birth_day string,
age int,
addr1 string,
addr2 string,
addr3 string,
phn_no int,
load_dt string)
PARTITIONED BY (
clnt_id string)
clustered by (keycol) sorted by (keycol asc) into 20 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\n'
STORED AS RCFILE;
(the source table, employee_text, is in textfile format)
from employee_text
insert overwrite table employee partition (clnt_id)
select concat(src_plfm_id,load_dt,emp_id,birth_day),
src_plfm_id,
emp_id,
birth_day,
age,
addr1,
addr2,
addr3,
phn_no,
load_dt,
clnt_id
sort by concat(src_plfm_id,load_dt,emp_id,birth_day);
CREATE TABLE employee_error(
keycol string,
src_plfm_id int,
emp_id int,
birth_day string,
error_done string,
load_dt string)
PARTITIONED BY (
clnt_id string)
clustered by (keycol) sorted by (keycol asc) into 20 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\n'
STORED AS RCFILE;
from employee_error_text
insert overwrite table employee_error partition (clnt_id)
select concat(src_plfm_id,load_dt,emp_id,birth_day),
src_plfm_id,
emp_id,
birth_day,
error_done,
load_dt,
clnt_id
sort by concat(src_plfm_id,load_dt,emp_id,birth_day);
create temporary table emp_error_temp
as select
a.src_plfm_id,
a.emp_id,
a.birth_day,
a.age,
a.addr1,
a.addr2,
a.addr3,
a.phn_no,
a.load_dt,
b.error_done
from employee a JOIN
employee_error b
on (a.keycol = b.keycol);
This query is creating 50*20 = 1000 splits (mappers). Why?
12-20-2015
04:28 PM
Neeraj, I am not looking at reducers currently; I am worried about the 1000 mappers. Does a bucket map join always work this way? That is, if there are 50 partitions and I create 20 buckets, will it always create 1000 mappers? Since both tables in the join are big, I have to use a bucket map join; otherwise it creates a common join. I have already run stats on both tables.
12-19-2015
02:10 PM
2 Kudos
I have to join two large tables, so I am trying to use a sort-merge bucket map join. Both tables are partitioned on client id. There are 50 clients, and I have bucketed both tables into 20 buckets on another id field, which is used in the join. When I execute the SMB join, I see 50*20 = 1000 mappers. Please advise how to correct that.
12-13-2015
04:54 PM
I am already aware of this. I need to learn Hive performance tuning in depth.
12-13-2015
04:53 PM
Sorry Neeraj, I asked a practical question and I don't see an answer to it in the link above.
12-13-2015
04:49 PM
Hi Guil, what is the benefit of running Hive QL in spark-shell? Can't I just set hive.execution.engine to spark and run the queries? Thanks.
12-13-2015
04:40 PM
You don't have permission to run DDL operations on this table. Please check your access settings in Sentry.
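If the access settings do turn out to be the problem, a minimal sketch of granting rights through Sentry might look like this (the role, group, and database names are hypothetical):

```sql
-- Run as a user with Sentry admin rights (e.g. from beeline)
CREATE ROLE etl_role;
GRANT ALL ON DATABASE mydb TO ROLE etl_role;
GRANT ROLE etl_role TO GROUP etl_group;
```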
12-09-2015
05:07 PM
Are you facing this issue while trying to load data into a large table?
12-09-2015
03:57 PM
I am running:
insert overwrite table emp_rc partition(year) select *, year_num from table1;
table1 is 1.8 GB and is a textfile table, while emp_rc is RCFile. This SQL used to take 1 hour; after setting mapreduce.job.reduces = 30 it takes 15 minutes. Can you advise what else I can do to improve performance? Can you also explain how to analyse the explain plan for this type of HQL?
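A few commonly used knobs for this kind of dynamic-partition insert (a sketch only, not a tuned configuration; the codec choice is illustrative):

```sql
-- Compress output to cut write I/O
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Allow dynamic partitioning on the year column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Inspect the plan before running
EXPLAIN
INSERT OVERWRITE TABLE emp_rc PARTITION (year)
SELECT *, year_num FROM table1;
```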
- Tags:
- Data Processing
- Hive
12-09-2015
03:54 PM
There are 3 tables. The client table has only 300 records; I need it to get the client name.
The person_detail table has 180 million records, is partitioned on the client_id field, and is stored as Parquet.
The emp_cat table has 1000 records; I need a distinct because this table has duplicates.
Insert overwrite directory '/apps/dev/report_output.txt'
select distinct
report_query1.Client,
report_query1.Name,
report_query1.Age,
report_query1.Address,
Query1.category
from
(select distinct
B.client_id Client,
A.client_name Name,
B.age,
B.address,
B.cat_id
from client A JOIN
(select distinct
client_id client_id,
age age,
address address,
cat_id cat_id
from person_detail
where client_id = '5454'
and entry_date > '2015-01-01' and entry_date < '2015-03-31') B
on (A.clientid = B.client_id)
) report_query1
LEFT OUTER JOIN (select distinct client_id, cat_id, category from emp_cat) Query1
on report_query1.cat_id = Query1.cat_id
and report_query1.Client = Query1.client_id
Questions:
1. I am not able to read the explain plan for this query. Can someone help?
2. The query output comes back in 30 minutes. How can I improve performance? Can I make all 3 tables Parquet?
3. I have to store the data in a file. Is this the right way, or is there a better one?
4. I fetched the person_detail data in a subquery, since that table is partitioned on client id and this reduces the data going into the join. Is that right?
5. What else can I do to improve performance?
6. I have kept the big table on the right side for a map-side join.
- Tags:
- Data Processing
- Hive