About sagar_girish

sagar_girish · ‎06-19-2017

@Sonny Heer So we definitely need a subquery in either case (GROUP BY or windowing). And yes, windowing is much faster than GROUP BY here. The simple logic: say you have 1 million rows. GROUP BY will first sort the data and then group by the key, whereas windowing will just sort and give you the first entry. However, if your dataset is not very large, you can live with GROUP BY; it will hardly make any difference. Can you please run both queries (windowing and GROUP BY) and check a couple of things: 1. The number of map and reduce tasks in each query. 2. Whether the time difference between the two queries is more than 2 minutes, or almost the same.

sagar_girish · ‎06-16-2017

Hi @Sonny Heer, what I understand from your question is that you have multiple tables, say A, B, C, D, etc., and you are running a query that joins A left join B left join C, etc., and there are multiple entries in tables B, C, D for the key matching A. If this is the case, I would suggest using a windowing function: SELECT A.a, B.b, C.c FROM A LEFT JOIN (SELECT * FROM (SELECT B.b, B.key, ROW_NUMBER() OVER (PARTITION BY key) AS row_num FROM B) t WHERE row_num = 1) B ON A.key = B.key, and so on for C, D, etc. Try this out and let me know if it was helpful. Cheers, Sagar
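To make the idea above concrete, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here (it needs SQLite 3.25 or newer for window functions), the sample rows are made up, and the `ORDER BY b` inside the window is my assumption so that "first entry" is deterministic; the post itself does not specify an ordering.

```python
import sqlite3

# Toy tables: key 1 appears twice in B, which is what causes the duplicates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (key INTEGER, a TEXT);
    CREATE TABLE B (key INTEGER, b TEXT);
    INSERT INTO A VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO B VALUES (1, 'b1-x'), (1, 'b1-y'), (2, 'b2');
""")

# Plain left join: every matching row in B produces an output row.
naive = conn.execute("""
    SELECT A.key, A.a, B.b
    FROM A LEFT JOIN B ON A.key = B.key
    ORDER BY A.key, B.b
""").fetchall()

# Keep only one row per key in B (row_num = 1) before joining.
deduped = conn.execute("""
    SELECT A.key, A.a, B1.b
    FROM A
    LEFT JOIN (
        SELECT key, b
        FROM (
            SELECT key, b,
                   ROW_NUMBER() OVER (PARTITION BY key ORDER BY b) AS row_num
            FROM B
        ) t
        WHERE row_num = 1
    ) B1 ON A.key = B1.key
    ORDER BY A.key
""").fetchall()

print(len(naive), "rows without dedup;", len(deduped), "rows with dedup")
```

The same subquery shape would be repeated for C, D, and so on, each with its own alias.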

sagar_girish · ‎06-16-2017

Hi @Simran Kaur, if you want to use this within the script, you can do the following: set hivevar:DATE=current_date; INSERT OVERWRITE DIRECTORY '/user/xyz/reports/oos_table_sales/${DATE}' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' SELECT * FROM outputs.oos_table_sale; Cheers, Sagar
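One caveat worth checking: as far as I understand it, Hive variable substitution is plain text replacement, so `set hivevar:DATE=current_date` may put the literal text `current_date` into the directory path rather than today's date. A possible workaround, sketched below, is to compute the date outside Hive and pass it in with `--hivevar`. The script name `export_oos_table_sale.hql` and the helper `build_hive_command` are hypothetical names used only for illustration.

```python
from datetime import date


def build_hive_command(script_path, run_date=None):
    """Build a hive CLI invocation that passes the date in as a hivevar.

    script_path is assumed to contain the INSERT OVERWRITE DIRECTORY
    statement using ${DATE} in the target path.
    """
    run_date = run_date or date.today()
    return ["hive", "--hivevar", f"DATE={run_date.isoformat()}", "-f", script_path]


# Fixed date used here only so the printed command is reproducible.
cmd = build_hive_command("export_oos_table_sale.hql", date(2017, 6, 16))
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where the `hive` client is installed.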

sagar_girish · ‎06-12-2017

Hi @rama, the query below should make your job run faster: insert into table dropme_master_6 select a.* from dropme_master_5 a left outer join dropme_master_6 b on a.consumer_sequence_id = b.consumer_sequence_id where b.consumer_sequence_id is null;

sagar_girish · ‎06-12-2017

Hi @rama, please change your query to this: insert into table dropme_master_6 select a.* from dropme_master_5 a left outer join dropme_master_6 b on a.consumer_sequence_id = b.consumer_sequence_id where b.consumer_sequence_id is null; I am pretty confident this will improve your performance. Please let me know if it works and give me a thumbs up. 🙂 Regards, Sagar Morakhia
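The query above is an anti-join: the left outer join plus the `IS NULL` filter keeps only the rows of dropme_master_5 whose consumer_sequence_id is not yet in dropme_master_6. Here is a small sketch of that behaviour using SQLite from Python as a stand-in for Hive. The `payload` column and the sample rows are assumptions; the original thread does not show the table schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dropme_master_5 (consumer_sequence_id INTEGER, payload TEXT);
    CREATE TABLE dropme_master_6 (consumer_sequence_id INTEGER, payload TEXT);
    INSERT INTO dropme_master_5 VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO dropme_master_6 VALUES (2, 'two');
""")


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM dropme_master_6").fetchone()[0]


def insert_missing(conn):
    """Insert rows from dropme_master_5 that dropme_master_6 lacks; return how many."""
    before = count_rows(conn)
    conn.execute("""
        INSERT INTO dropme_master_6
        SELECT a.*
        FROM dropme_master_5 a
        LEFT OUTER JOIN dropme_master_6 b
          ON a.consumer_sequence_id = b.consumer_sequence_id
        WHERE b.consumer_sequence_id IS NULL
    """)
    return count_rows(conn) - before


inserted = insert_missing(conn)   # ids 1 and 3 are missing
again = insert_missing(conn)      # nothing left to insert on a second run
ids = [r[0] for r in conn.execute(
    "SELECT consumer_sequence_id FROM dropme_master_6 ORDER BY 1")]
print(inserted, again, ids)
```

Selecting `a.*` rather than `*` matters: a bare `*` would return the columns of both sides of the join, which would not match the target table's column list.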

sagar_girish · ‎06-12-2017

@Ravi Chinni insert into some_test_table select 'c1_val',named_struct('c2_a',array(cast (null as string)),'c2_c',cast (null as string)),array(named_struct('c3_a',cast (null as string) ,'c3_b',cast (null as string))) from z_dummy; This should work for you. Please upvote if the answer was useful. 🙂

sagar_girish · ‎06-11-2017

Can you provide a sample input entry?

sagar_girish · ‎06-10-2017

Hi @rama, first of all, I would suggest killing such jobs after 2 to 2.5 hours, especially when the job finishes in half an hour on a normal day. One probable cause is that another job is using 90% or more of the CPU and slowing your job down. If you can share your entire query, I may be able to suggest a few set parameters that will help the query run faster. Cheers, Sagar

sagar_girish · ‎06-10-2017

Perfect answer!

sagar_girish · ‎05-25-2017

Hi, nice article. I found a faster way of doing the same in the official Oozie documentation: https://oozie.apache.org/docs/4.1.0/DG_WorkflowReRun.html Rerunning an Oozie job is generally an ad-hoc task, and you may not want to create an XML file just for that. The command-line form is as follows: oozie job -oozie http://localhost:11000/oozie -rerun 14-20090525161321-oozie-joe -Doozie.wf.rerun.skip.nodes=<> For example: oozie job -oozie http://localhost:11000/oozie -rerun 14-20090525161321-oozie-joe -Doozie.wf.rerun.skip.nodes=action1,action2,action3 where http://localhost:11000/oozie --> is the host where Oozie is running, 14-20090525161321-oozie-joe --> is your Oozie job ID, and action1,action2,action3 --> are the steps that you want to skip. It ultimately does the same thing as described in the article, but this way we don't have to create the config file. Cheers, Sagar
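If the rerun is triggered from a script, the command above can be assembled from a list of action names. This is only a sketch: the helper `build_rerun_command` is a hypothetical name, and the flags are copied from the command quoted in the post rather than checked against a particular Oozie version.

```python
def build_rerun_command(oozie_url, job_id, skip_nodes):
    """Assemble the oozie CLI rerun command, skipping the given action nodes."""
    skip = ",".join(skip_nodes)
    return [
        "oozie", "job",
        "-oozie", oozie_url,
        "-rerun", job_id,
        f"-Doozie.wf.rerun.skip.nodes={skip}",
    ]


cmd = build_rerun_command(
    "http://localhost:11000/oozie",
    "14-20090525161321-oozie-joe",
    ["action1", "action2", "action3"],
)
print(" ".join(cmd))
```

Building the argument list explicitly avoids quoting problems when action names are collected programmatically.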

Member Since	‎05-23-2017 09:26 AM
Last Visited	‎08-29-2018 11:38 AM
Posts	28
Kudos received	10

Cloudera Community

Re: Handling Multiple joins creating duplicates

Re: Handling Multiple joins creating duplicates

Re: Handling Multiple joins creating duplicates

Re: How to use current date as value for a variabl...

Re: Hi team , hive job taking so much time compare...

Re: Hi team , hive job taking so much time compare...

Re: How to insert NULL value into Hive complex col...

Re: Specifying delimiters for Hive table with nest...

Re: Hi team , hive job taking so much time compare...

Re: Load data to external table using date field a...

Re: How to re-run failed action from oozie workflo...