Member since: 09-25-2015
Posts: 112
Kudos Received: 88
Solutions: 12
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 9623 | 03-15-2017 03:17 PM |
| | 6032 | 02-17-2017 01:03 PM |
| | 1782 | 01-02-2017 10:47 PM |
| | 2691 | 11-16-2016 07:03 PM |
| | 1074 | 10-07-2016 05:24 PM |
06-29-2016 01:11 PM
Hi @Jan Kytara. Can you please update the statistics on the table? Run this command: analyze table corrupt_rows compute statistics; Also, I would love to know whether "select * from corrupt_rows limit nnn;" returns properly formed rows with columns A..L, or whether it contains junk characters or misaligned column boundaries. That could point to a delimiter issue.
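For reference, a minimal sketch of those two checks in HiveQL (LIMIT 100 is just an arbitrary stand-in for nnn):

-- Refresh the table statistics used by the optimizer.
ANALYZE TABLE corrupt_rows COMPUTE STATISTICS;

-- Spot-check some rows; if columns A..L come back misaligned or full of junk,
-- suspect the field delimiter. (100 here is only an example row count.)
SELECT * FROM corrupt_rows LIMIT 100;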
06-28-2016 03:13 PM
Hi @Avinash P. Can you please confirm that table dg_2 does have a column called 'phone'?
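A quick way to check, assuming Hive (this simply lists the table's columns so you can look for 'phone'):

DESCRIBE dg_2;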
06-21-2016 04:59 PM
2 Kudos
Hi @Kaliyug Antagonist. The answers above from @slachterman and @mqureshi are excellent. Here is another, higher-level way to look at this problem - some tips for planning a DR strategy for the smoldering-datacenter scenario mentioned above.

1. Use the term Disaster Recovery instead of Backup. This moves administrators away from the RDBMS-like idea that they can simply run a script and recover the entire cluster.

2. Discuss RTO/RPO and let the business's answers drive the architecture. RTO and RPO requirements need to be defined by the business - these requirements drive all decisions around Disaster Recovery. A 1-hour/1-hour RTO/RPO is wildly different (in cost and architecture) from a 2-week/1-day RTO/RPO. When the business chooses the RTO/RPO requirements, it is also choosing the required cost and architecture. Well-defined RTO/RPO requirements help you avoid both an over-engineered solution (which may be far too expensive) and an under-engineered solution (which may fail precisely when you need it most - during a disaster event).

3. 'Band' your data assets into different categories for RTO/RPO purposes. Example: Band 1 = 1-hour RTO, Band 2 = 1-day RTO, Band 3 = 1-week RTO, Band 4 = 1-month RTO, Band 5 = not required in the event of a disaster. You would be surprised how much data can wait after a severe crash. For example, datasets used only for a report distributed once per month should never require a 1-hour RTO.

Hope that helps.
06-16-2016 04:17 PM
SELECT TO_DATE(created_at),
       DATEDIFF(TO_DATE(CURRENT_DATE()), TO_DATE(created_at)) AS delay,
       COUNT(*) AS NumberOfOrders
FROM magentodb.sales_flat_order
WHERE status IN ('packed', 'cod_confirmed')
GROUP BY TO_DATE(created_at),
         DATEDIFF(TO_DATE(CURRENT_DATE()), TO_DATE(created_at));
06-16-2016 03:06 PM
1 Kudo
Hi @Simran Kaur. In your query you are trying to GROUP BY "TO_DATE(created_at)", but the select statement does not retrieve that expression; you are retrieving "created_at" and "DATEDIFF(TO_DATE(current_date()), TO_DATE(sales_flat_order.created_at))". If you add "TO_DATE(created_at)" to your select list, or change your select list to use "TO_DATE(created_at)" instead of "created_at", it should work.
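In sketch form, using the table from the question (the full corrected query is in the follow-up reply above):

-- Fails: created_at is selected, but the GROUP BY key is TO_DATE(created_at).
SELECT created_at, COUNT(*) FROM magentodb.sales_flat_order GROUP BY TO_DATE(created_at);

-- Works: the grouped expression itself appears in the select list.
SELECT TO_DATE(created_at), COUNT(*) FROM magentodb.sales_flat_order GROUP BY TO_DATE(created_at);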
04-29-2016 01:04 PM
Hi @Ludovic Rouleau. There is one known bug that could be causing this. Can you rule this out?

COMPONENT: Hive
VERSION: HDP 2.2.4 (Hive 0.14 + patches)
REFERENCE: BUG-35305
PROBLEM: With a CTAS query or an INSERT INTO TABLE query, after the job finishes, data is moved into the destination table with a hadoop distcp job.
IMPACT: Hive insert queries get slow.
SYMPTOMS: Hive insert queries get slow.
WORKAROUND: N/A
SOLUTION: By default, fs.hdfs.impl.disable.cache is set to false from HDP 2.2.4 onward. The issue is observed on upgrades to HDP 2.2.4 when this configuration was left set to true in hive-site.xml; set it back to false: fs.hdfs.impl.disable.cache=false. (A value of true was recommended on HDP 2.2.0 to avoid a HiveServer2 OutOfMemory issue.)
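For reference, a minimal sketch of how that one property would look in hive-site.xml (just the property named above, not a full config file):

<!-- hive-site.xml: keep the HDFS FileSystem object cache enabled (HDP 2.2.4+). -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>false</value>
</property>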
04-28-2016 01:53 PM
Hi @Tom McCuch and @vamsi valiveti. Just wanted to clarify - it is legal to have two bucketed tables where the number of buckets in one table is a multiple of the number of buckets in the other, but for pragmatic performance reasons it is best to have the number of buckets be the same. IMHO, if you are going to bucket your data, you are doing it because you need a more efficient join - and having a non-matching number of buckets removes the ability to do a sort-merge bucket join. See this post on bucket join versus sort-merge bucket join; it's very good: http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables
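As a sketch of that point (the table and column names below are made up; the SET commands are the usual Hive knobs of that era for enabling sort-merge bucket joins):

-- Both tables bucketed AND sorted on the join key, with the SAME bucket count.
CREATE TABLE orders (order_id INT, customer_id INT)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 8 BUCKETS;

CREATE TABLE customers (customer_id INT, name STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 8 BUCKETS;

-- Session settings that allow Hive to pick a sort-merge bucket (SMB) join.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- With matching bucket counts and sort order, the join can proceed bucket by bucket.
SELECT o.order_id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id;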
04-22-2016 06:07 PM
Hi Andrea. I suspect the issue is as Kuldeep mentioned... But please do send the DDL and query - I would be glad to try to recreate it. bpreachuk@hortonworks.com
04-22-2016 05:29 PM
2 Kudos
Partitioning by month is perfectly acceptable, especially if the data arrives on a monthly basis. @Joseph Niemiec has written a great writeup on why you should use single Hive partition keys like YYYYMMDD, YYYY-MM-DD, YYYYMM, or YYYY-MM, instead of nested partitions like YYYY/MM/DD or YYYY/MM.
The reason is fairly simple - a single partition key makes for simpler querying with LIKE, IN, and BETWEEN, and the Hive optimizer can do partition pruning on those queries.
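A minimal sketch of the difference (the table and column names are made up):

-- Single partition key: the whole date lives in one string column.
CREATE TABLE sales (item STRING, amount DOUBLE)
PARTITIONED BY (sale_date STRING);  -- values like '2016-04-22' or '201604'

-- Range predicates on the single key are simple, and Hive can prune partitions.
SELECT item, amount FROM sales
WHERE sale_date BETWEEN '2016-01-01' AND '2016-03-31';

-- The nested alternative (avoid): PARTITIONED BY (yr STRING, mo STRING, dy STRING)
-- makes the same date-range query much more awkward to express.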
04-20-2016 09:17 PM
Hi Andrea. Could you post the UNION ALL query you are executing, along with the DDL for the tables involved (via SHOW CREATE TABLE <tblname>)? That would help the debugging...