Member since: 09-25-2015 | Posts: 112 | Kudos Received: 88 | Solutions: 12
My Accepted Solutions
Views | Posted |
---|---|
9709 | 03-15-2017 03:17 PM |
6120 | 02-17-2017 01:03 PM |
1842 | 01-02-2017 10:47 PM |
2740 | 11-16-2016 07:03 PM |
1103 | 10-07-2016 05:24 PM |
06-29-2016
01:11 PM
Hi @Jan Kytara. Can you please update statistics on the table by running the command: analyze table corrupt_rows compute statistics; I would also love to know whether "select * from corrupt_rows limit nnn;" returns properly formed rows with columns A..L, or whether the rows contain junk or misaligned column boundaries. That could point to a delimiter issue.
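The checks above can be run together as a short HiveQL sketch. The table name comes from the post; the sample size of 10 is an arbitrary stand-in for "nnn", and the DESCRIBE FORMATTED step is an extra suggestion (not part of the original request) for seeing which field delimiter the table was declared with.

```sql
-- Refresh table-level statistics (statement from the post above).
ANALYZE TABLE corrupt_rows COMPUTE STATISTICS;

-- Inspect a small sample; 10 is an arbitrary stand-in for "nnn".
-- Look for values shifted into the wrong column or trailing NULLs.
SELECT * FROM corrupt_rows LIMIT 10;

-- Show the storage details recorded for the table, including the
-- SerDe and the field delimiter declared in the DDL.
DESCRIBE FORMATTED corrupt_rows;
```

If the declared delimiter does not match the one actually used in the data files, rows will typically show merged or shifted columns in the sample.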
06-28-2016
03:13 PM
Hi @Avinash P. Can you please confirm that table dg_2 does have a column called 'phone'?
06-21-2016
04:59 PM
2 Kudos
Hi @Kaliyug Antagonist. The answers above from @slachterman and @mqureshi are excellent. Here is another, higher-level way to look at this problem: some tips for planning a DR strategy for the smoldering datacenter problem mentioned above.
1. Use the term Disaster Recovery instead of Backup. This gets the administrators to move away from the RDBMS-like idea that they can simply run a script and recover the entire cluster.
2. Discuss RTO/RPO and let the business answers drive the architecture. RTO and RPO requirements need to be defined by the business - these requirements drive all decisions around Disaster Recovery. A 1-hour/1-hour RTO/RPO is wildly different (in cost and architecture) from a 2-week/1-day RTO/RPO. When the business chooses the RTO/RPO requirements, it is also choosing the required cost and architecture. By having well-defined RTO/RPO requirements you will avoid an over-engineered solution (which may be far too expensive) and also avoid an under-engineered solution (which may fail precisely when you need it most - during a disaster event).
3. 'Band' your data assets into different categories for RTO/RPO purposes. Example: Band 1 = 1 hour RTO, Band 2 = 1 day RTO, Band 3 = 1 week RTO, Band 4 = 1 month RTO, Band 5 = not required in the event of a disaster. You would be surprised how much data can wait in the event of a SEVERE crash. For example, datasets that are used to provide a report that is distributed once per month should never require a 1-hour RTO.
Hope that helps.
06-16-2016
04:17 PM
select TO_DATE(created_at),
DATEDIFF(TO_DATE(current_date()), TO_DATE(sales_flat_order.created_at)) as delay,
count(*) as NumberOfOrders
FROM
magentodb.sales_flat_order
WHERE
status IN ( 'packed' , 'cod_confirmed' )
GROUP BY TO_DATE(created_at),
DATEDIFF(TO_DATE(current_date()), TO_DATE(sales_flat_order.created_at))
06-16-2016
03:06 PM
1 Kudo
Hi @Simran Kaur. In your query you are trying to Group By "TO_DATE(created_at)", but the select statement does not retrieve that expression. You are retrieving "created_at" and "DATEDIFF(TO_DATE(current_date()), TO_DATE(sales_flat_order.created_at))". If you add "TO_DATE(created_at)" to your select list, or change your select list to use "TO_DATE(created_at)" instead of "created_at", it should work.
04-29-2016
01:04 PM
Hi @Ludovic Rouleau. There is one known bug that could be causing this. Can you rule this out?
COMPONENT: Hive
VERSION: HDP 2.2.4 (Hive 0.14 + patches)
REFERENCE: BUG-35305
PROBLEM: With a CTAS query or an INSERT INTO TABLE query, after the job finishes, data is moved into the destination table with a hadoop distcp job.
IMPACT: Hive insert queries get slow
SYMPTOMS: Hive insert queries get slow
WORK AROUND: N/A
SOLUTION: This issue is observed on upgrades to HDP 2.2.4 if the following configuration is set to true in hive-site.xml. Set it to false, which is the default from HDP 2.2.4 onward:
fs.hdfs.impl.disable.cache=false
Setting this value to true was recommended for HDP 2.2.0 to avoid a HiveServer2 OutOfMemory issue.
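One quick way to rule this out is to check the value your session actually sees. This is only a sketch: it assumes you can open a Hive CLI or Beeline session against the affected cluster, and it relies on the standard behavior that SET with a key and no value prints the current setting.

```sql
-- Print the effective value of the property named in the bug note.
-- The output is either the configured value or a message that the
-- property is undefined in this session.
SET fs.hdfs.impl.disable.cache;
```

If this reports true, the permanent fix is still the hive-site.xml change described above; the statement here only reads the value.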
04-28-2016
01:53 PM
Hi @Tom McCuch and @vamsi valiveti. Just wanted to clarify - it is legal to have two bucketed tables where the number of buckets in one table is a multiple of the number of buckets in the other table, but for pragmatic performance reasons it is best to have the number of buckets be the same. IMHO, if you are going to bucket your data, you are doing it because you need a more efficient join, and having a non-matching number of buckets removes the ability to do a sort-merge bucket join. See this post on bucket join versus sort-merge bucket join. It's very good: http://stackoverflow.com/questions/20199077/hive-efficient-join-of-two-tables
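As a minimal sketch of the matching-bucket setup described above: the table names, column names, and the bucket count of 32 are all hypothetical, and the session settings shown are the ones commonly associated with sort-merge bucket joins. Which of them are needed (or still exist) depends on your Hive version, so treat them as assumptions to verify.

```sql
-- Two hypothetical tables, bucketed AND sorted on the join key,
-- with the same number of buckets.
CREATE TABLE customers_bucketed (
  customer_id BIGINT,
  name        STRING
)
CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE orders_bucketed (
  customer_id BIGINT,
  order_total DOUBLE
)
CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 32 BUCKETS
STORED AS ORC;

-- Settings commonly used to let the optimizer choose a sort-merge
-- bucket join (assumption: check them against your Hive version).
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Join on the bucketing/sorting key.
SELECT o.customer_id, c.name, o.order_total
FROM orders_bucketed o
JOIN customers_bucketed c
  ON o.customer_id = c.customer_id;
```

Running EXPLAIN on the join is a reasonable way to confirm which join strategy the optimizer actually picked.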
04-22-2016
06:07 PM
Hi Andrea. I suspect the issue is as Kuldeep mentioned... But please do send the DDL and query - I would be glad to try to recreate it. bpreachuk@hortonworks.com
04-22-2016
05:29 PM
2 Kudos
Partitioning by month is perfectly acceptable, especially if the data comes in on a monthly basis. @Joseph Niemiec has a great writeup on why you should use single Hive partitions like YYYYMMDD, YYYY-MM-DD, YYYYMM, YYYY-MM. Do this instead of nested partitions like YYYY/MM/DD or YYYY/MM.
The reason is fairly simple - it makes for simpler querying with LIKE, IN and BETWEEN... and the Hive optimizer can do partition pruning on those queries.
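To illustrate, here is a small sketch of a table with a single YYYY-MM partition column and the three query styles mentioned above. The table name sales_by_month and its columns are hypothetical.

```sql
-- Hypothetical table with one string partition column in YYYY-MM form.
CREATE TABLE sales_by_month (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (sales_month STRING)
STORED AS ORC;

-- A range of months: string comparison works because YYYY-MM
-- sorts in chronological order.
SELECT SUM(amount)
FROM sales_by_month
WHERE sales_month BETWEEN '2016-01' AND '2016-03';

-- A range that crosses a year boundary stays a simple list.
SELECT SUM(amount)
FROM sales_by_month
WHERE sales_month IN ('2015-12', '2016-01');

-- Everything in one year.
SELECT SUM(amount)
FROM sales_by_month
WHERE sales_month LIKE '2016-%';
```

With nested year/month partitions, the second query would need a predicate along the lines of (yr = 2015 AND mth = 12) OR (yr = 2016 AND mth = 1), which gets awkward quickly for longer ranges.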
04-20-2016
09:17 PM
Hi Andrea. Could you post the UNION ALL query you are executing along with the DDL for the tables involved (via 'show create table <tblname>')? That would help with the debugging...