Member since 02-15-2019 · 6 Posts · 3 Kudos Received · 0 Solutions
12-13-2021 04:25 AM
I would say my experience of doing the same has not been pleasant. To make things worse, I inherited it from an existing team, where the previous architect finally understood his blunder and left the organization. What he left behind was a poorly architected DV 2.0 on Hive 1.1, and the solution was extremely slow.
- We have now ended up storing the data in Parquet format.
- The never-ending slow jobs have been moved to PySpark, but they still take hours to complete even for a few million records.
- With no ACID support on Hive 1.1, all dedupes are done outside Hive using Spark (sketched below).
- We still keep running into duplicate issues in multi-satellite scenarios.
- The joins are still very time-consuming.
- Overall, it was never a good decision to implement a highly normalized data model like DV 2.0 on Hive, which is optimized for denormalized data.
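To give an idea of the Spark-side dedupe, here is a minimal PySpark sketch of the usual Data Vault approach (keep a satellite row only when its hash diff changes from the previous row for the same hub key). The paths, table, and column names (hub_customer_hk, hash_diff, load_dts) are illustrative, not our actual model:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sat-dedupe").getOrCreate()

# Read the raw satellite landing data (Parquet, as noted above).
sat = spark.read.parquet("/data/raw_vault/sat_customer")

# Keep only rows whose hash_diff differs from the previous row for
# the same hub key, ordered by load timestamp. This removes the
# duplicates that Hive 1.1 (no ACID, no MERGE) cannot handle itself.
w = Window.partitionBy("hub_customer_hk").orderBy("load_dts")
deduped = (
    sat.withColumn("prev_hash_diff", F.lag("hash_diff").over(w))
       .where(
           F.col("prev_hash_diff").isNull()
           | (F.col("hash_diff") != F.col("prev_hash_diff"))
       )
       .drop("prev_hash_diff")
)

# Overwrite the curated satellite; the external Hive table points here.
deduped.write.mode("overwrite").parquet("/data/vault/sat_customer")
```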
10-09-2019 10:18 AM · 1 Kudo
Great, thanks a lot for your answers @TimothySpann. SRM seems great and works out of the box! In my case, the proposed architecture is based on 2 hot clusters, each with its own Kafka brokers, but each consuming independently from the sources. If the primary Kafka cluster breaks, the secondary Kafka cluster has to keep ingesting data from the sources, avoiding (or minimizing) downtime and data loss. As far as I can see, even with SRM, if the primary Kafka cluster breaks, the secondary cluster still has to keep ingesting on its own so that no data is lost.
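For context, the out-of-the-box behaviour I mean is SRM's MirrorMaker 2 style replication flows; a minimal sketch, assuming two clusters aliased primary and secondary (broker addresses and the topic pattern are placeholders):

```
# Two clusters, each with its own brokers (aliases and hosts are placeholders)
clusters = primary, secondary
primary.bootstrap.servers = primary-broker1:9092
secondary.bootstrap.servers = secondary-broker1:9092

# Replicate everything from primary to secondary so the standby
# already holds the data if the primary goes down
primary->secondary.enabled = true
primary->secondary.topics = .*
```

This covers replication of data already in the primary cluster, but it doesn't by itself make the secondary cluster take over ingestion from the sources, which is exactly the gap I'm describing.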
10-04-2019 02:46 PM
Hello, I'm looking at your answer 3 years later because I'm in a similar situation :). In my company (a telco) we're planning to use 2 hot clusters with dual ingest, because our RTO is demanding and we're looking for mechanisms to monitor both clusters and keep them in sync. We ingest data in real time with Kafka + Spark Streaming, load it to HDFS, and consume it with Hive/Impala. As a first approach, I'm thinking of running simple counts on the Hive/Impala tables on both clusters every hour/half hour and comparing them (sketched below). If something is missing in one of the clusters, we would have to "manually" re-ingest the missing data (or copy it with Cloudera BDR from one cluster to the other) and re-process the enriched data. I'm wondering whether you have dealt with similar scenarios, or have any suggestions. Thanks in advance!
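To make the reconciliation idea concrete, here is a minimal Python sketch, assuming Impala daemons on both clusters reachable via impyla (the hostnames, the events table, and the ingest_hour column are illustrative, not a real schema):

```python
from impala.dbapi import connect

# One Impala endpoint per cluster (hosts are placeholders)
CLUSTERS = {
    "primary": "impala-primary.example.com",
    "secondary": "impala-secondary.example.com",
}

# Count rows for a given ingestion hour on each cluster.
SQL = "SELECT COUNT(*) FROM events WHERE ingest_hour = '{hour}'"

def hourly_counts(hour):
    counts = {}
    for name, host in CLUSTERS.items():
        conn = connect(host=host, port=21050)
        cur = conn.cursor()
        cur.execute(SQL.format(hour=hour))
        counts[name] = cur.fetchone()[0]
        conn.close()
    return counts

counts = hourly_counts("2019-10-04 13:00")
if counts["primary"] != counts["secondary"]:
    # Flag for manual re-ingest or a BDR copy of the missing partition
    print("MISMATCH:", counts)
```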