Member since: 02-15-2019
Posts: 6
Kudos Received: 3
Solutions: 0
02-03-2020 01:01 PM
At my company, we are doing a lot of research on the Data Vault 2.0 data model and building some PoCs, but there isn't much shared experience on the web. I'm concerned about data replication (keeping the data history in the Enterprise layer almost duplicates the data already in our data lake). Even though this is not exclusively DV-related, joins are costly and time-consuming with Hive and even with Impala. We have already developed PySpark applications to reduce the time of these joins, getting interesting improvements, trying to achieve better times for building the staging, enterprise and data access layers. We are already using Parquet files and partitioning. I would appreciate any experience you can share with me.
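To give an idea, this is a minimal sketch of the kind of hub/satellite join our PySpark jobs tune; the table names, paths and partition column are just placeholders, and the broadcast hint only makes sense when the smaller side fits in executor memory:

```python
# Minimal PySpark sketch of a Data Vault style join feeding the access layer.
# Table names, paths and the partition column are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("dv-enterprise-layer-join")
         .getOrCreate())

# Enterprise-layer data, already stored as partitioned Parquet.
hub_customer = spark.read.parquet("/data/enterprise/hub_customer")
sat_customer = spark.read.parquet("/data/enterprise/sat_customer")

# Broadcasting the smaller side avoids a full shuffle when it fits in memory.
joined = hub_customer.join(
    F.broadcast(sat_customer),
    on="customer_hash_key",
    how="left",
)

# Write back partitioned by load date so Hive/Impala can prune partitions.
(joined.write
 .mode("overwrite")
 .partitionBy("load_date")
 .parquet("/data/access/customer_current"))
```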
Labels:
- Apache Hive
- Apache Impala
- Apache Spark
10-09-2019 10:18 AM
1 Kudo
Great, thanks a lot for your answers @TimothySpann. SRM seems great and works out of the box! In my case, the proposed architecture is based on 2 hot clusters, each with its own Kafka brokers, but each consuming independently from the sources. If the primary Kafka cluster breaks, the secondary Kafka cluster has to keep ingesting data from the sources, avoiding (or minimizing) downtime and data loss. As far as I can see, even with SRM, if the primary Kafka cluster breaks we are still in the situation where the secondary Kafka cluster has to keep ingesting on its own so that no data is lost.
10-04-2019 02:46 PM
Hello, I'm looking at your answer 3 years later because I'm in a similar situation :). At my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync. We ingest data in real time with Kafka + Spark Streaming, load it into HDFS and consume it with Hive/Impala. As a first approach, I'm thinking of running simple counts on the Hive/Impala tables on both clusters every hour/half hour and comparing them (see the sketch below). If something is missing on one of the clusters, we would "manually" re-ingest the missing data (or copy it with Cloudera BDR from one cluster to the other) and re-process the enriched data. I'm wondering if you have dealt with similar scenarios or have any suggestions. Thanks in advance!
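As an illustration of that first approach, here is a minimal sketch of the hourly count comparison, assuming the Impala daemons on both clusters are reachable via impyla; the host names, table list and partition column are made-up placeholders:

```python
# Minimal sketch of the hourly count comparison between the two clusters.
# Host names, table names and the load_hour partition column are placeholders.
from impala.dbapi import connect

TABLES = ["traffic.cdr_voice", "traffic.cdr_data"]

def hourly_count(host, table, load_hour):
    """Count rows for one load hour on a given Impala daemon."""
    conn = connect(host=host, port=21050)
    try:
        cur = conn.cursor()
        cur.execute(
            f"SELECT COUNT(*) FROM {table} WHERE load_hour = '{load_hour}'"
        )
        return cur.fetchone()[0]
    finally:
        conn.close()

def compare_clusters(load_hour):
    """Flag tables whose counts differ between the primary and secondary cluster."""
    for table in TABLES:
        primary = hourly_count("impalad.cluster1.example.com", table, load_hour)
        secondary = hourly_count("impalad.cluster2.example.com", table, load_hour)
        if primary != secondary:
            print(f"MISMATCH {table} {load_hour}: {primary} vs {secondary}")

compare_clusters("2019-10-04 14:00")
```

In practice this would run on a schedule (cron, Oozie, etc.) and trigger the re-ingest or BDR copy only for the mismatched tables and hours.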
10-04-2019 02:32 PM
Hello @TimothySpann, 3 years after your post I'm in a similar situation :). Could you let me know how you solved it? At my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync.