Member since: 02-15-2019
Posts: 6
Kudos Received: 2
Solutions: 0
02-03-2020
01:01 PM
At my company, we are researching the Data Vault 2.0 data model extensively and building some PoCs, but there isn't much experience shared on the web. I'm concerned about data replication (keeping data history in the Enterprise layer almost replicates the data already in our data lake). Even though this is not exclusively DV related, joins are costly and time-consuming with Hive and even with Impala. We have already developed PySpark applications to reduce the time of these joins, getting interesting improvements as we try to achieve better times for constructing the staging, enterprise, and data access layers. We are already using Parquet files and partitioning. I would appreciate any experience you can share with me.
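For illustration, a minimal PySpark sketch of the kind of join optimization we are testing: broadcasting the small side of the join and writing the result as date-partitioned Parquet. The paths, table, and column names are assumptions for the example, not our actual model:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dv-layer-build").getOrCreate()

# Hypothetical staging inputs: a large detail table and a small lookup/hub table.
detail = spark.read.parquet("/data/staging/detail")
lookup = spark.read.parquet("/data/staging/lookup")

# Broadcasting the small side avoids shuffling the large table across the
# cluster, which is where most of the cost of a shuffle join goes.
joined = detail.join(F.broadcast(lookup), on="business_key", how="left")

# Write the enterprise-layer output partitioned by load date so downstream
# queries can prune partitions instead of scanning everything.
(joined.write
    .mode("overwrite")
    .partitionBy("load_date")
    .parquet("/data/enterprise/detail_enriched"))
```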
12-16-2019
10:54 AM
1 Kudo
Hello,
I need some guidance on how to model the partitioning of invoice headers and details using Hive.
I plan to store the header table partitioned by year, month, and day (e.g., year=2019/month=12/day=2).
Detail table:
- Around 200 million records per day
- Has a (unique) invoice number to relate it to headers, but no dates in the detail itself
I am thinking about two options for the detail table:
1. Partition it by year, month, and day as well. This requires a join with the headers during the data ingestion process, before storing. In this scenario, joins between header and detail could still be penalized because the join is based on the invoice number.
2. Partition it by a fixed-length prefix of the (unique) invoice number (e.g., for invoice number 123456789 the partition would be "12345"). I think this approach could be better because invoice numbers are sequential and thus map "tightly" to dates (the partition scheme of the header table).
I also need to handle corner (dirty) cases, such as invoice details with no valid invoice id; I'm thinking about a dedicated partition for these. A sketch of both layouts follows below.
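To make the two options concrete, here is a minimal PySpark/Hive DDL sketch of both layouts; the table names, columns, and the sentinel value for dirty rows are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("invoice-partitioning")
         .enableHiveSupport()
         .getOrCreate())

# Option 1: detail partitioned by date. Requires a join with headers at ingest
# time to pick up the date, since detail rows carry no date of their own.
spark.sql("""
    CREATE TABLE IF NOT EXISTS invoice_detail_by_date (
        invoice_number BIGINT,
        line_item      STRING,
        amount         DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
""")

# Option 2: detail partitioned by a fixed-length prefix of the invoice number,
# e.g. substr(cast(invoice_number AS STRING), 1, 5). No ingest-time join needed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS invoice_detail_by_prefix (
        invoice_number BIGINT,
        line_item      STRING,
        amount         DECIMAL(18,2)
    )
    PARTITIONED BY (invoice_prefix STRING)
    STORED AS PARQUET
""")

# Rows with no valid invoice id could be routed to a dedicated sentinel
# partition, e.g. invoice_prefix = '__invalid__', keeping the real ranges clean.
```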
Does somebody have some suggestions or experience with a related scenario?
Thanks a lot
- Tags:
- Hive
10-09-2019
10:18 AM
Great, thanks a lot for your answers @TimothySpann. SRM seems great and works out of the box! In my case, the proposed architecture is based on 2 hot clusters, each with its own Kafka brokers but each consuming independently from the sources. If the primary Kafka cluster breaks, the secondary Kafka cluster has to keep ingesting data from the sources, with no (or minimal) downtime and data loss. As far as I can see, even with SRM, if the primary Kafka cluster breaks we are still in the situation where the secondary Kafka cluster has to keep ingesting and no data should be lost.
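For reference, a minimal MirrorMaker 2-style properties sketch of a one-way replication flow (SRM is built on MM2); the cluster aliases, bootstrap addresses, and topic pattern are assumptions:

```properties
# Cluster aliases and their bootstrap servers (addresses are examples)
clusters = primary, secondary
primary.bootstrap.servers = primary-broker-1:9092
secondary.bootstrap.servers = secondary-broker-1:9092

# Replicate every topic from primary to secondary
primary->secondary.enabled = true
primary->secondary.topics = .*
```

Note this only covers topic replication, not the dual-ingest failover itself: in our design both clusters keep consuming from the sources independently, so replication would be a complement rather than a substitute.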
10-04-2019
02:46 PM
Hello, I'm looking at your answer 3 years later because I'm in a similar situation :). At my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync. We ingest data in real time with Kafka + Spark Streaming, load it into HDFS, and consume it with Hive/Impala.
As a first approach, I'm thinking about running simple counts on the Hive/Impala tables of both clusters every hour or half hour and comparing them. If something is missing in one of the clusters, we would have to "manually" re-ingest the missing data (or copy it with Cloudera BDR from one cluster to the other) and re-process the enriched data. I'm wondering if you have dealt with similar scenarios, or have any suggestions. Thanks in advance!
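A minimal sketch of that hourly reconciliation, assuming a hypothetical events table partitioned by year/month/day with an hour column; the same query would run on both clusters and the outputs would be diffed:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dual-cluster-recon")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table/columns; run on each cluster for the same time window.
counts = spark.sql("""
    SELECT year, month, day, hour, COUNT(*) AS row_cnt
    FROM events_raw
    WHERE year = 2019 AND month = 10 AND day = 4
    GROUP BY year, month, day, hour
    ORDER BY year, month, day, hour
""")

# Persist the per-hour counts so an external job can diff primary vs. secondary
# and flag hours to re-ingest (or copy over with BDR) and re-process.
counts.write.mode("overwrite").csv("/tmp/recon/2019-10-04")
```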
10-04-2019
02:32 PM
Hello @TimothySpann, 3 years after your post I'm in a similar situation :). Would you let me know how you solved it? At my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync.