Member since: 02-15-2019
Posts: 6
Kudos Received: 2
Solutions: 0
02-03-2020
01:01 PM
At my company, we are researching the Data Vault 2.0 data model extensively and building some PoCs, but there isn't much experience shared on the web. I'm concerned about data replication (keeping data history in the Enterprise layer almost replicates the data already in our data lake). Even though this is not exclusively DV related, joins are costly and time-consuming with Hive and even with Impala. We have already developed PySpark applications to reduce the time of these joins, getting interesting improvements as we try to achieve better times for constructing the staging, enterprise, and data access layers. We are already using Parquet files and partitioning. I would appreciate any experience you can share with me.
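For illustration, a minimal PySpark sketch of the kind of join optimization we are testing: broadcasting the small side of the join and writing the result as date-partitioned Parquet. The paths, table, and column names are assumptions for the example, not our actual model:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dv-layer-build").getOrCreate()

# Hypothetical staging inputs: a large detail table and a small lookup/hub table.
detail = spark.read.parquet("/data/staging/detail")
lookup = spark.read.parquet("/data/staging/lookup")

# Broadcasting the small side avoids shuffling the large table across the
# cluster, which is where most of the cost of a shuffle join goes.
joined = detail.join(F.broadcast(lookup), on="business_key", how="left")

# Write the enterprise-layer output partitioned by load date so downstream
# queries can prune partitions instead of scanning everything.
(joined.write
    .mode("overwrite")
    .partitionBy("load_date")
    .parquet("/data/enterprise/detail_enriched"))
```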
12-16-2019
10:54 AM
1 Kudo
Hello,
I need some guidance on how to model the partitioning of invoice headers and details using Hive.
I plan to store the header table partitioned by year, month, and day (e.g., year=2019/month=12/day=2).
Detail table:
- Around 200 million records per day
- Has a (unique) invoice number to relate it to headers, but no dates in the detail itself
I am thinking about two options for the detail table:
1. Partition it by year, month, and day as well. This requires a join with the headers during the data ingestion process, before storing. In this scenario, joins between header and detail could still be penalized because the join is based on the invoice number.
2. Partition it by a fixed-length prefix of the (unique) invoice number (e.g., for invoice number 123456789 the partition would be "12345"). I think this approach could be better because invoice numbers are sequential and thus map "tightly" to dates (the partition scheme of the header table).
I also need to handle corner (dirty) cases, such as invoice details with no valid invoice id; I'm thinking about a dedicated partition for these. A sketch of both layouts follows below.
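To make the two options concrete, here is a minimal PySpark/Hive DDL sketch of both layouts; the table names, columns, and the sentinel value for dirty rows are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("invoice-partitioning")
         .enableHiveSupport()
         .getOrCreate())

# Option 1: detail partitioned by date. Requires a join with headers at ingest
# time to pick up the date, since detail rows carry no date of their own.
spark.sql("""
    CREATE TABLE IF NOT EXISTS invoice_detail_by_date (
        invoice_number BIGINT,
        line_item      STRING,
        amount         DECIMAL(18,2)
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
""")

# Option 2: detail partitioned by a fixed-length prefix of the invoice number,
# e.g. substr(cast(invoice_number AS STRING), 1, 5). No ingest-time join needed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS invoice_detail_by_prefix (
        invoice_number BIGINT,
        line_item      STRING,
        amount         DECIMAL(18,2)
    )
    PARTITIONED BY (invoice_prefix STRING)
    STORED AS PARQUET
""")

# Rows with no valid invoice id could be routed to a dedicated sentinel
# partition, e.g. invoice_prefix = '__invalid__', keeping the real ranges clean.
```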
Does somebody have some suggestions or experience with a related scenario?
Thanks a lot
- Tags:
- Hive
10-09-2019
10:18 AM
Great, thanks a lot for your answers @TimothySpann. SRM seems great and works out of the box! In my case, the proposed architecture is based on 2 hot clusters, each with its own Kafka brokers but each consuming independently from the sources. If the primary Kafka cluster breaks, the secondary Kafka cluster has to keep ingesting data from the sources, with no (or minimal) downtime and data loss. As far as I can see, even with SRM, if the primary Kafka cluster breaks we are still in the situation where the secondary Kafka cluster has to keep ingesting and no data should be lost.
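For reference, a minimal MirrorMaker 2-style properties sketch of a one-way replication flow (SRM is built on MM2); the cluster aliases, bootstrap addresses, and topic pattern are assumptions:

```properties
# Cluster aliases and their bootstrap servers (addresses are examples)
clusters = primary, secondary
primary.bootstrap.servers = primary-broker-1:9092
secondary.bootstrap.servers = secondary-broker-1:9092

# Replicate every topic from primary to secondary
primary->secondary.enabled = true
primary->secondary.topics = .*
```

Note this only covers topic replication, not the dual-ingest failover itself: in our design both clusters keep consuming from the sources independently, so replication would be a complement rather than a substitute.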
10-04-2019
02:46 PM
Hello, I'm looking at your answer 3 years later because I'm in a similar situation :). At my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync. We ingest data in real time with Kafka + Spark Streaming, load it into HDFS, and consume it with Hive/Impala.
As a first approach, I'm thinking about running simple counts on the Hive/Impala tables of both clusters every hour or half hour and comparing them. If something is missing in one of the clusters, we would have to "manually" re-ingest the missing data (or copy it with Cloudera BDR from one cluster to the other) and re-process the enriched data. I'm wondering if you have dealt with similar scenarios, or have any suggestions. Thanks in advance!
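A minimal sketch of that hourly reconciliation, assuming a hypothetical events table partitioned by year/month/day with an hour column; the same query would run on both clusters and the outputs would be diffed:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dual-cluster-recon")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical table/columns; run on each cluster for the same time window.
counts = spark.sql("""
    SELECT year, month, day, hour, COUNT(*) AS row_cnt
    FROM events_raw
    WHERE year = 2019 AND month = 10 AND day = 4
    GROUP BY year, month, day, hour
    ORDER BY year, month, day, hour
""")

# Persist the per-hour counts so an external job can diff primary vs. secondary
# and flag hours to re-ingest (or copy over with BDR) and re-process.
counts.write.mode("overwrite").csv("/tmp/recon/2019-10-04")
```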
10-04-2019
02:32 PM
Hello @TimothySpann, 3 years after your post I'm in a similar situation :). Would you let me know how you solved it? At my company (a telco) we're planning to use 2 hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync.