
 

What to ask before making a Data Flow

 

When taking on a new responsibility for designing and maintaining data flows, what are the main questions one should ask to ensure a good outcome?

 

Here I list the key questions for important topics, as well as an illustration of what typically goes wrong if the questions are not asked.

 

The most important points if you are under pressure to deliver

 

  1. Location
    • The questions: Where is the data, where should it go, and where can I process it? And of course: do I have the required access?
    • The Nightmare: Data is spread across multiple systems, and one of these may not have been identified. After you finally figure out which tables you need, you try to start and find you don’t have access. When you finally get the data, you either don’t have a compliant place to put it or you are missing a tool. Finally you have the data, but it is unclear how to get it written to the target. In the end a 3-day job takes 6 weeks.
  2. Context
    • The questions: What is the data, and who understands the source/target?
    • The Nightmare: You want to supply revenue data during business hours. First of all you get access to multiple tables, each containing various numbers which might be the revenue. After figuring out which one is the revenue, it turns out you have transactions from multiple time zones, both in and out of summer time, which need to be reconciled before moving the data into the target application. Finally it turns out the target application needs fields not to be NULL, and you have no idea what will happen if you use a wrong default.
  3. Process
    • The questions: Who writes the specifications, and who accepts the results? How do you deal with changing requirements (or, as it may be phrased, requirements you did not understand correctly)? How do you escalate if you are not put in circumstances where you can succeed?
    • The Nightmare: The requirements are not completely clear. You build something and get feedback that you need to change one thing. After that, you need to change another thing. It is unclear whether these are refinements (from your perspective) or fixes (from their perspective); however, when the deadline is not met, it is clear where the finger will be pointed.
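To make the Context nightmare above concrete, here is a minimal Python sketch of the two checks it implies: converting local timestamps from several time zones (in and out of summer time) to UTC, and setting aside records with missing required fields instead of guessing a default. The field names, the sample transactions and the list of required fields are hypothetical, and the sketch assumes Python 3.9+ with a time zone database available to `zoneinfo`.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Hypothetical: fields the target application refuses to accept as NULL.
REQUIRED_FIELDS = ("id", "ts", "zone", "revenue")


def to_utc(local_timestamp: str, zone_name: str) -> datetime:
    """Interpret a naive ISO timestamp in its source time zone and convert it to UTC.

    Ambiguous local times at the autumn clock change resolve to the first
    occurrence (fold=0), which is the datetime default.
    """
    naive = datetime.fromisoformat(local_timestamp)
    aware = naive.replace(tzinfo=ZoneInfo(zone_name))
    return aware.astimezone(timezone.utc)


def split_valid_and_rejected(records):
    """Add a UTC timestamp to complete records; set incomplete ones aside."""
    valid, rejected = [], []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
        if missing:
            # Do not invent a default: report what is missing and let someone
            # who understands the target decide.
            rejected.append({"record": record, "missing": missing})
            continue
        valid.append({**record, "ts_utc": to_utc(record["ts"], record["zone"])})
    return valid, rejected


# Hypothetical sample data: same wall-clock time, different zones and seasons.
transactions = [
    {"id": 1, "ts": "2019-07-01 12:00:00", "zone": "Europe/Amsterdam", "revenue": 10.0},
    {"id": 2, "ts": "2019-12-01 12:00:00", "zone": "Europe/Amsterdam", "revenue": None},
    {"id": 3, "ts": "2019-07-01 12:00:00", "zone": "America/New_York", "revenue": 5.0},
]

valid, rejected = split_valid_and_rejected(transactions)
for row in valid:
    print(row["id"], row["ts_utc"].isoformat())
for row in rejected:
    print("rejected", row["record"]["id"], "missing:", row["missing"])
```

Rejecting incomplete records rather than filling them is only one possible policy; the point is that the choice should be made by whoever understands the source and the target, which is exactly the question to ask up front.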

 

The most important points if you want things to go right

 

  1. Complexity
    • The questions: What exactly should the output be, and what exactly needs to be done?
    • The Nightmare: You build a data flow in NiFi. Near the end, the request comes to join two parts of the flow together, or to do some complex windowing. Given this kind of requirement you should have considered something like Spark. Perhaps you now need to redo some of the work to keep the flow logical, and introduce Kafka as well, as a buffer in between.
  2. Supplier Commitment
    • The questions: Who supplies the data? What is the SLA? Will I be informed if the structure changes? Will these changes be available for testing? Is the data supplier responsible for data quality?
    • The Nightmare: You don't get a commitment, and suddenly your consumers start seeing wrong results. It turns out a column definition was changed and you were not informed. After this you get a message that one of the smaller sources will be down for 12 hours, and you need this source to enrich your main source. So now you will be breaking the service level agreement with your consumers for a reason they may not want to understand.
  3. Nonfunctionals
    • The questions: How big is the data, and what is the expected throughput? What is the required latency?
    • The Nightmare: You design and test a flow with 10 messages per second, and buffers to cushion the volatility. You end up receiving 10000 messages per second. For this you may even need a bigger budget. After your throughput (budget) has been increased significantly, it turns out the buffers are too big and your throughput SLA is not met. Now you can go back to request an even larger compute capability.
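The Nonfunctionals questions above lend themselves to a quick back-of-the-envelope check before anything is built. The sketch below is a simplified model with constant arrival and processing rates; the function names and the buffer capacity of 100000 messages are illustrative assumptions, and only the 10 and 10000 messages per second come from the example above.

```python
import math


def backlog_after(seconds: float, arrival_rate: float, service_rate: float) -> float:
    """Messages waiting after `seconds`, starting empty, at constant rates."""
    return max(0.0, (arrival_rate - service_rate) * seconds)


def seconds_until_full(capacity: float, arrival_rate: float, service_rate: float) -> float:
    """Time until a buffer of `capacity` messages overflows (inf if it never does)."""
    surplus = arrival_rate - service_rate
    return capacity / surplus if surplus > 0 else math.inf


def worst_case_wait_seconds(capacity: float, service_rate: float) -> float:
    """How long the last message in a full buffer waits before it is processed."""
    return capacity / service_rate


designed_rate = 10       # messages per second the flow was designed and tested for
actual_rate = 10_000     # messages per second that actually arrive
buffer_capacity = 100_000  # hypothetical buffer size, in messages

# Flow sized for 10 msg/s but receiving 10000 msg/s: the backlog explodes.
print(backlog_after(60, actual_rate, designed_rate))
print(seconds_until_full(buffer_capacity, actual_rate, designed_rate))

# After scaling up to match the load, a full buffer still adds this much delay.
print(worst_case_wait_seconds(buffer_capacity, actual_rate))
```

Real flows have bursty arrivals and variable processing times, so these numbers are only a lower bound on the trouble, but they are usually enough to show whether the expected volume, buffer sizes and latency requirement can coexist.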

Of course there are other things to ask, such as requirements to work with specific (legacy) tooling, exact responsibilities per topic, or security guidelines to abide by. But typically these are the things I consider to be the most critical and specific to working with data.

Version history
Revision #:
2 of 2
Last update:
‎12-30-2019 09:42 AM