Member since: 01-07-2019
Posts: 220
Kudos Received: 23
Solutions: 30
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4600 | 08-19-2021 05:45 AM |
 | 1694 | 08-04-2021 05:59 AM |
 | 834 | 07-22-2021 08:09 AM |
 | 3428 | 07-22-2021 08:01 AM |
 | 3133 | 07-22-2021 07:32 AM |
11-09-2021
09:13 PM
Introduction of Apache Kafka 3.0
Thanks to a collaborative effort between Cloudera and other parties, Apache Kafka continues to improve rapidly, and now, this has resulted in the release of Apache Kafka 3.0!
The top two questions now are:
What does this mean for the setup that I have today?
Why are we excited about Apache Kafka 3.0? Any challenges we should be aware of?
Let us address both of these.
What does this mean for the setup that I have today?
First of all, there is no need to be concerned about your current footprint.
Every time Cloudera announces a release, it also announces an end of support date. The availability of a new version does not detract from this.
Furthermore, the Kafka 2 branch is very much still alive, and we are still planning to release improvements to this for a good period of time.
Why are we excited about Apache Kafka 3.0? Any challenges we should be aware of?
In short, we see Kafka 3.0 as a good foundation. There are many areas with small improvements, but we believe the following two will have the largest impact:
KRaft
Significant improvements have been made to KRaft, an initiative that aims to absorb the complexity of coordination into the Kafka project itself, rather than leveraging Apache ZooKeeper. This is mostly relevant for customers who do not use any other solution that leverages ZooKeeper.
At the time of writing, KRaft (kicked off by KIP-500 and enriched by initiatives such as KIP-630) is not yet ready for production, but perhaps by the time you read this article, a production-ready version is available in a 3.x release.
Big Cleanup
Though not directly valuable to end users, a large clean-up has been carried out. It resulted in the deprecation (not yet discontinuation) of the old message formats V0 and V1, so it is advisable to stop using these for new developments already. Users are also discouraged from setting up Kafka 3 clusters on Java versions below 11.
Conclusion
Kafka 3.0 offers a clean slate and a good foundation for enabling new developments. On the other hand, the 3.0 version is clearly a .0 version, and its main improvement is not ready for production usage, so the 2.x branch will stay in the spotlight a little longer. Cloudera will keep developing and evaluating versions internally, and as soon as a 3.x version becomes valuable for production, a Cloudera announcement will follow shortly!
For enthusiasts who want to know the details, the full list of improvements and changes can be found here.
Authored by
Dennis Jaheruddin (Global Streaming SME Lead) & Joseph Niemiec (Product Manager Kafka)
08-22-2021
08:52 PM
In past years many companies have come up with (Hybrid) Cloud strategies, and there is no shortage of recommendations for when and how to use Cloud providers such as AWS, Azure, GCP, and others. However, most advice is geared towards a generic setting, and may not translate well to a Big Data context.
And here is the main reason why:
BIG Data requires BIG resources that have a BIG cost impact
This does not mean that any specific vendor is expensive, and it is not limited to a single deployment form such as PaaS or SaaS. It comes down to the fundamental economics that infrastructure providers must cover the costs they incur, and if you use their solution for a big amount of data, these costs will be larger than for a small lightweight app.
In this context, hopefully, the following advice makes sense.
Avoid Network Usage
Perhaps the largest difference between a Big Data architecture and a regular architecture is how much it matters whether data is processed close to the source. This is often casually referred to as 'Data Gravity'.
If you get a large volume of data from a source in one location (e.g. on-premises or in a specific Cloud zone), it can be a very good idea to process the data close to the source. By filtering, aggregating, or otherwise reducing data before sending it across zones, data transfer fees can be avoided. Fun fact:
Buying a 10TB hard drive can cost you less than downloading 10TB from the Cloud.
Of course, the costs are less visible for uploading data to the cloud, but if your data comes from on-premises, you will easily find yourself needing to expand the internet capacity once the data volumes grow.
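To make the effect of reducing data before transfer concrete, here is a minimal arithmetic sketch. The per-GB price, the 90% reduction ratio, and the function names are illustrative assumptions, not figures from any provider's price list.

```python
# All prices and ratios below are illustrative assumptions, not provider quotes.
EGRESS_USD_PER_GB = 0.09  # assumed flat data-transfer-out price
GB_PER_TB = 1000


def egress_cost_usd(volume_tb: float, price_per_gb: float = EGRESS_USD_PER_GB) -> float:
    """Cost of moving `volume_tb` terabytes out of a zone at a flat per-GB price."""
    return volume_tb * GB_PER_TB * price_per_gb


def reduced_volume_tb(raw_tb: float, reduction_ratio: float) -> float:
    """Volume left after filtering or aggregating away `reduction_ratio` of the data."""
    if not 0.0 <= reduction_ratio <= 1.0:
        raise ValueError("reduction_ratio must be between 0 and 1")
    return raw_tb * (1.0 - reduction_ratio)


raw_tb = 10.0
raw_cost = egress_cost_usd(raw_tb)
# Assume processing close to the source removes 90% of the volume.
reduced_cost = egress_cost_usd(reduced_volume_tb(raw_tb, 0.9))
print(f"raw transfer: {raw_cost:.2f} USD, after reduction: {reduced_cost:.2f} USD")
```

Under these assumptions the transfer fee scales linearly with the volume that crosses the zone boundary, so every reduction made near the source translates directly into lower fees.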
Flexible Cloud Infrastructure vs Economical On-Prem infrastructure
This may come as a shock, but even hyperscale Cloud providers such as AWS, Azure and GCP do not claim their infrastructure is cheaper. The careful observer can note that instead Cloud providers indicate that the TCO should be lower when using Cloud, rather than purely the infrastructure costs. Though this can certainly be true, one should realize that a cost analysis for a few lightweight applications will be different than for a heavy platform.
Cost
My personal rule of thumb is that the break-even point for a cheaper TCO is around 30% utilization. Though it depends on your company context and exact solution, I have never seen anyone assume a break-even point outside the 15%-60% range, which leads to the following disturbing point:
If your server utilization is near 100%, the TCO will increase when going to the Cloud
This is completely independent of the solution. The only exceptions that I found so far are if your on-premise licenses cost several times more than the underlying hardware, or if it is actually possible to shut down a poorly utilized data center.
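The break-even reasoning can be sketched with a deliberately simplified model. It assumes that on-premises TCO is fixed regardless of utilization and that Cloud TCO scales linearly with utilization (idle resources are fully released). The cost figures and function names are hypothetical and chosen only so that the break-even lands at the 30% rule of thumb.

```python
def cloud_tco(utilization: float, cloud_tco_at_full_load: float) -> float:
    """Cloud TCO under the simplifying assumption that you only pay while resources are used."""
    return utilization * cloud_tco_at_full_load


def break_even_utilization(on_prem_tco: float, cloud_tco_at_full_load: float) -> float:
    """Utilization at which Cloud TCO equals the (fixed) on-premises TCO."""
    return on_prem_tco / cloud_tco_at_full_load


# Hypothetical yearly figures in arbitrary currency units.
ON_PREM_TCO = 300.0          # paid regardless of how busy the servers are
CLOUD_TCO_FULL_LOAD = 1000.0  # what the same capacity costs in the Cloud at 100% utilization

threshold = break_even_utilization(ON_PREM_TCO, CLOUD_TCO_FULL_LOAD)
for utilization in (0.10, 0.30, 1.00):
    cloud = cloud_tco(utilization, CLOUD_TCO_FULL_LOAD)
    cheaper = "Cloud" if cloud < ON_PREM_TCO else "on-premises (or equal)"
    print(f"utilization {utilization:.0%}: Cloud TCO {cloud:.0f}, cheaper option: {cheaper}")
```

In this model, anything below the threshold favors the Cloud and anything above it favors on-premises, which is why a near-100% utilized server ends up more expensive after a move.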
Value
Of course, the cost is often not the key objective when starting a Cloud journey. There are many reasons for using the Cloud, such as infrastructure flexibility and ease of use. The value of these may in fact outweigh the total cost of any scenario. In a (Hybrid) cloud strategy, the trick is to identify the key value points and meet these without incurring explosive costs. For example:
Do some use cases have low average and huge peak loads? --> These seem like excellent candidates for the Cloud.
Do some process steps, such as Development, require more flexibility? --> These can also be excellent candidates for the Cloud.
Of course, this last point only applies if you have a consistent solution in the Cloud and On-premises, such as the Cloudera Data Platform. In short, the key takeaway is:
Identify use cases and process steps that benefit from flexibility, and bring these to the Cloud while keeping the TCO under control.
Stay in control of your data
A third reason why Cloud strategies do not automatically cover Big Data is that it is all about Data! In order to stay in control, it is important to ensure both accessibility and security.
Security
A general security architecture will think about infrastructure level security, perhaps even file-level security. However, in a big data world, we must go one level deeper, and really nail down data level security. It has become very common that business units may only see a limited set of rows (e.g. from their own unit), or columns (e.g. not sensitive data) from a single table. So it is great that one can define a security policy on a Cloud object storage bucket, or perhaps files within this, but that is really not sufficient anymore in this ever-changing world.
It is no longer sufficient to give permissions on the file or table level, security MUST be applied on rows and columns
As a result one can choose between two solutions:
Putting the data 'inside' a database solution so it cannot get accessed directly. This is what most Cloud-specific (and classical) on-premise database solutions do. However, in a Big Data context, this not only inflates run costs, but it also means there is no way to get data out except through the database engine. This makes integration possibilities limited and significantly increases the difficulty of ever leaving the solution behind.
Using open formats for the data, letting it live in a Cloud storage with proper and detailed security policies in place. An open solution such as the Cloudera Data Platform can facilitate this.
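As an illustration of what security on rows and columns means in practice, here is a minimal sketch in plain Python. It is not how any specific product implements its policies; the `Policy` class, the `apply_policy` function, and the `unit` and `salary` column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: which business units a user may see, and which columns are hidden."""
    allowed_units: frozenset
    hidden_columns: frozenset = frozenset()


def apply_policy(rows, policy):
    """Return only the rows and columns that the policy allows."""
    visible = []
    for row in rows:
        # Row-level filter: skip rows that belong to other business units.
        if row["unit"] not in policy.allowed_units:
            continue
        # Column-level filter: drop sensitive columns from the remaining rows.
        visible.append({k: v for k, v in row.items() if k not in policy.hidden_columns})
    return visible


table = [
    {"unit": "sales", "customer": "A", "salary": 100},
    {"unit": "hr", "customer": "B", "salary": 200},
]
sales_policy = Policy(allowed_units=frozenset({"sales"}), hidden_columns=frozenset({"salary"}))
result = apply_policy(table, sales_policy)
print(result)
```

A file-level or bucket-level permission cannot express this: both rows live in the same table, yet the user may only see one of them, and only part of it.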
Accessibility
Especially when working with structured data (tables that can be queried with SQL), it can be very tempting to put the data in a database and assume it will always be extracted using the engine. However, in the Big Data world, the load that data processing solutions would put on these engines would be so large that direct data access is often preferred. Rather than sending a query to the engine, the solution directly reads the files from the (Cloud) storage layer, gaining much speed and cost-efficiency.
In a Big Data world, queries should NOT always need to hit the query engine
Therefore it is really recommended to work with a (Database) solution that can write directly to accessible files on the Cloud native storage.
Conclusion
Though far from exhaustive, this hopefully illustrates that when making a (Hybrid) Cloud strategy, it is important to realize that there are some key challenges to overcome when working with Big Data. The Cloudera Data Platform makes things easier from a technology perspective, and this article has hopefully at least identified the points that may require close attention. Of course, do reach out to your Cloudera contact when there are more detailed questions on how to enrich or fulfill the IT strategy of your company.
08-19-2021
05:45 AM
Mindset
Kafka is a popular message bus, with many applications producing into it or consuming from it. Though Flink is not limited to Kafka, Kafka is its primary source and sink. Therefore, when looking for Flink-Kafka information, the best place is probably your Flink supplier.
Practical
Though I do not fully understand the described security limitations, Cloudera has two ways to interact with Flink:
1. With the SQL Stream Builder: a graphical interface for writing SQL, which automatically gets converted into a Flink job. When using the SQL Stream Builder, you only need to connect a cluster once, and then all your jobs can use it.
2. Directly with Flink: there are many examples available on connecting with Kafka, for example: https://docs.cloudera.com/csa/1.2.0/datastream-connectors/topics/csa-kafka.html
Further information
If this does not help, first of all make sure to use the right version of the software (it is always recommended to use the Cloudera distribution, which makes integration easier). If you are already using the Cloudera distribution of Flink and need further assistance, please contact the Cloudera Support organization, as they can help you with more detailed information.
08-04-2021
05:59 AM
1 Kudo
I think you are probably looking for the LookupAttribute processor (or possibly the LookupRecord processor). It should allow you to look up the value somewhere (e.g. in a database), and afterwards you can route on the outcome: either directly based on matched/unmatched or, if it is more complicated (e.g. depending on what you actually found), perhaps using something like RouteOnAttribute.
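The lookup-then-route pattern described above can be sketched in plain Python. This only mimics the flow logic and does not use any NiFi API; the attribute names, the `CUSTOMER_TIERS` table, and the route names `priority` and `standard` are hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical lookup table standing in for an external source such as a database.
CUSTOMER_TIERS = {"c-001": "gold", "c-002": "silver"}


def lookup_tier(customer_id: str) -> Optional[str]:
    """Lookup step: return the value found for the key, or None if nothing matched."""
    return CUSTOMER_TIERS.get(customer_id)


def lookup_and_route(attributes: dict) -> Tuple[str, dict]:
    """Enrich the attributes with the lookup result, then pick a route based on it."""
    tier = lookup_tier(attributes.get("customer.id", ""))
    if tier is None:
        # Simple case: route directly on matched/unmatched.
        return "unmatched", attributes
    enriched = {**attributes, "customer.tier": tier}
    # More complicated case: route on what was actually found.
    route = "priority" if tier == "gold" else "standard"
    return route, enriched


print(lookup_and_route({"customer.id": "c-001"}))
print(lookup_and_route({"customer.id": "unknown"}))
```

In a real flow, the first branch corresponds to the matched/unmatched relationships of the lookup step, and the second to a separate routing step that inspects the enriched attribute.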
08-04-2021
05:28 AM
Looking at the version, it seems you are using an unsupported version. (The 1.14 version is supported in some forms, but not currently in the form that can be deployed on an arbitrary box.) Most of the time when people have problems installing software like NiFi, the problem can be solved by using one of the supported versions, which can be found here: https://www.cloudera.com/downloads For NiFi specifically, look under CDF (Cloudera Data Flow).
08-03-2021
02:05 PM
The general successor to Flume is NiFi. But if your usage is simple enough, then Kafka Connect may also suffice. Cloudera of course supports both.
08-03-2021
02:04 PM
I would always recommend using the Cloudera distribution, as people like me are not able to troubleshoot the upstream distributions, and we do note that it is common for people to run into trouble when using upstream versions. I am not sure about the exact time, but if you are interested in NiFi on K8s, then rather than trying to solve all challenges personally, you may also want to look into how the Cloudera Data Platform attacks this challenge for everyone.
08-03-2021
02:00 PM
I am honestly not sure. Timezones are one of the trickiest topics, so I cannot easily determine whether a mistake got fixed or a new one was introduced. If you have problems accessing the support portal, please reach out to the Cloudera contact person for your company. It is best to fix your support account before you try to log a support case during a production-down situation 😉
08-03-2021
01:48 PM
Not entirely sure if this was the issue (you seem to have run into it a few years ago), but it is important to understand the following:
Hive data is stored on HDFS.
However, the security policies for HDFS and Hive may be different.
In fact, it is recommended that you do NOT give HDFS-level permissions to anyone for the warehouse directory, and use Ranger to give out permissions on the SQL level only, to the databases and tables which live there. As a result, you might have been comparing apples to pears (trying to validate access by doing an HDFS read, and then using beeline to do a table read, which goes via different security policies).
07-22-2021
08:09 AM
I am not entirely sure what you are trying to achieve. If the set of data lives somewhere in a file, you should be able to read it (e.g. with ListFile and FetchFile). If the data is generated by a script, is that just for testing? You could look at ExecuteScript or ExecuteGroovyScript, or perhaps just generate fake data directly with GenerateFlowFile (though the exact variety that you mention would be tough with the latter).