About DennisJaheruddi

skommineni · ‎08-23-2023

@DianaTorres - Sure will do..thanks.

ralvad · ‎05-25-2023

Hi @PBS, I am looking for a similar solution. Did you manage to find one for this?

SAMZAG · ‎06-02-2022

You should run the VACUUM command first (this will delete the transaction history in the Delta log). After that, both of the above statements should produce the same results.

Corasmum · ‎03-31-2022

Fetch needs a filename to be able to fetch the file. If the filename is known, you can set it as an attribute on the flowfile or in a parameter context and pass that in as the filename. If you don't know what it will be, you'll need to use a List processor.

Carlb · ‎02-04-2022

Late response but... Are you running in FIPS mode?

RodrigoHDP · ‎12-17-2021

Dennis, thanks for your kind answer, but I'm not a company. I'm a person trying to use open source software to learn and contribute like anyone else ... If Cloudera claims that its software is open source, why is an agreement needed? It shouldn't be; it completely defeats the purpose. I'm not signing any "agreements": either the code is free or NOT, as simple as that!! If NOT, then just say so ... Stop using "open source" as a fashion flag; this simply diminishes the real coders who produce code and make it available REALLY FOR FREE!! If Cloudera took the freely available code, built upon it, and now wants to sell it, OK, I have no problem with that, but DON'T say it is open source, because it isn't if you charge for it ... Sorry, but this is so obvious to me ... Am I so blind or dumb, or am I missing something here?

DennisJaheruddi · ‎11-09-2021

Introduction of Apache Kafka 3.0

Thanks to a collaborative effort between Cloudera and other parties, Apache Kafka continues to improve rapidly, and this has now resulted in the release of Apache Kafka 3.0! The top two questions now are:

1. What does this mean for the setup that I have today?
2. Why are we excited about Apache Kafka 3.0? Any challenges we should be aware of?

Let us address both of these.

What does this mean for the setup that I have today?

First of all, there is no need to be concerned about your current footprint. Every time Cloudera announces a release, it also announces an end-of-support date. The availability of a new version does not detract from this. Furthermore, the Kafka 2 branch is very much still alive, and we are still planning to release improvements to it for a good period of time.

Why are we excited about Apache Kafka 3.0? Any challenges we should be aware of?

In short, we see Kafka 3.0 as a good foundation. There are many areas with small improvements, but we believe the following two will have the largest impact:

KRaft

Significant improvements have been made to KRaft, an initiative that aims to absorb the complexity of coordination into the Kafka project itself, rather than leveraging Apache ZooKeeper. This is mostly relevant for customers who do not use any other solution that leverages ZooKeeper. At the time of writing, KRaft (kicked off by KIP-500 and enriched by initiatives such as KIP-630) is not yet ready for production, but perhaps by the time you read this article, a production-ready version is available in a 3.x release.

Big Cleanup

Though not directly valuable to end users, a large clean-up has been executed, resulting in the deprecation (not yet discontinuation) of the old message formats V0 and V1, so it may be wise to stop using these for new developments already. Users are also discouraged from setting up Kafka 3 clusters on Java versions below 11.

Conclusion

Kafka 3.0 offers a clean slate and a good foundation for enabling new developments. On the other hand, the 3.0 version is clearly a .0 version and its main improvement is not ready for production usage, so the 2.x branch will stay in the spotlight a little longer. Cloudera will keep developing and evaluating versions internally, and as soon as a 3.x version becomes valuable for production, a Cloudera announcement will follow shortly! For enthusiasts who want to know the details, the full list of improvements and changes can be found here.

Authored by Dennis Jaheruddin (Global Streaming SME Lead) & Joseph Niemiec (Product Manager Kafka)

ShounenG · ‎08-23-2021

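Hi @blueb, thank you for your valuable feedback!

Basically, Flink connects to Kafka following the same pattern that regular Java projects use with Kafka; you can refer to this link for the main options available when a regular Java client connects to Kafka. If your Kafka cluster, which is not managed by CM, does not have authentication enabled, it should fall under "Unsecured".

I found a Flink ↔ Kafka demo project on Cloudera's official GitHub; you can refer to its job.properties. The same project also contains a demo for connecting to secure Kafka, which includes a section that configures the Kafka connection. You can see that its job.properties file defines: kafka.security.protocol=SASL_SSL

SASL_SSL means: use SASL/PLAIN (for enabling Kerberos authentication for Kafka in CDP, refer to this link) as the authentication method, and use SSL/TLS as the data transport (that is, in addition to configuring authentication, "Enable TLS/SSL for Kafka Broker" is also set in the CM UI). Reference: the official Confluent documentation.

If TLS/SSL is not enabled for the transport, you will see listeners = SASL_PLAINTEXT in the Kafka broker log (/var/log/kafka/server.log). If Kerberos authentication (or another SASL authentication such as LDAP or PAM) is enabled and "Enable TLS/SSL for Kafka Broker" is also set, you will see listeners = SASL_SSL. It is also worth noting that you can configure multiple listeners at the same time, so listeners = SASL_PLAINTEXT and listeners = SASL_SSL can coexist.

There is also a YouTube video demonstrating this demo code. I hope this information is useful as a reference.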
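The relationship between the authentication and TLS settings and the protocol value described above can be sketched in a few lines of Python. This is only an illustration, not code from the demo project: the function names, the example broker address, and the `kafka.bootstrap.servers` key (modeled on the `kafka.` prefix seen in job.properties) are assumptions. The four protocol names are the standard Kafka `security.protocol` values.

```python
def security_protocol(sasl_enabled: bool, tls_enabled: bool) -> str:
    """Map the two switches (SASL authentication, TLS transport) to the
    corresponding Kafka security.protocol value."""
    if sasl_enabled:
        return "SASL_SSL" if tls_enabled else "SASL_PLAINTEXT"
    return "SSL" if tls_enabled else "PLAINTEXT"


def job_properties(bootstrap_servers: str, sasl_enabled: bool, tls_enabled: bool) -> str:
    """Render a minimal properties snippet for a Kafka connection.

    The 'kafka.bootstrap.servers' key is an assumption for illustration;
    check the demo project's job.properties for the exact keys it expects.
    """
    lines = [
        f"kafka.bootstrap.servers={bootstrap_servers}",
        f"kafka.security.protocol={security_protocol(sasl_enabled, tls_enabled)}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical broker address for an external cluster without
    # authentication or TLS (the "Unsecured" case).
    print(job_properties("broker1.example.com:9092", sasl_enabled=False, tls_enabled=False))
```

For the unauthenticated external cluster discussed in this thread, both switches are off, so the sketch yields PLAINTEXT rather than the SASL_SSL value used in the secure demo.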

DennisJaheruddi · ‎08-22-2021

In past years many companies have come up with (Hybrid) Cloud strategies, and there is no shortage of recommendations for when and how to use Cloud providers such as AWS, Azure, GCP, and others. However, most advice is geared towards a generic setting and may not translate well to a Big Data context. Here is the main reason why:

BIG Data requires BIG resources that have a BIG cost impact

This does not mean that any specific vendor is expensive, and it is not limited to a single deployment form such as PaaS or SaaS. It comes down to fundamental economics: infrastructure providers must cover the costs they incur, and if you use their solution for a large amount of data, these costs will be larger than for a small, lightweight app. In this context, hopefully, the following advice makes sense.

Avoid Network Usage

Perhaps the largest difference between a Big Data architecture and a regular architecture is that it can matter a great deal whether data is processed close to the source. This is often casually referred to as 'Data Gravity'. If you get a large volume of data from a source in one location (e.g. on-premises or in a specific Cloud zone), it can be a very good idea to process the data close to the source. By filtering, aggregating, or otherwise reducing data before sending it across zones, data transfer fees can be avoided.

Fun fact: Buying a 10TB hard drive can cost you less than downloading 10TB from the Cloud.

Of course, the costs are less visible for uploading data to the Cloud, but if your data comes from on-premises, you will easily find yourself needing to expand your internet capacity once the data volumes grow.

Flexible Cloud Infrastructure vs Economical On-Prem Infrastructure

This may come as a shock, but even hyperscale Cloud providers such as AWS, Azure and GCP do not claim their infrastructure is cheaper.
The careful observer can note that Cloud providers instead indicate that the TCO, rather than purely the infrastructure cost, should be lower when using the Cloud. Though this can certainly be true, one should realize that a cost analysis for a few lightweight applications will be different from one for a heavy platform.

Cost

My personal rule of thumb is that the break-even point for a cheaper TCO is around 30% utilization. Though it depends on your company context and exact solution, I have never seen anyone assume a break-even point outside the 15%-60% range, which leads to the following disturbing point:

If your server utilization is near 100%, the TCO will increase when going to the Cloud

This is completely independent of the solution. The only exceptions that I have found so far are if your on-premises licenses cost several times more than the underlying hardware, or if it is actually possible to shut down a poorly utilized data center.

Value

Of course, cost is often not the key objective when starting a Cloud journey. There are many reasons for using the Cloud, such as infrastructure flexibility and ease of use. The value of these may in fact outweigh the total cost of any scenario. In a (Hybrid) Cloud strategy, the trick is to identify the key value points and meet these without incurring explosive costs. For example:

Do some use cases have low average and huge peak loads? --> These seem like excellent candidates for the Cloud.
Do some process steps, such as Development, require more flexibility? --> These can also be excellent candidates for the Cloud.

Of course, this last point only applies if you have a consistent solution in the Cloud and on-premises, such as the Cloudera Data Platform. In short, the key takeaway is:

Identify use cases and process steps that benefit from flexibility, and bring these to the Cloud while keeping the TCO under control.
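The break-even rule of thumb above can be made concrete with a deliberately simplified model. The sketch below assumes that on-premises capacity costs the same whether it is busy or idle, while Cloud capacity is only paid for while it is in use. The cost figures are hypothetical and chosen only so that the break-even lands at the 30% mentioned above; real TCO comparisons involve many more factors.

```python
def break_even_utilization(onprem_cost_per_hour: float, cloud_cost_per_busy_hour: float) -> float:
    """Utilization at which pay-per-use Cloud cost equals fixed on-prem cost.

    Simplifying assumption: on-prem cost is fixed per hour, Cloud cost
    scales linearly with the fraction of time the capacity is used.
    """
    return onprem_cost_per_hour / cloud_cost_per_busy_hour


def cheaper_option(utilization: float, onprem_cost_per_hour: float,
                   cloud_cost_per_busy_hour: float) -> str:
    """Compare the two options at a given utilization (0.0 to 1.0)."""
    cloud_cost = utilization * cloud_cost_per_busy_hour
    if cloud_cost < onprem_cost_per_hour:
        return "cloud"
    if cloud_cost > onprem_cost_per_hour:
        return "on-premises"
    return "equal"


if __name__ == "__main__":
    # Hypothetical costs in cents per server-hour.
    ONPREM, CLOUD = 30, 100
    print(f"break-even utilization: {break_even_utilization(ONPREM, CLOUD):.0%}")
    for u in (0.10, 0.95):
        print(f"utilization {u:.0%}: cheaper option is {cheaper_option(u, ONPREM, CLOUD)}")
```

Under these assumptions, a workload with a low average load falls below the break-even point and favors the Cloud, while a server that is busy nearly all the time favors on-premises, which is the pattern the examples above describe.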
Stay in control of your data

A third reason why Cloud strategies do not automatically cover Big Data is that it is all about Data! In order to stay in control, it is important to ensure both accessibility and security.

Security

A general security architecture will think about infrastructure-level security, perhaps even file-level security. However, in a Big Data world, we must go one level deeper and really nail down data-level security. It has become very common that business units may only see a limited set of rows (e.g. from their own unit) or columns (e.g. no sensitive data) from a single table. So it is great that one can define a security policy on a Cloud object storage bucket, or perhaps on files within it, but that is really not sufficient anymore in this ever-changing world.

It is no longer sufficient to give permissions on the file or table level; security MUST be applied on rows and columns

As a result, one can choose between two solutions:

1. Putting the data 'inside' a database solution so it cannot be accessed directly. This is what most Cloud-specific (and classical) on-premises database solutions do. However, in a Big Data context, this not only inflates run costs, but it also means there is no way to get data out except through the database engine. This limits integration possibilities and significantly increases the difficulty of ever leaving the solution behind.
2. Using open formats for the data and letting it live in Cloud storage with proper and detailed security policies in place. An open solution such as the Cloudera Data Platform can facilitate this.

Accessibility

Especially when working with structured data (tables that can be queried with SQL), it can be very tempting to put the data in a database and assume it will always be extracted using the engine. However, especially in the Big Data world, the load that data processing solutions would put on these engines would be so large that direct data access is often preferred.
Rather than sending a query to the engine, the solution directly reads the files from the (Cloud) storage layer, gaining much speed and cost-efficiency.

In a Big Data world, queries should NOT always need to hit the query engine

Therefore it is strongly recommended to work with a (Database) solution that can write directly to accessible files on the Cloud-native storage.

Conclusion

Though far from exhaustive, this hopefully illustrates that when making a (Hybrid) Cloud strategy, it is important to realize that there are some key challenges to overcome when working with Big Data. The Cloudera Data Platform makes things easier from a technology perspective, and this article has hopefully at least identified the points that may require close attention. Of course, do reach out to your Cloudera contact when there are more detailed questions on how to enrich or fulfill the IT strategy of your company.
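To illustrate what the row- and column-level security described in the Security section means in practice, here is a minimal sketch in plain Python. It is not how the Cloudera Data Platform or any particular policy engine implements this; the `Policy` class, the `apply_policy` function, and the `business_unit` column name are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: which business units' rows and which columns
    a given group of users may see."""
    allowed_units: frozenset
    allowed_columns: frozenset


def apply_policy(rows, policy, unit_column="business_unit"):
    """Return only the permitted rows, reduced to the permitted columns."""
    return [
        {col: val for col, val in row.items() if col in policy.allowed_columns}
        for row in rows
        if row.get(unit_column) in policy.allowed_units
    ]


if __name__ == "__main__":
    table = [
        {"business_unit": "retail", "customer": "A", "revenue": 100, "ssn": "xxx-1"},
        {"business_unit": "wholesale", "customer": "B", "revenue": 250, "ssn": "xxx-2"},
    ]
    retail_policy = Policy(
        allowed_units=frozenset({"retail"}),
        allowed_columns=frozenset({"business_unit", "customer", "revenue"}),
    )
    # The retail unit sees only its own row, without the sensitive column.
    print(apply_policy(table, retail_policy))
```

A bucket-level or file-level permission cannot express this kind of rule, because both rows sit in the same table and the sensitive column sits in the same file as the others.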

VidyaSargur · ‎08-19-2021

@midee, were you able to implement the suggestions? Has the reply helped resolve your issue? If so, can you please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?

Last Visited	‎12-15-2021 03:18 AM

Member Since	‎01-07-2019 03:54 AM
Posts	220
Kudos received	31

Cloudera Community

Re: How can a Flink program on a Kerberos-enabled cluster connect to a Kafka outside the cluster that has no authentication enabled

Re: Attribute validation against MSSQL database

Re: Put array with Dates on nifi flowfile

Re: NiFi templates don't include all controller se...

Re: Concatenations of Multiple Attributes in Nifi

Re: Issue creating/accessing hive external table w...

Re: Apache Nifi: I want to compare or find missi...

Re: Mismatch in the count between reading data in ...

Re: Read files in s3 bucket one by one using fetch...

Re: NiFi-1.14.0 failure to start: Unable to start ...

Re: How do I download the latest version of Ambari...

The arrival of Apache Kafka 3.0 by Cloudera

Cloud in a Big Data world
