About sunile_manjee

VidyaSargur · ‎08-06-2021

@jpcastillo2, as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.

VidyaSargur · ‎06-23-2021

@Sunny93 as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.

MattWho · ‎06-07-2021

@myuintelli2021 Assisting on your new post here: https://community.cloudera.com/t5/Support-Questions/Nifi-untrusted-proxy-caused-by-Untrusted-Proxy-Exception/m-p/317796/highlight/false#M227327 Your choice of user authentication does not matter here. Authentication and Authorization processes are handled independently of one another. The Authentication of users/clients results in a string which is evaluated against identity mapping properties and then passed to the configured authorizer for authorization. Your exception points and missing /proxy authorization for your node strings. Hope this helps, Matt

sunile_manjee · ‎01-20-2021

Credits to @mbawa (Mandeep Singh Bawa) who co-built all the assets in this article. Thank you! We (Mandeep and I) engaged on a customer use case where Cloudera Data Engineering (Spark) jobs were triggered once a file lands in S3 (details on how to trigger CDE from Lambda here). Triggering CDE jobs is quite simple; however, we needed much more. Here are a few of the requirements: Decoupling Ingestion Layer / Processing Layer Decoupling apps (sender) from Spark Apps can send and forget payloads without the burden of configuring Spark (#number of executors, memory/cpu, etc), the concern of Spark availability (Upgrades, resources availability, etc), or application impacts from CDE API changes Real-time changes to where CDE jobs are sent (Multi CDE) Monitor job status and alerts Monitoring job run times and alerts which may be out-of-spec runtimes Failover to Secondary CDE Throttling Authentication It may look as though we are trying to make NiFi into an orchestration engine for CDE. That's not the case. Here we are trying to fill some core objectives and leveraging capabilities within the platform to accomplish the above-stated task. CDE comes with Apache Airflow, a much richer orchestration engine. Here we are integrating AWS triggers, multiple CDE clusters, monitoring, alerting, and single API for multi clusters. Artifacts NiFi CDE Jobs Pipeline Workflow Streams Messaging Cluster (Kafka) CDF clusters (NiFi) Heavy usage of NiFi parameters High-Level WorkFlow At a high level, the NiFi workflow does the following: Exposes a single rest endpoint for CDE job submission CDE workload balancing between multiple CDE clusters If only a single CDE cluster is available, it will queue jobs until compute bandwidth is available Queue jobs if CDE clusters are too busy Jobs will re-run if set in the queue If the number of retry for a job spec is greater than 3 (parameterized), an alert will be triggered Monitor jobs from start to finish Alert if job Fails Run time out of predetermined max run time i.e. jobs run for 10 minutes and max run time for jobs is set to 5 minutes Setup The following NiFi parameters will be required api_token (CDE Token, more on this later) Set to ${cdeToken} job-runtime-threshold-ms Max run time a job should run before an alert is triggered kbrokers Kafka brokers ktopic-fail Kafka topic: cde-job-failures ktopic-inbound-jobs Kafka topic: cde-jobs ktopic-job-monitoring Kafka topic: cde-job-monitoring ktopic-job-runtime-over-limit Kafka topic: cde-job-runtime-alert ktopic-retry Kafka topic: cde-retry username CDE Machine user password CDE machine user password primary-vc-token-api CDE token api (more on this later) primary_vc_jobs_api CDE Primary cluster jobs api (more on this later) secondary-vc-available Y/N If secondary CDE cluster is available, set to Y, else N secondary_vc_jobs_api CDE secondary cluster jobs API if the secondary cluster is available run_count_limit Max number of concurrent running jobs per CDE cluster i.e. 20 wait-count-max Max retry count. If a job is unable to be submitted to CDE (ie due to be too busy), how many times should NiFi retry before adding job to Kafka ktopic-fail topic i.e. 5 start_count_limit Max number of concurrent starting jobs per CDE cluster i.e. 20 Note: When you run the workflow for the first time, generally the Kafka topics will be automatically created for you. Detailed WorkFlow Once a CDE job spec is sent to NiFi, NiFi does the following: Write job spec to Kafka ktopic-inbound-jobs (nifi parameter) topic Pull jobs from Kafka ktopic-inbound-jobs (nifi parameter) topic New jobs- ktopic-inbound-jobs (nifi parameter) topic retry jobs- ktopic-retry (nifi parameter) topic Monitoring jobs- ktopic-job-monitoring (nifi parameter) topic Fetch CDE API tokens Check if the primary cluster current run count is less than run_count_limit (nifi parameter) Check if the primary cluster current starting count is less than start_count_limit (nifi parameter) If run or start counts are not within limit, retry the same logic on the secondary cluster (if available, secondary-vc-available) If run/start counts are within limit, job spec will be submitted to CDE If run/start counts are not within limit for primary and secondary CDE and the number of retries is less than wait-count-max (nifi parameter), job spec will be written to a Kafka ktopic-retry topic (nifi parameter) Monitoring NiFi will call CDE to determine the current status of Job ID (pulled from ktopic-job-monitoring) If the job end is successful, nothing more will happen here. If the job ends with failure, job spec will be written to Kafka ktopic-fail topic If the job is running and run time is less than job-runtime-threshold-ms Write job spec to ktopic-job-monitoring Else send an alert (nifi parameter) CDE APIs To get started, CDE primary and secondary (if available) cluster API details are needed in NiFi as parameters: To fetch the token API, click the pencil icon: Click on Grafana URL: The URL will look something like this: https://service.cde-zzzzzz.moad-aw.aaaaa-aaaa.cloudera.site/grafana/d/sK1XDusZz/kubernetes?orgId=1&refresh=5s Set the NiFi parameter primary-vc-token-api to the first part of the URL: service.cde-zzzzzz.moad-aw.aaaaa-aaaa.cloudera.site Now get the Jobs API for both primary and secondary (if available). For a virtual cluster, Click the pencil icon Click Jobs API URL to copy the URL The jobs URL will look something like this: https://aaa.cde-aaa.moad-aw.aaa-aaa.cloudera.site/dex/api/v1 Fetch the first part of the URL and set the NiFi parameter primary_vc_jobs_api. Do the same steps for secondary_vc_jobs_api aaa.cde-aaa.moad-aw.aaa-aaa.cloudera.site Run a CDE job Inside of the NiFi workflow, there is a test flow to verify the NiFi CDE jobs pipeline works: To run the flow, inside of InvokeHTTP, set the URL to one of the NiFi nodes. Run it and if the integration is working successfully; you will see a job running in CDE. Enjoy! Oh, by the way, I plan on publishing a video walking through the NiFi flow.

suraj143 · ‎11-11-2020

You can try this ${message:unescapeXml()} This function unescapes a string containing XML entity escapes to a string containing the actual Unicode characters corresponding to the escapes.

sunile_manjee · ‎11-09-2020

Recently I ran into a scenario requiring to connect my Spark Intellij IDE to Kafka DataHub. I'm not going to claim the status of a pro at IDE secure setup. Therefore for novices in the security realm alike, they may find this article useful This article will go through steps setting up an Spark Scala IDE (Intellij) (with a supplied working code example) to connect securely to a Kafka DataHub over SASL_SSL protocol using PLAIN SASL mechanism. Artifacts https://github.com/sunileman/spark-kafka-streaming Scala Object https://github.com/sunileman/spark-kafka-streaming/blob/master/src/main/scala/KafkaSecureStreamSimpleLocalExample.scala The scala object accepts 2 inputs Target Kafka topic Kafka broker(s) Prequequites Kafka DataHub Instances Permission setup on Ranger to be able to read/write from Kafka Intellij (or similar) with the Scala plugin installed Workload username and password TrustStore Andre Sousa Dantas De Araujo did a great job explaining (very simply) how get the certificate from CDP and create a truststore. Just a few simple steps here https://github.com/asdaraujo/cdp-examples#tls-truststore I stored it here on my local machine which is referenced in the spark scala code ./src/main/resources/truststore.jks JaaS Setup Create a jaas.conf file KafkaClient { org.apache.kafka.common.security.plain.PlainLoginModule required username="YOUR-WORKLOAD-USER" password="YOUR-WORKLOAD-PASSWORD"; }; I stored mine here which is referenced in the spark scala code ./src/main/resources/jaas.conf Spark Session (Scala Code) Master is set to local set spark.driver.extraJavaOptions and spark.executor.extraJavaOptions to the location of your jaas.conf set spark.kafka.ssl.truststore.location to the location of your truststore val spark = SparkSession.builder .appName("Spark Kafka Secure Structured Streaming Example") .master("local") .config("spark.kafka.bootstrap.servers", kbrokers) .config("spark.kafka.sasl.kerberos.service.name", "kafka") .config("spark.kafka.security.protocol", "SASL_SSL") .config("kafka.sasl.mechanism", "PLAIN") .config("spark.driver.extraJavaOptions", "-Djava.security.auth.login.config=./src/main/resources/jaas.conf") .config("spark.executor.extraJavaOptions", "-Djava.security.auth.login.config=./src/main/resources/jaas.conf") .config("spark.kafka.ssl.truststore.location", "./src/main/resources/truststore.jks") .getOrCreate() Write to Kafka The data in the dataframe is hydrated via csv file. Here I will simply read the dataframe and write it back out to a Kafka topic val ds = streamingDataFrame.selectExpr("CAST(id AS STRING)", "CAST(text AS STRING) as value") .writeStream.format("kafka") .outputMode("update") .option("kafka.bootstrap.servers", kbrokers) .option("topic", ktargettopic) .option("kafka.sasl.kerberos.service.name", "kafka") .option("kafka.ssl.truststore.location", "./src/main/resources/truststore.jks") .option("kafka.security.protocol", "SASL_SSL") .option("kafka.sasl.mechanism", "PLAIN") .option("checkpointLocation", "/tmp/spark-checkpoint2/") .start() .awaitTermination() Run Supply JVM option, provide the location of the jaas.conf -Djava.security.auth.login.config=/PATH-TO-YOUR-jaas.conf Supply the program arguments. My code takes 2, kafka topic and Kafka broker(s) sunman my-kafka-broker:9093 That's it! Run it and enjoy secure SparkStreaming+Kafka glory

muslihuddin · ‎11-03-2020

Hi, went through my testing again. Unfortunately, I missed the step where I need to change/add the parameters on my commandline. Change my TeraGen, TeraSort and TeraValidate parameter and got better results TeraGen: 1 min 57 sec TeraSort: 22min 55sec TeraValidate: 1 min 23sec. Thank you very much for your writeup again.

VidyaSargur · ‎10-12-2020

Hi @Kumar78, as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question.

ashinde · ‎10-12-2020

@sunile_manjee I am not very familiar with AWS ELB but you can try to use HandleHttpRequest and HandleHttpResponse processors and check if it serves your use case

sunile_manjee · ‎09-11-2020

Recently I was engaged in a use case where CDE processing was required to be triggered once data landed on s3. The s3 trigger in AWS would be via a Lambda function. As the files/data land in s3, an AWS Lambda function would be triggered to then call CDE to process the data/files. Lambda functions at trigger time include the names and locations of the files the trigger was executed upon. The file locations/names would be passed onto the CDE engine to pick up and process accordingly. Prerequisites to run this demo AWS account s3 Bucket Some knowledge of Lambda CDP and CDE Artifacts AWS Lambda function code https://github.com/sunileman/spark-kafka-streaming/blob/master/src/main/awslambda/triggerCDE.py CDE Spark Job, main class com.cloudera.examples.SimpleCDERun Code for class com.cloudera.examples.SimpleCDERun https://github.com/sunileman/spark-kafka-streaming Prebuilt jar https://sunileman.s3.amazonaws.com/CDE/spark-kafka-streaming_2.11-1.0.jar Processing Steps Create a CDE Job (Jar provided above) Create a Lambda function on an s3 bucket (Code provided above) Trigger on put/post Load a file or files on s3 (any file) AWS Lambda is triggered by this event which calls CDE. The call to CDE will include the locations and names of all files the trigger was executed upon CDE will launch, processing the files, and end gracefully It's quite simple. Create a CDE Job Name: Any Name. I called it testjob Spark Application: Jar file provided above Main Class: com.cloudera.examples.SimpleCDERun Lambda Create an AWS Lambda function to trigger on put/post for s3. The lambda function code is simple. It will call CDE for each file posted to s3. Lambda function provided in the artifacts section above. The following are the s3 properties: Trigger CDE Upload a file to s3. Lambda will trigger the CDE job. For example, I uploaded a file test.csv to s3. Once the file was uploaded, Lambda calls CDE to execute a job on that file Lambda Log The first arrow shows the file name (test.csv). The second arrow shows the CDE JobID, which in this case returned the number 14. In CDE, Job Run ID: 14 In CDE stdout logs show that the job received the location and name of the file which Lambda was triggered upon. As I said in my last post, CDE is making things super simple. Enjoy.

Online	Offline
Last Visited	‎05-25-2022 10:07 AM

Member Since	‎05-30-2018 10:40 PM
Last Visited	‎05-25-2022 10:07 AM
Posts	1,322
Kudos received	713

Cloudera Community

Re: Iterate over ADLS files using spark?

Re: Install NiFi CA service post nifi cluster inst...

Re: Which storage format is optimum for training m...

Re: Ambari custom alert failing

Re: df.cache() is not working on jdbc table

Re: Hortonworks sandbox 2.4 root password does not...

Re: describe phoenix table to find primary key

Re: NiFi Untrusted proxy

Using NiFi for CDE Jobs Pipeline

Re: nifi regex replace special characters

Connecting Spark IDE Securely to Kafka DataHub

Re: TeraGen, TeraSort, and TeraValidate Performanc...

Re: NIFI: SelectHiveQL vs QueryDatabaseTable

Re: using AWS ELB with ListenHTTP on nifi

CDE Triggered By AWS Lambda