Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4018 | 08-20-2018 08:26 PM |
| | 1924 | 08-15-2018 01:59 PM |
| | 2356 | 08-13-2018 02:20 PM |
| | 4067 | 07-23-2018 04:37 PM |
| | 4972 | 07-19-2018 12:52 PM |
08-28-2020
09:44 AM
2 Kudos
The all new Cloudera Data Engineering Experience

I recently had the opportunity to work with Cloudera Data Engineering to stream data from Kafka. It's quite interesting how I was able to deploy code without much worry about how to configure the back-end components.

Demonstration

This demo will pull from the Twitter API using NiFi and write the payload to a Kafka topic named "twitter". Spark Structured Streaming on Cloudera Data Engineering Experience (CDE) will pull from the twitter topic, extract the text field from the payload (which is the tweet itself), and write back to another Kafka topic named "tweet". The following is an example of a twitter payload; the objective is to extract only the text field.

What is Cloudera Data Engineering?

Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit Spark jobs to an auto-scaling cluster. CDE enables you to spend more time on your applications and less time on infrastructure.

How do I begin with Cloudera Data Engineering (CDE)?

Complete setup instructions here.

Prerequisites

- Access to CDE
- Some understanding of Apache Spark
- Access to a Kafka cluster. In this demo, I use Cloudera DataHub, Streams Messaging, for rapid deployment of a Kafka cluster on AWS.
- An IDE. I use IntelliJ, but I do provide the jar later on in this article.
- Twitter API developer access: https://developer.twitter.com/en/portal/dashboard

Setting up a twitter stream

I use Apache NiFi deployed via Cloudera DataHub on AWS.

Source Code

I posted all my source code here. If you're not interested in building the jar, that's fine; I've made the job jar available here. Oct 26, 2020 update: I added source code for how to connect CDE to a Kafka DataHub cluster, available here. Users should be able to run that code as is, without the need for a jaas file or keytab.

Kafka Setup

This article is focused on Spark Structured Streaming with CDE, so I'll be super brief here. Create two Kafka topics:

- twitter: used to ingest the firehose data from the Twitter API
- tweet: used post tweet extraction, performed via Spark Structured Streaming

NiFi Setup

This article is focused on Spark Structured Streaming with CDE, so I'll be super brief here. Use the GetTwitter processor (which requires a Twitter API developer account, free) and write to the Kafka twitter topic.

Spark Code (Scala)

Load up the Spark code on your machine from here: https://github.com/sunileman/spark-kafka-streaming

Fire off an sbt clean and package. A new jar will be available under target: spark-kafka-streaming_2.11-1.0.jar. The jar is also available here.

What does the code do? It will pull from the source Kafka topic (twitter), extract the text value from the payload (which is the tweet itself), and write to the target topic (tweet). A minimal sketch of this logic is shown at the end of this post.

CDE

Assuming CDE access is available, navigate to Virtual Clusters > View Jobs and click on Create Job.

Job Details

- Name: the job name
- Spark Application File: the jar created from the sbt package, spark-kafka-streaming_2.11-1.0.jar. Another option is to simply provide the URL where the jar is available: https://sunileman.s3.amazonaws.com/CDE/spark-kafka-streaming_2.11-1.0.jar
- Main Class: com.cloudera.examples.KafkaStreamExample
- Arguments:
  - arg1, the source Kafka topic: twitter
  - arg2, the target Kafka topic: tweet
  - arg3, the Kafka brokers: kafka1:9092,kafka2:9092,kafka3:9092

From here, jobs can be created and run, or simply created. Click on Create and Run to view the job run and the metrics about the stream. At this point, only the text (tweet) from the twitter payload is being written to the tweet Kafka topic.
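For reference, here is a minimal sketch of what such a job can look like. This is not the actual KafkaStreamExample source, just an illustration of the same read-extract-write pattern using the standard Spark Structured Streaming Kafka source and sink (the object name and checkpoint path are arbitrary examples):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    // arg1 = source topic, arg2 = target topic, arg3 = broker list
    val Array(sourceTopic, targetTopic, brokers) = args

    val spark = SparkSession.builder()
      .appName("KafkaStreamSketch")
      .getOrCreate()

    // Read the raw twitter payload from the source topic
    val tweets = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", sourceTopic)
      .load()
      // Kafka values arrive as bytes; cast to string, then pull out the tweet text
      .selectExpr("CAST(value AS STRING) AS json")
      .select(get_json_object(col("json"), "$.text").alias("value"))

    // Write only the tweet text to the target topic
    val query = tweets.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("topic", targetTopic)
      .option("checkpointLocation", "/tmp/kafka-stream-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```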
That's it! You now have a Spark Structured Streaming job running on CDE, fully autoscaled. Enjoy!
08-14-2020
06:37 AM
@DivyaKaki The exception implies that the complete trust chain does not exist to facilitate a successful mutual TLS handshake between this NiFi and the target NiFi-Registry. NiFi uses the keystore and truststore configured in its nifi.properties file, and NiFi-Registry uses the keystore and truststore configured in its nifi-registry.properties file. Openssl can be used to display the public certificates for the complete trust chain:

openssl s_client -connect <nifi-registry-hostname>:<port> -showcerts
openssl s_client -connect <nifi-hostname>:<port> -showcerts

For each public cert, you will see:

-----BEGIN CERTIFICATE-----
MIIESjCCAzKgAwIBAgINAeO0mqGNiqmBJWlQuDANBgkqhkiG9w0BAQsFADBMMSAw
HgYDVQQLExdHbG9iYWxTaWduIFJvb3QgQ0EgLSBSMjETMBEGA1UEChMKR2xvYmFs
U2lnbjETMBEGA1UEAxMKR2xvYmFsU2lnbjAeFw0xNzA2MTUwMDAwNDJaFw0yMTEy
MTUwMDAwNDJaMEIxCzAJBgNVBAYTAlVTMR4wHAYDVQQKExVHb29nbGUgVHJ1c3Qg
U2VydmljZXMxEzARBgNVBAMTCkdUUyBDQSAxTzEwggEiMA0GCSqGSIb3DQEBAQUA
A4IBDwAwggEKAoIBAQDQGM9F1IvN05zkQO9+tN1pIRvJzzyOTHW5DzEZhD2ePCnv
UA0Qk28FgICfKqC9EksC4T2fWBYk/jCfC3R3VZMdS/dN4ZKCEPZRrAzDsiKUDzRr
mBBJ5wudgzndIMYcLe/RGGFl5yODIKgjEv/SJH/UL+dEaltN11BmsK+eQmMF++Ac
xGNhr59qM/9il71I2dN8FGfcddwuaej4bXhp0LcQBbjxMcI7JP0aM3T4I+DsaxmK
FsbjzaTNC9uzpFlgOIg7rR25xoynUxv8vNmkq7zdPGHXkxWY7oG9j+JkRyBABk7X
rJfoucBZEqFJJSPk7XA0LKW0Y3z5oz2D0c1tJKwHAgMBAAGjggEzMIIBLzAOBgNV
HQ8BAf8EBAMCAYYwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMBIGA1Ud
EwEB/wQIMAYBAf8CAQAwHQYDVR0OBBYEFJjR+G4Q68+b7GCfGJAboOt9Cf0rMB8G
A1UdIwQYMBaAFJviB1dnHB7AagbeWbSaLd/cGYYuMDUGCCsGAQUFBwEBBCkwJzAl
BggrBgEFBQcwAYYZaHR0cDovL29jc3AucGtpLmdvb2cvZ3NyMjAyBgNVHR8EKzAp
MCegJaAjhiFodHRwOi8vY3JsLnBraS5nb29nL2dzcjIvZ3NyMi5jcmwwPwYDVR0g
BDgwNjA0BgZngQwBAgIwKjAoBggrBgEFBQcCARYcaHR0cHM6Ly9wa2kuZ29vZy9y
ZXBvc2l0b3J5LzANBgkqhkiG9w0BAQsFAAOCAQEAGoA+Nnn78y6pRjd9XlQWNa7H
TgiZ/r3RNGkmUmYHPQq6Scti9PEajvwRT2iWTHQr02fesqOqBY2ETUwgZQ+lltoN
FvhsO9tvBCOIazpswWC9aJ9xju4tWDQH8NVU6YZZ/XteDSGU9YzJqPjY8q3MDxrz
mqepBCf5o8mw/wJ4a2G6xzUr6Fb6T8McDO22PLRL6u3M4Tzs3A2M1j6bykJYi8wW
IRdAvKLWZu/axBVbzYmqmwkm5zLSDW5nIAJbELCQCZwMH56t2Dvqofxs6BBcCFIZ
USpxu6x6td0V7SvJCCosirSmIatj/9dSSVDQibet8q/7UK4v4ZUN80atnZz1yg==
-----END CERTIFICATE-----

The above is just an example public cert, from the openssl command run against google.com:443. You will need to make sure that every certificate in the chain presented when running against the NiFi UI is added to the truststore on NiFi-Registry, and vice versa; an example keytool import is shown below. You'll need to restart NiFi and NiFi-Registry before changes to your keystore or truststore files are read in. Hope this helps, Matt
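For reference, importing a saved public certificate (.pem) into a JKS truststore with keytool looks like the following; the alias, file name, and keystore path are illustrative, so adjust them to your environment:

```
keytool -importcert \
  -alias nifi-registry-ca \
  -file registry-ca.pem \
  -keystore /opt/nifi/conf/truststore.jks \
  -storepass <truststore-password>
```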
08-11-2020
01:40 PM
2 Kudos
Image Courtesy: k9s
I recently ran into a scenario where I needed to gather Hive logs on the new Data Warehouse Experience on AWS. The "old" way of fetching logs was to SSH into the nodes, but Data Warehouse Experience is deployed on K8s, so SSHing is off the table. A tool like K9s is therefore key. This is a quick article demonstrating how to use K9s to fetch logs from Data Warehouse Experience deployed on AWS K8s.
Prerequisites
Data Warehouse Experience
K9s installed on your machine
AWS ARN (instructions provided below)
AWS configure (CLI) pointing to your AWS env. Simply run aws configure in the CLI and point it at the correct AWS subscription (see the example below)
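For example, the aws configure prompts look like this (the values shown are placeholders):

```
$ aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: us-west-2
Default output format [None]: json
```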
AWS ARN
Your AWS ARN is required to successfully connect K9s to CDW (DW-X).
On AWS, go to IAM > Users > Search for your user name:
Click on your username to fetch the ARN:
Kubeconfig
Connecting to DW-X using K9s requires a kubeconfig. DW-X makes this available under DW-X > Environments > Your Environment > Show Kubeconfig.
Click on the copy option and save the contents to a file on your machine's file system. For example, I stored the kubeconfig contents here: /Users/sunile.manjee/.k9s/kubeconfig.yml
ARN
To access K8s from K9s, your ARN will need to be added under Grant Access:
K9s
Now everything is set up to connect to the DW-X K8s cluster using K9s. Reference the kubeconfig.yml file when launching K9s:
k9s --kubeconfig /Users/sunile.manjee/.k9s/kubeconfig.yml
That's it. From here the logs are made available, along with a ton of other metrics. For more information on how to use K9s, see k9scli.io.
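If you ever want to script log collection rather than browse interactively, the same kubeconfig also works with plain kubectl; the pod and namespace names below are placeholders:

```
kubectl --kubeconfig /Users/sunile.manjee/.k9s/kubeconfig.yml get pods -A
kubectl --kubeconfig /Users/sunile.manjee/.k9s/kubeconfig.yml \
  logs <hive-pod-name> -n <dw-namespace>
```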
06-05-2020
11:17 AM
@LearnerAdmin It is not clear to me what you are asking when you say "add NIFI CA in authorities". Instructions on using the NiFi TLS Toolkit can be found here: https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#tls_toolkit Using the client/server TLS Toolkit operational mode, covered here: https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#client-server will give you the ability to run a NiFi CA "server" that signs the NiFi node certificates created using the "client" mode. Thanks, Matt
06-04-2020
08:28 AM
Probably worth pointing out that the behaviour of insertInto and saveAsTable can differ under certain conditions: https://towardsdatascience.com/understanding-the-spark-insertinto-function-1870175c3ee9 https://stackoverflow.com/questions/47844808/what-are-the-differences-between-saveastable-and-insertinto-in-different-savemod A short sketch of the difference follows.
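Here is a minimal sketch of the by-name vs. by-position difference those links describe; the table name and data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object InsertIntoVsSaveAsTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("InsertIntoVsSaveAsTable")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    df.write.saveAsTable("people") // creates the table with columns (id, name)

    val reordered = df.select("name", "id")

    // saveAsTable in append mode resolves columns by NAME: rows land correctly
    reordered.write.mode("append").saveAsTable("people")

    // insertInto matches columns by POSITION: the "name" strings are cast into
    // the integer "id" column (yielding nulls), and the ids land in "name"
    reordered.write.mode("append").insertInto("people")
  }
}
```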
05-17-2020
08:41 PM
Hi @kettle
As this thread was marked 'Solved' in June of 2016, you would have a better chance of receiving a useful response by starting a new thread. A new thread will also give you the opportunity to provide details specific to your use of the PutSQL processor and/or Phoenix, which could help others give a more tailored answer to your question.
05-15-2020
07:15 PM
Hello! If I insert a string containing a single quote (') or a double quote ("), PutSQL to Phoenix returns syntax errors. How should I solve this?
05-06-2020
09:59 AM
1 Kudo
@rahulsharma,
The View solution in the original post option is more helpful when a discussion goes beyond one or two pages. For example, if somebody marks a post on page 2 as the solution, clicking on View solution in the original post will bring you back to the first page, under the original question.
We understand that this can be confusing. Hopefully, this explanation should help.
Regards,
Vidya
05-04-2020
10:38 AM
1 Kudo
The EFM (Edge Flow Manager) makes it super simple to write flows for MiNiFi to execute wherever it may be located (laptops, refineries, phones, OpenShift, etc.). All agents (MiNiFi) are assigned an agent class. Once an agent is turned on, it phones home to EFM for run-time instructions. The run-time instructions are set at the class level, meaning all agents within a class run the same instruction (flow) set. There can be zero to many classes. In this example, I will capture Windows Security Events via MiNiFi and ship them to NiFi over Site-to-Site (S2S).
Download the MiNiFi MSI and set the class name. In this example, I set the class name to test6. This property is set at install time (MSI) or by going directly into minifi.properties. Also notice the setting nifi.c2.enable=true; this informs MiNiFi that run-time flow instructions will be received from EFM (the relevant properties are sketched below). Start MiNiFi.
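The minifi.properties entries involved look roughly like the following; the EFM host, port, and heartbeat period are example values, so check your own EFM deployment for the actual URLs:

```
nifi.c2.enable=true
nifi.c2.agent.class=test6
nifi.c2.rest.url=http://<efm-host>:10080/efm/api/c2-protocol/heartbeat
nifi.c2.rest.url.ack=http://<efm-host>:10080/efm/api/c2-protocol/acknowledge
nifi.c2.agent.heartbeat.period=5000
```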
MiNiFi can be configured to send data to multiple endpoints (e.g., Kafka, NiFi, Event Hub, etc.). In this example, data will be sent to NiFi over S2S. On NiFi, create an input port:
Capture the port ID. This will be used in EFM later on:
On EFM, open class test6. This is where we design the flow for all agents whose class is set to test6:
To capture Windows events via MiNiFi, add ConsumeWindowsEventLog processor to the canvas:
Configure the processor to pull events. In this example, MiNiFi will listen for Windows Security Events:
To send data from MiNiFi to NiFi, add Remote Process Group to the canvas. Provide a NiFi endpoint:
Connect ConsumeWindowsEventLog processor to the Remote Process Group. Provide the NiFi Input Port ID captured earlier:
Flow is ready to publish:
Click on Publish. MiNiFi will phone home at a set interval (nifi.c2.agent.heartbeat.period). Once that occurs, MiNiFi will receive the new run-time flow instructions, and data will start flowing into NiFi.
The EFM makes it super simple to capture Windows events and universally ship them anywhere, without the ball and chain of most agent/platform designs.
04-20-2020
11:25 PM
How do you debug the scripts? I used bash -x tpcds-setup.sh but could not find the error, and when I used your method, it also reported errors.