Member since: 09-15-2018
Posts: 61
Kudos Received: 6
Solutions: 7

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1854 | 04-17-2020 08:40 AM |
| | 9361 | 04-14-2020 04:45 AM |
| | 1211 | 04-14-2020 03:12 AM |
| | 967 | 10-17-2019 04:47 AM |
| | 1351 | 10-17-2019 04:33 AM |
04-18-2020
06:27 AM
Hey @kaf, Thanks for reaching out to the Cloudera community. You can use the "tail" command and pipe its output to the Kafka console producer if you want to read the whole file and then continue tailing subsequently appended lines:

$ tail -f -n +1 <filename> | kafka-console-producer --broker-list <Broker_Host>:9092 --topic <topic_name>

Let me know if this helps.
04-17-2020
08:40 AM
Hey @sharathkumar13, Thanks for reaching out to the Cloudera community.

>> You can refer to the mentioned Git repo[1] for information on a Kafka exporter for Prometheus to use alongside Kafka Manager. A quick launch sketch is shown below.
[1] https://github.com/danielqsj/kafka_exporter

>> I would also like to share information on SMM[2]. Streams Messaging Manager is an operations monitoring and management tool from Cloudera that provides end-to-end visibility into an enterprise Apache Kafka environment. With SMM, you can gain clear insights about your Kafka clusters and understand the end-to-end flow of message streams from producers to topics to consumers.
[2] https://docs.cloudera.com/csp/2.0.1/smm-overview/topics/smm-overview.html
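For illustration, a minimal launch sketch for the exporter (this assumes the kafka_exporter binary from the repo above is installed on a host that can reach your broker; the host and port are placeholders):

$ kafka_exporter --kafka.server=<Broker_Host>:9092

Prometheus can then scrape the exporter's metrics endpoint (the project's default listen port is 9308). Let me know if this helps.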
04-17-2020
07:06 AM
Hey @rishav1412, Thanks for reaching out to the Cloudera community. I don't think there is a single way/process/configuration in Kafka to stream data from all social media platforms; every platform has its own APIs/methods and policies on data streaming. If you want to stream data from Twitter, you can use any of the mentioned pipelines to send data from Twitter to Kafka topics (a Flume configuration sketch follows the list):

Twitter >> Kafka Connect (Kafka Connect Twitter) >> Kafka topics
Twitter >> Flume (org.apache.flume.source.twitter.TwitterSource) >> Kafka topics
Twitter >> NiFi (GetTwitter processor) >> Kafka topics
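As a hedged sketch of the Flume route (the agent, channel, and sink names are illustrative; the OAuth keys, broker host, and topic are placeholders you must supply):

# flume.conf: Twitter source >> memory channel >> Kafka sink
a1.sources = twitter-src
a1.channels = mem-ch
a1.sinks = kafka-sink

a1.sources.twitter-src.type = org.apache.flume.source.twitter.TwitterSource
a1.sources.twitter-src.consumerKey = <Consumer_Key>
a1.sources.twitter-src.consumerSecret = <Consumer_Secret>
a1.sources.twitter-src.accessToken = <Access_Token>
a1.sources.twitter-src.accessTokenSecret = <Access_Token_Secret>
a1.sources.twitter-src.channels = mem-ch

a1.channels.mem-ch.type = memory

a1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafka-sink.kafka.bootstrap.servers = <Broker_Host>:9092
a1.sinks.kafka-sink.kafka.topic = <topic_name>
a1.sinks.kafka-sink.channel = mem-ch

Let me know if this helps.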
04-17-2020
06:46 AM
Hey @Manoj690, Thanks for reaching out to the Cloudera community. You can execute a PUT request against the mentioned path "/connectors/<Connector_name>/config" to update the configuration of an existing connector, passing a JSON object with the configuration parameters in the request body. Example request:

PUT /connectors/<Connector_name>/config
Accept: application/json

{
  "flush.size": "100",
  "rotate.interval.ms": "1000"
}
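The same request with curl (a sketch assuming the Kafka Connect REST endpoint is on its default port 8083; the host and payload are placeholders). Note that PUT .../config replaces the connector's entire configuration, so include all settings, not just the changed ones:

$ curl -X PUT -H "Content-Type: application/json" \
    --data '{"flush.size": "100", "rotate.interval.ms": "1000"}' \
    http://<Connect_Host>:8083/connectors/<Connector_name>/config

Let me know if this helps.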
04-15-2020
08:15 AM
Hey @saihadoop, Thanks for reaching out to the Cloudera community. After setting up the cluster infrastructure and installing CDH & CM, you can use the Cloudera Manager API[1] to back up the Cloudera Manager configuration of the existing cluster and restore it to the new cluster.
[1] https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_intro_api.html#concept_dnn_cr5_mr
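As a rough sketch using the API's /cm/deployment endpoint (v19 is only an example API version, so check http://<CM_Host>:7180/api/version for what your CM supports; hosts and credentials are placeholders):

# Export the full CM deployment (clusters, services, roles, configs) to JSON
$ curl -u <admin_user>:<password> "http://<CM_Host>:7180/api/v19/cm/deployment" > cm-deployment.json

# Restore it into the new CM instance (this replaces the target's current deployment)
$ curl -u <admin_user>:<password> -X PUT -H "Content-Type: application/json" \
    -d @cm-deployment.json \
    "http://<New_CM_Host>:7180/api/v19/cm/deployment?deleteCurrentDeployment=true"

Let me know if this helps. Cheers,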
04-14-2020
08:44 AM
Hey @AndyTech, Thanks for reaching out to the Cloudera community. The commit-id mentioned here isn't related to any Kafka usage terms such as 'commit offsets'; it refers to the Kafka source commit from which the client was built. It is not an error, just an info message, and it doesn't impact the Kafka client's functionality in any way. Let me know if this helps. Cheers,
04-14-2020
05:45 AM
Hey @AndyTech, Thanks for reaching out to the Cloudera community. This issue is due to the missing "kafka-python" module in your Python installation. You have to manually install the "kafka-python" module using the mentioned command on the edge node and on all the hosts on which the Spark job executes:

$ pip install kafka-python
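To confirm the module is visible to the Python interpreter on each host, a quick check (kafka-python exposes its version as kafka.__version__):

$ python -c "import kafka; print(kafka.__version__)"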
04-14-2020
05:31 AM
Hey @sharathkumar13, Thanks for reaching out to the Cloudera community. Can you clarify what you mean by "Do we have options to do ?"? Are you looking to use Prometheus and Grafana to monitor the Kafka service?
04-14-2020
05:21 AM
Hey @no1, Thanks for reaching out to the Cloudera community. Edge nodes have all the necessary libraries, client components, and the current cluster configuration required to communicate with the CDH cluster. Setting up an edge node without using Cloudera Manager involves various limitations and manual tasks, and it also depends on the service (Spark, Kafka, Hive) you are going to communicate with from this edge node. As far as I know, Cloudera doesn't provide step-by-step documentation to perform this action. Configuring the edge node using CM helps you distribute all the required binaries and cluster configurations, and pushes any change/modification during upgrades or configuration changes.
04-14-2020
04:45 AM
1 Kudo
Hey @GTA, Thanks for reaching out to the Cloudera community.

"Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1024 MB) of this cluster!"

>> This issue occurs when the total memory required to run a Spark executor in a container (the executor memory, spark.executor.memory, plus the executor memory overhead, spark.yarn.executor.memoryOverhead) exceeds the memory available for running containers on the NodeManager node (yarn.nodemanager.resource.memory-mb).

Based on the above exception, you have the default 1 GB configured for a Spark executor and the default overhead of 384 MB, so the total memory required to run the container is 1024 MB + 384 MB = 1408 MB. As the NodeManager was configured with too little memory to run even a single container (only 1024 MB), this resulted in a valid exception. Increasing the NodeManager setting from 1251 MB to 2048 MB will definitely allow a single container to run on the node.

Use the mentioned steps to increase the "yarn.nodemanager.resource.memory-mb" parameter and resolve this:

Cloudera Manager >> YARN >> Configuration >> Search "yarn.nodemanager.resource.memory-mb" >> Configure 2048 MB or higher >> Save & Restart.

An alternative per-job sketch is shown below.
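Alternatively, you can shrink the executor so that executor memory plus overhead fits under yarn.nodemanager.resource.memory-mb (an illustrative sketch only; the values are examples and the application file is a placeholder):

$ spark-submit \
    --conf spark.executor.memory=512m \
    --conf spark.yarn.executor.memoryOverhead=384 \
    <your_application>.py

Here 512 MB + 384 MB = 896 MB, which fits within the 1024 MB container limit from the exception. Let me know if this helps.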
04-14-2020
03:43 AM
1 Kudo
Hey @Deep_live, Apologies, I'm unable to locate a cached/archived CDH 5.x QuickStart VM image.
04-14-2020
03:27 AM
Hey @ping_pong, Thanks for reaching out to the Cloudera community. Do you have TLS enabled for this CDH cluster? What steps did you follow to add the new host to this CDH cluster? After installing all the required parcels/packages, did you start the Cloudera Manager agent using the mentioned command?

$ sudo service cloudera-scm-agent start
04-14-2020
03:12 AM
2 Kudos
Hey, The Cloudera QuickStart VM has been discontinued for CDH 5.x & 6.x by Cloudera.

>> You can try the Cloudera Docker image available publicly at https://hub.docker.com/r/cloudera/quickstart, or simply run the below command to download it on a Docker-enabled system:

$ docker pull cloudera/quickstart
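Once pulled, you can start a container along these lines (a sketch based on the image's published usage; add or adjust port mappings for your setup):

$ docker run --hostname=quickstart.cloudera --privileged=true -t -i \
    -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart

The -p 8888:8888 flag publishes the Hue port; add more -p flags for other UIs as needed.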
04-14-2020
03:08 AM
Hey, If you have an existing subscription for HDP products, try logging in using your existing HDP credentials. If not, try registering on the Cloudera portal. For learning and development purposes, you can try using the Hortonworks Sandbox.
11-01-2019
06:31 AM
1 Kudo
Hey, CSD Version: 2.3 & higher, I think. Regards, Ankit.
10-18-2019
05:52 AM
Hey, Thank you for sharing the outcome and the steps. Much appreciated. Regards.
10-17-2019
06:52 AM
Hey, You might encounter this exception if the jline version isn't in sync with the Scala version. What is your current Scala version? Regards, Ankit.
10-17-2019
04:58 AM
Hey, Refer to the mentioned article[1] for step-by-step instructions on installing Spark2 on Cloudera's QuickStart VM.
[1] https://blog.clairvoyantsoft.com/installing-spark2-on-clouderas-quickstart-vm-bbf0db5fb3a9
Please let me know if this helps. Regards, Ankit.
10-17-2019
04:53 AM
Hey, Refer to the mentioned Cloudera documentation[1] on "Configuring the Flume Properties File".
[1] https://docs.cloudera.com/documentation/enterprise/5-14-x/topics/cdh_ig_flume_config.html
Please let me know if this helps. Regards, Ankit.
10-17-2019
04:47 AM
Hey, Optimizing your Kafka cluster depends on your cluster usage and use case. Based on your main concern (throughput, CPU utilization, or memory/disk usage), you need to modify different parameters, and some changes may have an impact on other aspects. For example, if acks is set to "all", all brokers that replicate the partitions need to acknowledge that the data was written before the next message is confirmed. This ensures data consistency but increases CPU utilization and network latency; a producer-side sketch is shown below. Refer to the "Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)" article[1] written by Jay Kreps (co-founder and CEO at Confluent).
[1] https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
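For illustration, these are standard Kafka producer settings that trade durability against throughput (the values are placeholders to tune for your workload, not recommendations):

# producer.properties (illustrative values)
# Wait for all in-sync replicas to acknowledge: safest, but highest latency
acks=all
# A small batching delay can raise throughput
linger.ms=5
# Maximum bytes per partition batch
batch.size=16384
# Less network/disk usage at some CPU cost
compression.type=snappy

Please let me know if this helps. Regards, Ankit.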
10-17-2019
04:33 AM
Hey, Can you please try setting the SPARK_HOME environment variable to the location indicated by the readlink command, then launch pyspark and check whether it shows Spark 2.x as the version? For example:

export SPARK_HOME=/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2

By setting SPARK_HOME to the Spark 2 lib folder, pyspark2 will then launch and show Spark 2.3.0.cloudera3 as the Spark version. Please let me know if this helps. Regards, Ankit.
09-03-2019
04:03 AM
Hey, Can you try adding "advertised.listeners" in the Kafka server.properties?

advertised.listeners=PLAINTEXT://node1:6667
02-13-2019
09:17 PM
Hello Wert, As per the information provided, you have mentioned the free space available:

16.09% free (3.2 GiB) in /var/lib/cloudera-host-monitor
35.25% free (4.9 GiB) in /var/log/cloudera-scm-eventserver
35.81% free (5.0 GiB) in /var/log/cloudera-scm-alertpublisher

This explains the alert for low disk space. The data in "/var/lib/cloudera-[host|service]-monitor" is the sum total of the working data for these respective services: time-series metrics and health data, kept in the Time-Series Storage (firehose_time_series_storage_bytes, 10 GB default, 10 GB minimum).

My suggestions:

1.) Change the default directory ("/var/lib/cloudera-[host|service]-monitor") to some other location in your environment with enough space:
>> Stop the service (Service Monitor or Host Monitor).
>> Save your old data and then copy the current directory to the new directory (optional; only if you need the old data).
>> Update the Storage Directory configuration option (firehose.storage.base.directory) on the corresponding role configuration page.
>> Start the Service Monitor or Host Monitor.

2.) If the data available in "/var/lib/cloudera-host-monitor" is not of much importance, you can remove the data manually, but this is not a recommended step. Your health statuses will be Unknown or Bad for a short time and you will lose all charts in the UI while the time-series store is rebuilt and repopulated (because you are deleting ALL the historical metrics). This shouldn't have an impact on any service, however.

3.) Either add more disk to the cluster or remove unused/unnecessary files on the disk to free up some space.

Regards.
02-11-2019
05:31 AM
Hello, With the "firehose_time_series_storage_bytes" parameter in Cloudera Manager, we can control the approximate amount of disk space dedicated to storing time-series and health data. Once the store has reached its maximum size, older data is deleted to make room for newer data. The disk usage is approximate because data is deleted only when the limit is reached, so configuring the retention based on time seems unlikely. However, you can write a shell script to remove the data every 7 days from the Service Monitor Storage Directory (a hedged sketch is shown below). By default, the data is stored in /var/lib/cloudera-service-monitor/ on the Service Monitor host; you can change this by modifying the Service Monitor Storage Directory configuration (firehose.storage.base.directory). Note that this manual-removal step is not recommended by Cloudera.
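For illustration only (an untested sketch; deleting Service Monitor data discards historical metrics and is not recommended by Cloudera, so stop the Service Monitor role first and adjust the path if you changed firehose.storage.base.directory):

#!/bin/sh
# Remove Service Monitor time-series files not modified in the last 7 days.
# Could be run from cron, e.g.: 0 2 * * 0 /usr/local/bin/prune-smon.sh
find /var/lib/cloudera-service-monitor -type f -mtime +7 -delete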
02-11-2019
05:10 AM
Hello,

Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.10.10.10:8485, 10.10.10.11:8485, 10.10.10.09:8485], stream=null))

As you can see above, the IPs show up instead of the FQDNs, which means that when the JournalNode tried to resolve the IPs to FQDNs via DNS, it failed. It looks like your DNS has issues resolving these IPs. Please look into it.
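A quick way to check (assuming the host command is available; nslookup or dig -x work similarly) is to confirm reverse lookup from the NameNode/JournalNode hosts:

$ host 10.10.10.10
# Should return the JournalNode's FQDN; an NXDOMAIN or timeout points to the DNS issue.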
02-07-2019
07:48 PM
Hello, Kafka Connect is included with the Cloudera Distribution of Apache Kafka 2.0.0 but is not supported at this time. Cloudera recommends using Flume and Sqoop as proven solutions for batch and real-time data loading that complement Kafka's message broker capability[2]. Kindly refer to the mentioned link[1] for more information.
[1] https://www.cloudera.com/documentation/kafka/latest/topics/kafka_known_issues.html#xd_583c10bfdbd326ba-590cb1d1-149e9ca9886--6fcb__section_ens_4bf_55
[2] https://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/
02-05-2019
08:28 PM
Hello,

1. Writing streaming aggregations to files

In order to use append mode with aggregations, you need to set an event-time watermark (using "withWatermark"). Otherwise, Spark doesn't know when to output an aggregation result as "final". A watermark is a threshold that specifies how long the system waits for late events. For example:

df2 = df1.filter("code > 300").select("agent").withWatermark("timestamp", "2 minutes").groupBy("agent").count()

2. Reading from Kafka (consumer) using streaming

You have to set the SPARK_KAFKA_VERSION environment variable. When running jobs that require the new Kafka integration, set SPARK_KAFKA_VERSION=0.10 in the shell before launching spark-submit:

# Set the environment variable for the duration of your shell session:
export SPARK_KAFKA_VERSION=0.10
spark-submit <arguments>

https://www.cloudera.com/documentation/spark2/latest/topics/spark2_kafka.html
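Putting the two together, a minimal PySpark sketch (broker host, topic, and output paths are placeholders; a time window is added to the grouping so that append mode can emit finalized results):

from pyspark.sql.functions import window

# Read from Kafka (requires SPARK_KAFKA_VERSION=0.10 as noted above)
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<Broker_Host>:9092")
      .option("subscribe", "<topic_name>")
      .load())

# Aggregate with an event-time watermark so append mode knows when rows are final
counts = (df.selectExpr("CAST(value AS STRING) AS agent", "timestamp")
          .withWatermark("timestamp", "2 minutes")
          .groupBy(window("timestamp", "5 minutes"), "agent")
          .count())

# Write finalized window results to files
query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/tmp/stream-output")
         .option("checkpointLocation", "/tmp/stream-checkpoint")
         .start())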
02-04-2019
02:22 AM
Hello, Monitoring consumer group lag using Cloudera Manager seems unlikely; I tried configuring a chart to display the consumer group lag but couldn't generate the desired results. However, on further research, I came across a few GitHub projects that provide additional monitoring functionality. One of them is Kafka Manager (Yahoo, Apache 2.0 license); I think with this tool you can monitor consumer group lag. A stock-CLI alternative is also sketched below. Please refer to the mentioned link[1] for more information on Kafka Manager.
[1] https://github.com/yahoo/kafka-manager
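For ad-hoc checks, the standard Kafka CLI also reports lag (broker host and group name are placeholders):

$ kafka-consumer-groups --bootstrap-server <Broker_Host>:9092 \
    --describe --group <group_name>
# The LAG column shows, per partition, how far the group is behind the log end offset.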
02-03-2019
05:09 AM
1 Kudo
Hello, Loading data directly to Kafka without any service seems unlikely. However, you can execute a simple Kafka console producer to send all your data to the Kafka service. But if your requirement is to save data to HDFS, you need to include a few more services along with Kafka. For example:

Crawlers >> Kafka console producer (or) Spark Streaming >> Flume >> HDFS

As your requirement is to store the data in HDFS and not to stream it, I suggest you execute a Spark job to store your data in HDFS. Initiate a spark-shell, then execute the mentioned commands in the same order:

val moveFile = sc.textFile("file:///path/to/Sample.log")
moveFile.saveAsTextFile("hdfs:///tmp/Sample.log")