Member since: 09-01-2020
Posts: 321
Kudos Received: 24
Solutions: 10
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2136 | 10-22-2024 11:56 AM |
| | 3444 | 09-23-2024 11:55 PM |
| | 3495 | 09-23-2024 11:35 PM |
| | 1714 | 03-04-2024 07:58 AM |
| | 3374 | 11-15-2023 07:50 AM |
02-06-2025
12:37 AM
Hello @thoufeeq1218,

We understand that you have configured spark.blockManager.driver.port and spark.blockManager.port, but Spark may still attempt to use random ports for the following reasons:

- Why is Spark using random ports? Spark uses additional ports beyond the BlockManager for communication between the driver and executors. The random port you observed (59698) is from the ephemeral port range (1024–65535) and could be assigned via spark.driver.port (default: random) or spark.executor.port (default: random).
- How to restrict Spark to specific ports: Explicitly set spark.driver.port so the driver listens on a fixed port, e.g. --conf spark.driver.port=21800. Also ensure that the port set in spark.blockManager.port is available; if 21700 is occupied, Spark will fall back to a random port.
- Understanding spark.port.maxRetries: If spark.port.maxRetries > 0 (default: 16), Spark will try that many successive ports above the configured one. If spark.port.maxRetries = 0, Spark will fail immediately when the specified port is unavailable.
- Executors and dynamic ports: Executors start and stop dynamically and may request random ports.

If you must prevent Spark from using random ephemeral ports, use the following settings:

--conf spark.driver.port=21800
--conf spark.blockManager.driver.port=21750
--conf spark.blockManager.port=21700
--conf spark.executor.port=21810
--conf spark.port.maxRetries=0

These settings can be applied at the job level or in the Spark configuration file. Note: if a port is already in use by a running job, a new job may fail due to a port conflict.

If you found this response helpful, please take a moment to log in and click on KUDOS 🙂 & "Accept as Solution" below this post. Thank you.
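For reference, the settings above can be combined into a single spark-submit invocation. This is only a sketch: the port numbers follow the example values in this thread, and the main class and jar path are placeholders for your own application.

```shell
# Sketch: pin Spark's driver, block manager, and executor ports so that
# no ephemeral ports are used. Port numbers are the example values from
# this thread; the class name and jar path are placeholders.
spark-submit \
  --conf spark.driver.port=21800 \
  --conf spark.blockManager.driver.port=21750 \
  --conf spark.blockManager.port=21700 \
  --conf spark.executor.port=21810 \
  --conf spark.port.maxRetries=0 \
  --class com.example.MyApp \
  /path/to/your-app.jar
```

With spark.port.maxRetries=0 the job fails fast on a port conflict instead of silently moving to another port, which makes firewall rules predictable.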
10-22-2024
11:56 AM
2 Kudos
Hello @amru,

We need the complete log file or a relevant log snippet to understand why Kafka is failing after enabling Ranger. In the meantime, you should consider checking the following:

- Check Ranger's audit logs to see whether any access requests were denied; this can help you pinpoint the issue and the specific resource where permissions are lacking.
- Check the Ranger logs for any errors or issues.
- Ensure that the Kafka policies are synced successfully.
- If you are using AD/LDAP, ensure that all users are properly synced with Ranger.

Thank you.
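If audits are also spooled to local files on the broker hosts, a quick grep can surface denied requests. This is a sketch under assumptions: the audit file path varies by deployment (audits are often stored in Solr and browsed from the Ranger Admin UI instead), and the "result":0 field meaning "denied" reflects the common Ranger audit JSON layout.

```shell
# Sketch: look for denied access events in local Ranger audit files.
# The path is an assumption and varies by deployment; in many clusters
# audits live in Solr and are viewed from the Ranger Admin UI instead.
# In Ranger audit JSON, "result":0 commonly indicates a denied request.
grep -rh '"result":0' /var/log/kafka/audit/ | tail -n 20
```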
09-25-2024
06:02 AM
1 Kudo
Hello @Israr , You should verify that both Kafka and Kafka Connect are running and in a healthy state. Thank you.
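As a quick health check, the Kafka Connect REST interface can be queried directly. This is a sketch: the hostname is a placeholder, port 8083 is the common default and may differ in your deployment, and a secured cluster will need TLS/authentication options on the curl calls.

```shell
# Sketch: query the Kafka Connect REST API (default port 8083).
# Hostname is a placeholder; add TLS/auth options on secured clusters.

# Root endpoint returns the worker version and the Kafka cluster id.
curl -s http://connect-host.example.com:8083/

# Lists the connectors currently deployed on this Connect cluster.
curl -s http://connect-host.example.com:8083/connectors
```

If both calls respond, the Connect workers are up and reachable; an empty connector list or a connection refusal narrows down where the problem is.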
09-23-2024
11:55 PM
1 Kudo
Hello @ayukus0705,

A] "I am looking for an option where we can directly read those hexadecimal escape sequences (i.e., ReportV10\x00\x00\x00\x00\x02\x02\x02) as-is into my Spark dataframe."
>> You will have to make sure that the escape sequences are treated as raw binary data or strings, without any automatic decoding or transformation. Following is an example that reads the file as binary:

val df = spark.read.format("binaryFile").load("path of your file here")

B] Alternatively, you can use the HBase Spark connector to load the data as binary. When using the HBase Spark connector, no automatic decoding or transformation into another format is applied. Refer to the following docs for more details:

Private Cloud: https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/accessing-hbase/topics/hbase-example-using-hbase-spark-connector.html
Public Cloud: https://docs.cloudera.com/runtime/7.2.18/accessing-hbase/topics/hbase-using-hbase-spark-connector.html

If you found this response helpful, please take a moment to log in and click on KUDOS 🙂 & "Accept as Solution" below this post. Thank you.
09-23-2024
11:35 PM
2 Kudos
Hello @Israr,

Cloudera provides straightforward options to configure such a setup through Cloudera Manager (CM) and Streams Messaging Manager (SMM) [1]. You can configure this pipeline with the HDFS or Stateless NiFi Source and Sink connectors [2] [3].

[1] https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/monitoring-kafka-connect/topics/smm-creating-a-connector.html
[2] Public Cloud: https://docs.cloudera.com/runtime/7.2.18/kafka-connect/topics/kafka-connect-connector-nifi-stateless.html
[3] Private Cloud: https://docs.cloudera.com/cdp-private-cloud-base/latest/kafka-connect/topics/kafka-connect-connector-nifi-stateless.html
HDFS: https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/kafka-connect/kafka-connect.pdf

If you found this response helpful, please take a moment to log in and click on KUDOS 🙂 & "Accept as Solution" below this post. Thank you.
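Besides the SMM UI, a connector can also be deployed through the Kafka Connect REST API. This is only a sketch: the connector class and every property value below are placeholders; take the real class name and configuration keys from the Cloudera connector documentation linked above or from the SMM "New Connector" form.

```shell
# Sketch: deploy a sink connector via the Kafka Connect REST API.
# All names and values are placeholders; use the real connector class
# and properties from the Cloudera docs or the SMM connector form.
curl -s -X POST http://connect-host.example.com:8083/connectors \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "example-sink",
        "config": {
          "connector.class": "<HDFS or Stateless NiFi sink connector class>",
          "topics": "my-topic",
          "tasks.max": "1"
        }
      }'
```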
03-06-2024
02:58 AM
1 Kudo
Hello @BrianChan,

For such issues, we should check the health of the consumer offsets topic (__consumer_offsets) using the Kafka describe command, and check the min.insync.replicas setting of this topic in the describe output. It should be less than or equal to the topic's ISR count. For example, if the topic has a replication factor of 3, then min.insync.replicas should be 2 (or 1) to allow failover.

If you found this response helpful, please take a moment to log in and click on KUDOS 🙂 & "Accept as Solution" below this post. Thank you.
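The check above can be run with the kafka-topics CLI. This is a sketch: the broker address is a placeholder, and a secured cluster needs an additional client properties file passed via --command-config.

```shell
# Sketch: inspect replication factor, ISR, and min.insync.replicas for
# the internal consumer offsets topic. Broker address is a placeholder.
kafka-topics --bootstrap-server broker1.example.com:9092 \
  --describe --topic __consumer_offsets
```

In the output, compare the Isr list length per partition against the topic's min.insync.replicas (shown under Configs); partitions whose ISR count drops below it will reject acks=all produces, including consumer offset commits.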
03-06-2024
02:44 AM
2 Kudos
Hello @hegdemahendra,

1) Please refer to the following article on connecting to Kafka from NiFi: https://community.cloudera.com/t5/Community-Articles/Integrating-Apache-NiFi-and-Apache-Kafka/ta-p/247433
2) Also, to isolate the issue, you can try connecting to Kafka with the same settings from the NiFi node using the Kafka command-line tools.

Please let us know if you still have any questions or are facing any issues; we will be happy to assist you.

If you found this response helpful, please take a moment to log in and click on KUDOS 🙂 & "Accept as Solution" below this post. Thank you.
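Step 2 can be done with the console producer and consumer shipped with Kafka. This is a sketch: the broker address and topic name are placeholders, and a secured cluster needs a client properties file (e.g. via --producer.config / --consumer.config) matching the NiFi processor's security settings.

```shell
# Sketch: verify broker connectivity from the NiFi node using the Kafka
# CLI tools. Broker address and topic are placeholders; on a secured
# cluster, pass the same security settings NiFi uses via a config file.

# Type a few messages, then Ctrl+C:
kafka-console-producer --bootstrap-server broker1.example.com:9092 \
  --topic test-topic

# Read them back:
kafka-console-consumer --bootstrap-server broker1.example.com:9092 \
  --topic test-topic --from-beginning
```

If the CLI works from the NiFi node but the processor does not, the problem is in the NiFi processor configuration rather than in network reachability or broker health.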
03-04-2024
07:58 AM
1 Kudo
Hello @steinsgate,

CDP Private Cloud Data Services uses a dedicated OCP cluster only, so it does not affect other services.

If you found this response helpful, please take a moment to log in and click on KUDOS 🙂 & "Accept as Solution" below this post. Thank you.
12-29-2023
12:48 AM
1 Kudo
Hello @StanislavJ,

The Linux kernel parameter vm.swappiness is a value from 0-100 that controls the swapping of application data (as anonymous pages) from physical memory to virtual memory on disk. The higher the value, the more aggressively inactive processes are swapped out of physical memory; the lower the value, the less they are swapped, forcing filesystem buffers to be emptied.

On most systems, vm.swappiness is set to 60 by default. This is not suitable for Hadoop clusters, because processes are sometimes swapped even when enough memory is available. That can cause lengthy garbage collection pauses for important system daemons, affecting stability and performance. Cloudera recommends setting vm.swappiness to a value between 1 and 10, preferably 1, for minimum swapping on systems where the RHEL kernel is 2.6.32-642.el6 or higher.

To view your current setting for vm.swappiness, run:
cat /proc/sys/vm/swappiness
To set vm.swappiness to 1, run:
sudo sysctl -w vm.swappiness=1

To give an overview of the alerting side: swapping alerts are generated in Cloudera Manager when host swapping or role process swap usage exceeds a defined threshold. A Warning threshold of "500 MiB" means that any swap usage beyond this on a given host generates a warning alert; a Critical threshold of "Any" generates a critical alert even if a small amount of swapping occurs. The swap memory usage threshold can be set at the host level or at the process/service level.

To set the threshold at the process level:
From the CM UI >> Clusters >> YARN >> Configuration >> search for "Process Swap Memory Thresholds" >> (for the ResourceManager) set Warning and Critical >> select Specify >> enter the value (in Bytes/KB/MB/GB) >> Save Changes.

You can increase the value and then monitor the cluster's swap usage, adjusting the thresholds accordingly.
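Note that sysctl -w changes the value only until the next reboot. To make the setting persistent (a standard Linux step, not Cloudera-specific; the drop-in file name is just a convention), write it to a sysctl configuration file:

```shell
# Make the swappiness setting survive reboots. The drop-in file name
# under /etc/sysctl.d/ is a convention; any *.conf name works.
echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-swappiness.conf

# Apply the file immediately without rebooting.
sudo sysctl -p /etc/sysctl.d/99-swappiness.conf
```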
If you found this response helpful, please take a moment to log in and click on KUDOS 🙂 & "Accept as Solution" below this post. Thank you, Babasaheb Jagtap
11-15-2023
07:50 AM
1 Kudo
Hello @one4like,

Pushing every local file of a job to HDFS would cause issues, especially in larger clusters. Local directories are used as a scratch location: mapper spills are written there, and moving that traffic over the network would have performance impacts. Scratch and shuffle files are stored locally precisely to prevent this. It also has security impacts, as the NodeManager would then push the keys for each application onto a network location that could be accessible to others.

A far better solution is to use the fact that yarn.nodemanager.local-dirs can point to multiple mount points, thus spreading the load over all of them.

So the answer is NO: local-dirs must contain a list of local paths. There is an explicit check in the code that only allows the local filesystem to be used; see: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java#L224 Please note that an exception is thrown when a non-local file system is referenced.

If you found this response helpful, please take a moment to log in and click on KUDOS 🙂 & "Accept as Solution" below this post. Thank you. Bjagtap
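For reference, spreading the load over multiple mount points is a yarn-site.xml setting. This is a config sketch: the mount point paths are placeholders for your own disks, and each should be on a separate physical device to actually spread the I/O.

```xml
<!-- Sketch: yarn-site.xml with multiple local scratch directories.
     The paths are placeholders; use one directory per physical disk. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/yarn/nm,/data2/yarn/nm,/data3/yarn/nm</value>
</property>
```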