Member since: 01-12-2022 | Posts: 81 | Kudos Received: 10 | Solutions: 0
01-21-2024
04:49 PM
Controller services referenced by the selected process group but defined outside its scope. For example, you create a controller service that is used by multiple PGs/flows, including the one in question, and you want it to be included in the flow definition output.
11-15-2023
02:06 AM
In the Cloudera Data Platform (CDP), there are two options for fault tolerance in Flink: checkpoints and savepoints [1]. Checkpoints are created automatically when enabled and are used for automatically restarting jobs in case of failure. Savepoints, on the other hand, are manually triggered, are always stored externally, and are used for starting a "new" job with a previous internal state. Savepoints are what you use when performing operations such as an upgrade. [2]
[1] Checkpoints vs. Savepoints
[2] Savepoint
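For reference, a minimal sketch of how a savepoint is triggered manually and then used to start a new job with the Flink CLI; the job ID, savepoint directory, and JAR path are placeholders:
flink savepoint <jobId> hdfs:///savepoints/          # trigger a savepoint for the running job
flink run -s hdfs:///savepoints/savepoint-<id> <application>.jar   # start a new job from that savepoint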
CHECKPOINT ERROR RECOVERY
Further, at the checkpoint level, there are two layers of error recovery for Flink jobs: one on the YARN side and one on the Flink side.
On the YARN side, we have yarn.maximum-failed-containers. The YARN ApplicationMaster checks whether 100 (the default) containers have failed; once 100 containers have failed, the YARN application fails, without taking any Flink job parameters into consideration.
On the Flink side, if checkpointing is activated and no restart strategy has been configured, the fixed-delay strategy is used with Integer.MAX_VALUE restart attempts. If you have checkpointing enabled and maxNumberRestartAttempts=5, this Flink restart strategy [3] restarts the job up to 5 times; this is controlled by Flink, and when the limit of 5 is reached, the job fails.
[3] Fixed Delay Restart Strategy
Finally, to enable checkpoints and configure the restart strategy at the cluster level, you can add the following properties under CM > Flink Service > Configuration > Flink Service Advanced Configuration Snippet (Safety Valve) for flink-conf/flink-conf.yaml:
execution.checkpointing.interval=<value>
restart-strategy=exponential-delay or restart-strategy=fixed-delay (FixedDelayRestartBackoffTimeStrategy)
For the exponential delay strategy, see the configuration options documented under Exponential Delay Restart Strategy.
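For illustration, a combined safety-valve snippet might look like the following; the interval and backoff values are arbitrary examples, and the exponential-delay option names are the ones documented upstream for Flink:
execution.checkpointing.interval=60s
restart-strategy=exponential-delay
restart-strategy.exponential-delay.initial-backoff=10s
restart-strategy.exponential-delay.max-backoff=2min
restart-strategy.exponential-delay.backoff-multiplier=2.0
restart-strategy.exponential-delay.reset-backoff-threshold=10min
restart-strategy.exponential-delay.jitter-factor=0.1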
09-06-2023
02:36 AM
1 Kudo
LIMITATIONS OF TEMPLATES
If a user wants to change the flows using templates, i.e., downloading templates and uploading them back, then the following issues will occur:
The template will not preserve the state. It is not supposed to. The template is just a blueprint of the components, not their data.
New controller services will be created instead of using already existing ones, which is expected behavior in templates.
Templates will not be supported in CFM 4.0 / NiFi 2.0, and no active improvements are being made to them.
FLOW DEFINITION
The better solution in this case is to use flow definitions instead of templates. The steps to use flow definitions are:
Right-click on Process Group > Click on Download Flow Definition > select the option to download flow JSON with or without external controller services. “External” here means services referenced inside the PG, but existing at a higher level.
Then, when you import a flow definition (JSON) that has references to external services, it will not create new controller services; instead it automatically selects them from the higher level, as long as there is only one available choice with the same name and type. Meaning, if the parent level has two services with the same name and type, it won't know which one to select.
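For automation, the same download can be scripted against the NiFi REST API. The GET /process-groups/{id}/download endpoint exists in current NiFi 1.x releases; the includeReferencedServices query parameter is an assumption here and may only be present in newer versions, so verify it against the REST API documentation for your release:
curl -o flow-definition.json 'https://<nifi-host>:8443/nifi-api/process-groups/<pg-id>/download?includeReferencedServices=true'
(Add authentication, e.g. a bearer token or client certificate, as required in your environment.)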
08-20-2023
06:09 PM
Issue
During the installation of the latest version of mlflow for a custom runtime engine in CDSW 1.10.x, the workload still uses the version shipped by cmladdon rather than the version installed via pip.
For example, in one case, the error recorded was the following:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflict mlflow-cml-plugin 0.0.1 requires protobuf==3.19,4, but you have protobuf 4.24.0 which is incompatible.
onnx 1.12.0 requires protobuf<=3.20.1,>=3.12.2, but you have protobuf 4.24.0 which is incompatible.
Cause
The cmladdon feature was introduced in CDSW 1.10.0, and its content changed in 1.10.1. This cmladdon feature ships some Python libraries and is mounted in all workloads. The intent is user convenience, i.e., having some packages such as CML or mlflow installed out of the box.
Within workloads, this folder is mounted as read-only, and the same folder is mounted in all workloads.
Packages delivered via the cmladdon feature take precedence over pretty much everything else (packages installed in the runtime image, packages installed by the user).
Solution:
If the user wants to install and use a newer version of one of these bundled packages, the following workarounds are available. Any of the steps below removes the cmladdon folder from workloads, which removes all packages coming from cmladdon:
Define the PYTHONPATH environment variable at the workspace level (Site Administration > Runtime/Engine) or in each project (Project Settings > Advanced) and give it any value (an empty value is also acceptable). This overrides the code that injects the cmladdon folder into PYTHONPATH.
SSH onto the VFS pod. From there you can delete the content of cmladdon, as the VFS pod has write access to this folder. This workaround has to be repeated if the cluster is ever upgraded to a newer CDSW version.
At the start of every Python workload, modify the list sys.path so that the entry corresponding to cmladdon is removed or moved to the end of the list. The latter keeps the packages on sys.path, but user-installed packages will take precedence (see the sketch below).
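A minimal sketch of that last option, assuming the cmladdon mount can be recognized by a "cmladdon" substring in the path (inspect sys.path in your workload to find the actual folder name):
import sys

# Move every cmladdon entry to the end of sys.path so user/pip-installed
# packages are found first; use only remove() to drop the entries entirely.
for entry in [p for p in sys.path if "cmladdon" in p]:
    sys.path.remove(entry)
    sys.path.append(entry)

import mlflow  # now resolves to the pip-installed version
print(mlflow.__version__)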
07-09-2023
08:31 PM
2 Kudos
When starting, NiFi uses the default staging path, which is /home/nifi/. This path needs execute permission so that NiFi can extract and run its native libraries; otherwise NiFi fails to start. The error below indicates this issue:
Caused by: java.lang.UnsatisfiedLinkError: /home/nifi/.cache/JNA/temp/jna3526468256198020468.tmp: /home/nifi/.cache/JNA/temp/jna3526468256198020468.tmp: failed to map segment from shared object: Operation not permitted
Solution
In case of security hardening (noexec) on /home, you need to change the default staging directory in NiFi using the following parameters:
In CM > NiFi > Configuration > NiFi Node Advanced Configuration Snippet (Safety Valve) for staging/bootstrap.conf.xml, add the following:
Name: java.arg.jna
Value: -Djna.tmpdir=/NEWPATH
Then restart the NiFi service.
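As a quick check, and assuming /home is the hardened mount, the sketch below shows how to confirm the noexec flag and prepare an alternative staging path (/opt/nifi/staging is only an example; it must be owned by the nifi user and live on a filesystem mounted without noexec):
mount | grep ' /home '        # look for the noexec flag in the mount options
mkdir -p /opt/nifi/staging
chown nifi:nifi /opt/nifi/staging
The safety-valve entry above then results in a bootstrap.conf line of the form java.arg.jna=-Djna.tmpdir=/opt/nifi/staging.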
06-28-2023
01:01 AM
1 Kudo
After upgrading to CFM 2.1.5 SP1 from an older version, PublishKafkaRecord_1_0 stops working with the below error:
PublishKafka_1_0[id=36f1cffe-503d-3181-a562-ae216b5616b5] Processing halted: yielding [1 sec]
java.lang.IllegalArgumentException: SaslMechanism value [null] not found
at org.apache.nifi.kafka.shared.property.SaslMechanism.lambda$getSaslMechanism$1(SaslMechanism.java:57)
at java.base/java.util.Optional.orElseThrow(Optional.java:408)
at org.apache.nifi.kafka.shared.property.SaslMechanism.getSaslMechanism(SaslMechanism.java:57)
at org.apache.nifi.kafka.shared.login.DelegatingLoginConfigProvider.getConfiguration(DelegatingLoginConfigProvider.java:51)
at org.apache.nifi.kafka.shared.property.provider.StandardKafkaPropertyProvider.setSecurityProperties(StandardKafkaPropertyProvider.java:86)
at org.apache.nifi.kafka.shared.property.provider.StandardKafkaPropertyProvider.getProperties(StandardKafkaPropertyProvider.java:75)
This happens due to an improvement made in NIFI-10866.
Resolution
You can switch to a newer PublishKafkaRecord processor, such as PublishKafkaRecord_2_0 or PublishKafkaRecord_2_6; both eliminate the issue and also provide newer features compared to the older 1_0 processor.
06-28-2023
12:47 AM
Summary
Because Kafka works with many log segment files and network connections, the Maximum Process File Descriptors setting may need to be increased in some production deployments, particularly if a broker hosts many partitions.
Warnings observed
Concerning : Open file descriptors: 18,646. File descriptor limit: 32,768. Percentage in use: 56.90%. Warning threshold: 50.00%.
If you see such open file descriptor warnings for Kafka in CDP, you need to increase the file descriptor limit on the Kafka cluster. As per the guide [1], the following formula is used:
(number of partitions) * (partition size / segment size)
Details about these parameters are as follows:
a. Number of partitions: it can be seen in the Brokers tab of SMM. Alternatively, you can simply perform a file count of the Kafka data directory; you can find this directory in CM > Kafka > log.dirs.
b. Partition size: it is the size of the log.dirs directory; you can check the disk usage of log.dirs with the du -sh command to find this size.
c. Segment size: it can be found under CM > Kafka > Segment File Size.
Additionally, a simple alternative is to count the number of files in the Kafka data directory and set the file descriptor limit to 2-3 times that value (see the example below).
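For illustration only (the numbers are hypothetical): a broker hosting 1,000 partitions with an average partition size of 10 GB and a 1 GB segment size needs roughly 1,000 * (10 GB / 1 GB) = 10,000 descriptors for segment files alone, so the 32,768 limit above leaves reasonable headroom. Assuming /var/local/kafka/data is the configured log.dirs, the inputs can be gathered on the broker with:
du -sh /var/local/kafka/data                 # partition size input for the formula
find /var/local/kafka/data -type f | wc -l   # file count; set the limit to 2-3 times this value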
[1] File descriptor limits
01-18-2023
01:58 AM
1 Kudo
If you need to migrate an existing CML workspace from its current NFS storage to new NFS storage, do the following:
Mount the existing NFS directory and the destination NFS directory and manually copy all the data.
Execute the projectCopy.sh script following the instructions in the README file. (Note: by default, the script only does the NFS swap; the script and README files are attached.)
Modify the NFS mount in the existing helm install:
Save the existing helm values
helm get values mlx-<namespace-name> -n <namespace-name> -o yaml > old.yaml
Modify the NFSServer and ProjectsExport in the old.yaml file.
Example: NFSServer: 10.102.47.132, ProjectsExport: /eng-ml-nfs-azure/ageorge → NFSServer: 10.102.47.134, ProjectsExport: /ageorge-netapp-volume-v3
In my case, I changed the NFS server from 10.102.47.132:/eng-ml-nfs-azure/ageorge to 10.102.47.134:/ageorge-netapp-volume-v3.
Get the release version (GitSHA) from old.yaml:
grep GitSHA old.yaml
Example output: GitSHA: 2.0.35-b1
Get the release chart cdsw-combined-<GitSHA>.tgz
This is available in the ‘dp-mlx-control-plane-app’ pod in the namespace, under the /app/service/resources/mlx-deploy/ folder.
Or contact Cloudera support to download the chart.
Delete the jobs and stateful sets (these will get recreated after the helm install)
kubectl --namespace <namespace-name> delete jobs --all
kubectl --namespace <namespace-name> delete statefulsets --all
Example:
kubectl --namespace ageorgeocbc3 delete jobs --all
kubectl --namespace ageorgeocbc3 delete statefulsets --all
Do a helm upgrade to the same release
helm upgrade mlx-<namespace-name> ./cdsw-combined-2.0.35-b1.tgz --install -f ./old.yaml --wait --namespace <namespace-name> --debug --timeout 1800s
Example:
helm upgrade mlx-ageorgeocbc3 ./cdsw-combined-2.0.35-b1.tgz --install -f ./old.yaml --wait --namespace ageorgeocbc3 --debug --timeout 1800s
After the migration is done, you will need to update the NFS mount information in the CML control plane DB.
Copy the workspace CRN (displayed for the workspace in the CML control plane).
If using an external DB, skip to the Updating the DB entry section below.
See the existing NFS configuration using the command below:
select filesystem_id from storage where id = (select storage_id from mlx_instance where crn='<workspace CRN>');
Example:
db-mlx=# select filesystem_id from storage where id = (select storage_id from mlx_instance where crn='crn:cdp:ml:us-west-1:48b08407-af73-4a06-8f81-c701d23d7a2a:workspace:c3c883dd-3b23-4561-abcd-0cc5f68ed377');
filesystem_id
-----------------------------------------
10.102.47.132:/eng-ml-nfs-azure/ageorge
Update the NFS configuration using the command:
update storage set filesystem_id='<new NFS>' where id = (select storage_id from mlx_instance where crn='<workspace CRN>');
Example:
db-mlx=# update storage set filesystem_id='<new NFS>' where id = (select storage_id from mlx_instance where crn='crn:cdp:ml:us-west-1:48b08407-af73-4a06-8f81-c701d23d7a2a:workspace:c3c883dd-3b23-4561-abcd-0cc5f68ed377');
UPDATE 1
db-mlx=# select filesystem_id from storage where id = (select storage_id from mlx_instance where crn='crn:cdp:ml:us-west-1:48b08407-af73-4a06-8f81-c701d23d7a2a:workspace:c3c883dd-3b23-4561-abcd-0cc5f68ed377');
filesystem_id
---------------
<new NFS>
01-09-2023
06:00 PM
1 Kudo
Why scale vertically?
Once a DL or DH is provisioned with given parameters (instance types, root volume size, etc.), there is currently no way in CDP/Cloudbreak to change those parameters later. However, over time the load may increase and it may not be possible to terminate and re-launch the DL or DH. In this case you may need to scale vertically via the Azure console. On the other hand, this scaling plays only partially with CDP features, so the long-term fix is rather to upgrade to a larger duty size.

Overview
Both the VM resize and the root disk resize require the VM to be in a stopped state, so you can stop the DH or DL and perform both steps together.

Scaling the virtual machine size
These steps are based on the Azure documentation for resizing VMs. After resizing the VM, the new size stays in effect until you do a repair or upgrade; after that you will need to do the resize again if you do not have an HA setup. If you do have an HA setup, remember to change both VMs: that way, when a repair changes the primary host to the secondary, the latter will already have the correct size.
The VM has to be in a stopped state. If the VM is in an availability set, then all VMs in that availability set have to be stopped; in the case of Cloudbreak, that means a whole host group. That is, if you have a Data Hub, stop the Data Hub; if you have a Data Lake, stop the environment.
Go to Azure portal > Virtual Machines and find your machine. Click the one you want to resize to open its details page, then open the Size submenu. You will get a list of available VM sizes. Note that you cannot resize from any type of VM to any other type; it is best to stay within the same series, i.e. if you have a Standard_D8s v3 VM, select a bigger size within the Ds v3 series. Click the desired size and click Resize at the bottom of the page.

Changing the root disk size
After changing the root disk size, the new value stays in effect until you do a repair or upgrade. After that you will need to do the resize again if you do not have an HA setup. If you do have an HA setup, remember to change the root disk on both VMs: that way, when a repair changes the primary host to the secondary, the latter will already have the larger root disk.
Make sure the VM is in a stopped state: in case of a DH stop the DH, in case of a DL stop the environment. Go to the Azure portal and select the VM whose root disk you want to resize. On the details page click Disks; the disks are listed in two groups, and the upper one is the root disk. Click it, then on its details page click "Size and performance". You will get a list of available root disk SKUs and sizes. Enter the desired disk size in the custom disk size text box (note that this may change the SKU of the disk) and press Resize at the bottom of the page. Going back to the VM's Disks menu, you should see the new value after a refresh.
However, on the VM you will need to resize the partition and the filesystem manually. Start the DH or environment, SSH into the VM once it has started, and proceed as follows:
lsblk shows the size of the attached disk and the size of the partition mounted from that disk. In this example, disk sda is reported to have 567G (the size set in the Azure portal), while sda2 is still 133.5G, the original size.
Issue growpart /dev/sda 2 to grow the partition that is mounted under /:
[root@gpapp-azvolres5-dl-master0 cloudbreak]# growpart /dev/sda 2
CHANGED: partition=2 start=1026048 old: size=279992287 end=281018335 new: size=1188059103 end=1189085151
If you now issue lsblk again, you will see that the partition has grown.
Finally, issue xfs_growfs / to resize the filesystem mounted under /.

Resizing other disks than the root disk
From the Cloudbreak point of view, there are 3 types of disks on an Azure Linux VM:
The root disk: can be resized as discussed above, but a repair/upgrade reverts the change.
Temporary storage: the size is fixed and depends on the VM size.
Data disks: can be resized, but it might not make much sense.
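Put together, and assuming as in the example above that the root disk is /dev/sda with the root filesystem on partition 2 and formatted as XFS, the on-VM sequence is a sketch like this (adjust the device and partition number to your layout):
lsblk               # the disk shows the new size, the partition does not yet
growpart /dev/sda 2 # grow partition 2 to fill the disk
lsblk               # the partition now reflects the new size
xfs_growfs /        # grow the XFS filesystem mounted at /
df -h /             # verify the additional space is available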
10-26-2022
07:36 PM
Symptoms: In versions prior to CDH 6.3.1, NodeManagers can enter an unhealthy state with the below errors observed in the NM logs:
2022-10-20 15:31:32,487 ERROR logaggregation.AggregatedLogFormat (AggregatedLogFormat.java:logErrorMessage(299)) - Error aggregating log file. Log file : /hadoop/ssd01/yarn/log/application_1665989140069_135925/container_e93_1665989140069_135925_01_000002/history.txt.appattempt_1665989140069_135925_000001. /hadoop/ssd01/yarn/log/application_1665989140069_135925/container_e93_1665989140069_135925_01_000002/history.txt.appattempt_1665989140069_135925_000001 (Permission denied)
2022-10-20 15:28:19,556 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exit code: 35
2022-10-20 15:28:19,556 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exception message: Launch container failed
2022-10-20 15:28:19,556 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Shell error output: Could not create container dirs
Could not create local files and directories
2022-10-20 15:28:19,557 ERROR launcher.ContainerLaunch (ContainerLaunch.java:call(327)) - Failed to launch container due to configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
Permissions on yarn.nodemanager.local-dirs need to be checked and rectified if they are not correct. The actual issue is that most of these exit codes do not fall under the criteria for which the NM should be marked unhealthy. Based on the above, you might be hitting the known issues:
https://issues.apache.org/jira/browse/YARN-8751
https://issues.apache.org/jira/browse/YARN-9833
Resolution:
1. Clearing the cache and restarting the affected NodeManagers should resolve the issue in older versions (see the sketch below).
2. This issue is permanently fixed in CDH 6.3.1 and later: https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_631_fixed_issues.html#fixed_issues (YARN-9833 - Race condition when DirectoryCollection.checkDirs() runs during container launch)
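A minimal sketch of the cache-clearing step in older versions, assuming the NodeManager local directory (yarn.nodemanager.local-dirs) is /hadoop/ssd01/yarn/nm; substitute the actual value from your configuration and stop the NodeManager role in Cloudera Manager before removing the cache:
# with the NodeManager role stopped on the affected host
rm -rf /hadoop/ssd01/yarn/nm/usercache/*
rm -rf /hadoop/ssd01/yarn/nm/filecache/*
# then start the NodeManager role again in Cloudera Manager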