Created 10-01-2024 12:57 PM
Hello all,
I am running into an issue with an Oozie workflow that moves data from an external table into HDFS. The Oozie job successfully brings the data from the external table into HDFS; after that, Sqoop should load this data into an ORC table.
The failure occurs at a specific point in the Oozie workflow, where it tries to read two files from an HDFS directory: one containing the YARN application_id and the other containing the instructions for the database insert. The job fails, and the directory remains empty. In the logs, I see the following error:
org.apache.hadoop.ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby.
Context:
This error began after an incident where directories in HDFS were deleted. This may have affected the failover process and high availability (HA) of HDFS.
The error seems to be related to the standby NameNode, suggesting that the client (Sqoop/Oozie) is trying to read/write to the standby NameNode, where read/write operations are not allowed.
My hypothesis: This problem may be related to HDFS failover. The incident with the deleted directories may have affected HA, preventing Sqoop or Oozie from communicating correctly with the active NameNode, so the files are never written and the job fails with this error.
If the standby NameNode is not properly synchronized, or if failover is not functioning correctly, this could explain why the job is unable to write files to the HDFS directory, directly affecting the progress of the Oozie/Sqoop job.
Questions:
Could this issue be related to HDFS failover and the HA configuration being affected by the deleted directories?
How can I validate whether the problem is related to HDFS HA and failover?
Is there a way to force Sqoop/Oozie to properly use the active NameNode instead of the standby?
I have checked the HA configuration, and failover seems to be functioning as the standby takes over when the active NameNode is restarted. However, the error persists when trying to read or write to HDFS.
I appreciate any help and suggestions.
Best regards
Created on 10-01-2024 01:25 PM - edited 10-01-2024 01:25 PM
Hi @evanle96 !
Could this issue be related to HDFS failover and the HA configuration being affected by the deleted directories?
No, I don't think so. Deleting directories should not affect NN failover or the HA configuration, unless something is fundamentally wrong with your setup or hardware. Could you elaborate a bit more on what happened there?
How can I validate whether the problem is related to HDFS HA and failover?
What you mention in your last question is a good start: trigger a manual failover and check whether basic reads and writes from the CLI still work.
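In case it helps, this is a rough sketch of the kind of CLI check I mean; nn1 and nn2 are placeholder NameNode IDs, so substitute the values from dfs.ha.namenodes.<your-nameservice>:

# Check which NameNode is currently active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Trigger a manual failover from nn1 to nn2
# (if automatic failover via ZKFC is enabled, haadmin may refuse this and
#  the failover has to be driven from Cloudera Manager instead)
hdfs haadmin -failover nn1 nn2
# Basic read/write smoke test through the logical nameservice
echo "ha smoke test" > /tmp/ha_test.txt
hdfs dfs -put -f /tmp/ha_test.txt /tmp/ha_test.txt
hdfs dfs -cat /tmp/ha_test.txt
hdfs dfs -rm /tmp/ha_test.txt

If those commands work both before and after the failover, basic HA is behaving, and the problem is more likely on the client/job configuration side.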
Is there a way to force Sqoop/Oozie to properly use the active NameNode instead of the standby?
HDFS clients in general should have a list of all NameNodes available to them. If the client gets the above error when connecting, it should try to connect to the next available NN. If that is not happening, there is likely an issue with the client's configuration (core-site.xml, hdfs-site.xml): it may only know about one NN (which happens to be the standby), the config may be outdated and pointing to an old, decommissioned host, or it may be unable to connect due to network issues. Your logs should tell whether the job actually tries to fail over to the other NN, so a bit more context around the error message (more logs) would be useful to see what's going on exactly.
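A quick way to see what the client config actually resolves to is hdfs getconf. This is just a sketch; "mycluster", nn1 and nn2 are placeholders to substitute with your own nameservice and NameNode IDs:

# Which filesystem and nameservice the client defaults to
hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.nameservices
# Which NameNodes the client knows about for that nameservice
hdfs getconf -confKey dfs.ha.namenodes.mycluster
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn1
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn2
# The failover proxy provider must be set for client-side failover to work
hdfs getconf -confKey dfs.client.failover.proxy.provider.mycluster

If any of these come back empty or point at a single host, the job is probably not picking up the full HA client configuration.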
I have checked the HA configuration, and failover seems to be functioning as the standby takes over when the active NameNode is restarted. However, the error persists when trying to read or write to HDFS.
Do you mean the Sqoop job fails, or that you cannot read/write even with simple HDFS CLI commands, no matter which NN is active?
Created 10-01-2024 01:37 PM
Hi @zegab
Some directories inside the user folder in HDFS were deleted, which temporarily affected some services. Afterward, I reinstalled the clients and libraries for the services, and everything seems to be working fine now.
Validation of basic read/write after manual failover:
How exactly can I validate read/write operations on HDFS from the CLI after a manual failover?
Ensuring that Oozie and Sqoop access both active and standby NameNodes correctly:
I would like to validate how Oozie and Sqoop are accessing the active and standby NameNodes. I want to ensure they are properly configured to recognize the active NameNode and handle failover correctly.
Log details and Sqoop configuration warning:
Here’s part of the log I’m seeing:
>>> Invoking Sqoop command line now >>>
17:05:55.428 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
17:05:55.479 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.7.7.1.7.2000-305
17:05:55.537 [main] WARN org.apache.hadoop.ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
It seems that Sqoop is trying to connect to the standby NameNode instead of the active one, resulting in the StandbyException. Additionally, there's a warning that $SQOOP_CONF_DIR is not set. Could this indicate that Sqoop is missing some configuration settings and isn't recognizing the correct NameNode?
Any further advice on validating the failover process, ensuring Oozie and Sqoop use the active NameNode, and resolving the $SQOOP_CONF_DIR issue would be greatly appreciated.
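Would checking something like the following on the gateway node where the launcher runs be the right direction? The paths below are just the typical default client-config locations, not necessarily ours:

# Point Sqoop at its client config (typical default location; adjust as needed)
export SQOOP_CONF_DIR=/etc/sqoop/conf
# Verify the Hadoop client config visible to the job lists both NameNodes
grep -A1 "dfs.ha.namenodes" /etc/hadoop/conf/hdfs-site.xml
grep -A1 "dfs.namenode.rpc-address" /etc/hadoop/conf/hdfs-site.xml
# Verify the default FS is the logical nameservice, not a single NameNode host
grep -A1 "fs.defaultFS" /etc/hadoop/conf/core-site.xml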
Best regards,
Leonardo
Created 10-10-2024 05:13 AM
Hi @evanle96
This error is not an issue. In an HA setup, the client call usually goes to both NameNodes; the active NN acknowledges the call, while the standby NN throws this warning. So you can ignore the warning here.
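If you want to double-check that the warning is harmless, confirming that exactly one NameNode is active is usually enough. A small sketch (available on recent Hadoop releases; on older ones, query each NameNode ID with -getServiceState):

# Print the HA state of every configured NameNode in one call
hdfs haadmin -getAllServiceState

As long as one NN reports active and the job ultimately succeeds, the StandbyException line is just the client probing the standby before settling on the active NameNode.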