Member since 12-20-2022
81 Posts
19 Kudos Received
8 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 135 | 04-02-2025 11:35 PM |
 | 134 | 03-23-2025 11:30 PM |
 | 115 | 03-06-2025 10:11 PM |
 | 516 | 10-29-2024 11:53 PM |
 | 405 | 07-31-2024 01:11 AM |
08-19-2024
07:34 AM
1 Kudo
@hadoopranger I am circling back on this. Did you try the steps shared in my previous reply? If so, let us know how it went. If you found my reply helpful, you may mark it as the accepted solution. You can also say thanks by clicking the thumbs-up button. V
08-16-2024
09:06 AM
Hi @husseljo, please mark this with "Accept as solution" if my answer helped you.
08-13-2024
03:12 AM
1 Kudo
So do you mean that there is no need for a migration: just bring up the new server, assign the required roles, and then decommission the old one?
08-13-2024
03:11 AM
1 Kudo
Thank you for the response. But it's an old post of mine, and I was able to get YARN QM working back then.
08-07-2024
03:52 AM
1 Kudo
Hi @Muba, please check whether the Hue user user1 is part of the HDFS superuser group. Membership in that group would give the user access to the File Browser and all the HDFS directories.
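One way to make this check concrete is to compare the output of `hdfs groups user1` with the value returned by `hdfs getconf -confKey dfs.permissions.superusergroup`. The sketch below only does the comparison; it assumes the groups output has the form `user : group1 group2`, and the function name and sample lines are hypothetical.

```python
def in_superusergroup(groups_line, supergroup="supergroup"):
    """Check one line of `hdfs groups` output for the superuser group.

    Assumes the line looks like "user1 : group1 group2". The default
    "supergroup" is the stock value of dfs.permissions.superusergroup;
    pass your cluster's actual value if it differs.
    """
    _user, _sep, groups = groups_line.partition(":")
    return supergroup in groups.split()


# Hypothetical sample lines, not taken from a real cluster.
print(in_superusergroup("user1 : hadoop supergroup"))
print(in_superusergroup("user1 : hadoop"))
```

If the function returns True for user1, removing the user from that group is the likely fix for the unexpected File Browser access.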
07-31-2024
01:15 AM
1 Kudo
Please check the permissions on the NodeManager directory. The owner or group must be yarn. Then try the decommission again; the blocks will be redistributed to the other DataNodes before the node is decommissioned.
07-31-2024
01:11 AM
1 Kudo
Please check whether a Kerberos TGT has been created for all the NodeManagers. That should fix the issue.
05-28-2024
09:22 PM
When an application or job that typically completes in a short time is taking significantly longer than expected, it's essential to systematically troubleshoot the issue to identify and resolve the bottleneck. Here are some steps and areas to focus on when diagnosing performance issues in such scenarios:

1. Understand the Baseline and Gather Information
- Historical Performance Data: Compare the current run with previous runs. Identify what has changed in terms of input size, configuration, environment, etc.
- Logs and Metrics: Gather logs and metrics from the application, YARN ResourceManager, and NodeManager.

2. Monitor Resource Utilization
- CPU, Memory, and Disk Usage: Check the resource usage on the nodes running the application. High CPU, memory, or disk I/O usage can indicate bottlenecks.
- Network Utilization: Check network usage, especially if the job involves significant data transfer between nodes.

3. Examine YARN and Application Logs
- YARN Logs: Access the logs through the YARN ResourceManager web UI. Look for errors, warnings, and unusual delays.
- Application Master (AM) Logs: Review the AM logs for any signs of retries, timeouts, or other issues.
- Container Logs: Check the logs of individual containers for errors and performance issues.

4. Check for Resource Contention
- NodeManager Logs: Look for signs of resource contention, such as high wait times for container allocation.
- Cluster Load: Check if other jobs are running concurrently and consuming significant resources.

5. Investigate Job Configuration
- Parallelism: Ensure the job is correctly configured for parallel execution (e.g., number of mappers and reducers in a MapReduce job).
- Resource Allocation: Verify that the job has sufficient resources allocated (e.g., memory, vCores).

6. Data Skew and Distribution
- Data Skew: Analyze the input data for skew. Uneven data distribution can cause some tasks to take much longer than others.
- Task Distribution: Check if certain tasks or stages are taking disproportionately longer.

7. Network and I/O Bottlenecks
- Shuffle and Sort Phase: In Hadoop and Spark, the shuffle phase can be a bottleneck. Monitor the shuffle performance and look for skew or excessive data transfer.
- HDFS or Storage I/O: Ensure that the underlying storage (HDFS, S3, etc.) is performing optimally and there are no bottlenecks.

8. Garbage Collection and JVM Tuning
- GC Logs: If the application is JVM-based, check the garbage collection logs for excessive GC pauses.
- JVM Heap Size: Verify that the JVM heap size is appropriately configured to avoid frequent GC.

9. Configuration Parameters and Tuning
- YARN Configuration: Check for misconfigurations in YARN resource allocation settings.
- Application-specific Tuning: Tune parameters specific to the application framework (e.g., Spark, MapReduce).

10. External Dependencies
- External Services: If the application interacts with external services (e.g., databases, APIs), ensure they are not the bottleneck.
- Dependency Failures: Look for timeouts or failures in external service calls.

Detailed Steps for Specific Frameworks

For Hadoop MapReduce Jobs
- Check Job History Server: Analyze the job in the Job History Server web UI. Identify slow tasks and investigate their logs.
- Analyze Task Attempts: Look for tasks that have failed and retried multiple times. Identify tasks with unusually high execution times.

For Apache Spark Jobs
- Spark UI: Use the Spark web UI to analyze stages, tasks, and jobs. Look for stages that have long task durations or high task counts.
- Executor Logs: Check the logs of individual Spark executors for errors and warnings.
- Driver Logs: Examine the driver logs for signs of job bottlenecks or delays.

Conclusion

Systematically troubleshooting a job that is taking longer than usual involves a combination of monitoring resource utilization, examining logs, analyzing job configurations, and investigating data distribution and skew. By following these steps and using the right tools, you can identify and resolve the performance bottlenecks effectively. If the issue persists, consider breaking down the problem further or seeking help from more detailed profiling tools or experts familiar with your specific application framework and environment.
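As a rough illustration of the task-distribution check in step 6, the sketch below flags tasks whose duration is far above the median. It assumes you have already collected per-task durations (for example from the Job History Server or the Spark UI) into a dictionary; the function name, the 3x threshold, and the sample numbers are all hypothetical choices, not values from any framework.

```python
from statistics import median


def find_stragglers(durations, factor=3.0):
    """Return ids of tasks that ran longer than factor x the median duration.

    durations maps a task id to its run time in seconds. The factor is an
    arbitrary starting point and should be tuned for the workload.
    """
    if not durations:
        return []
    typical = median(durations.values())
    return sorted(task for task, secs in durations.items()
                  if secs > factor * typical)


# Made-up durations: three similar tasks and one outlier.
durations = {"task_0": 42.0, "task_1": 39.5, "task_2": 41.2, "task_3": 310.0}
print(find_stragglers(durations))  # ['task_3']
```

The median is used instead of the mean so that a single very slow task does not shift the baseline it is compared against. Tasks flagged this way are the ones whose input splits and logs are worth inspecting first.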
04-23-2024
06:59 AM
If it's a Tez application, the AM logs will show how much memory is currently allocated to (consumed by) the application and how many free resources are available in the queue at that specific time, e.g.:

2024-04-22 23:27:20,636 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated: <memory:843776, vCores:206> Free: <memory:2048, vCores:306> pendingRequests: 0 delayedContainers: 205 heartbeats: 101 lastPreemptionHeartbeat: 100

2024-04-22 23:27:30,660 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated: <memory:155648, vCores:38> Free: <memory:495616, vCores:356> pendingRequests: 0 delayedContainers: 38 heartbeats: 151 lastPreemptionHeartbeat: 150

These allocation details are logged frequently in the Tez AM logs.
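To track these figures over time, the lines can be parsed with a small script. This is a minimal sketch that assumes the line format matches the samples quoted in this post; the regular expression and the function name are illustrative, not part of Tez.

```python
import re

# Matches the "Allocated: <...> Free: <...>" portion of the scheduler line,
# assuming the layout shown in the sample log lines.
ALLOC_RE = re.compile(
    r"Allocated: <memory:(?P<alloc_mem>\d+), vCores:(?P<alloc_vcores>\d+)> "
    r"Free: <memory:(?P<free_mem>\d+), vCores:(?P<free_vcores>\d+)>"
)


def parse_allocation(line):
    """Return allocated/free memory and vCores as ints, or None if absent."""
    match = ALLOC_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


sample = (
    "2024-04-22 23:27:20,636 [INFO] [AMRM Callback Handler Thread] "
    "|rm.YarnTaskSchedulerService|: Allocated: <memory:843776, vCores:206> "
    "Free: <memory:2048, vCores:306> pendingRequests: 0 delayedContainers: 205 "
    "heartbeats: 101 lastPreemptionHeartbeat: 100"
)
print(parse_allocation(sample))
```

Running the parser over every matching line in the AM log gives a time series of allocated versus free resources, which makes it easier to see when the queue was close to exhausted.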
04-23-2024
06:18 AM
Hi @DanhH , Please accept this as a solution if it helped.