Member since: 02-02-2021
Posts: 116
Kudos Received: 2
Solutions: 5
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 745 | 08-13-2021 09:44 AM |
| | 3692 | 04-27-2021 04:23 PM |
| | 1367 | 04-26-2021 10:47 AM |
| | 923 | 03-29-2021 06:01 PM |
| | 2744 | 03-17-2021 04:53 PM |
05-28-2024
09:22 PM
When an application or job that typically completes in a short time is taking significantly longer than expected, it's essential to troubleshoot systematically to identify and resolve the bottleneck. Here are some steps and areas to focus on when diagnosing performance issues in such scenarios:

1. Understand the Baseline and Gather Information
- Historical Performance Data: Compare the current run with previous runs. Identify what has changed in terms of input size, configuration, environment, etc.
- Logs and Metrics: Gather logs and metrics from the application, the YARN ResourceManager, and the NodeManagers.

2. Monitor Resource Utilization
- CPU, Memory, and Disk Usage: Check the resource usage on the nodes running the application. High CPU, memory, or disk I/O usage can indicate bottlenecks.
- Network Utilization: Check network usage, especially if the job involves significant data transfer between nodes.

3. Examine YARN and Application Logs
- YARN Logs: Access the logs through the YARN ResourceManager web UI. Look for errors, warnings, and unusual delays.
- Application Master (AM) Logs: Review the AM logs for any signs of retries, timeouts, or other issues.
- Container Logs: Check the logs of individual containers for errors and performance issues.

4. Check for Resource Contention
- NodeManager Logs: Look for signs of resource contention, such as high wait times for container allocation.
- Cluster Load: Check whether other jobs are running concurrently and consuming significant resources.

5. Investigate Job Configuration
- Parallelism: Ensure the job is correctly configured for parallel execution (e.g., the number of mappers and reducers in a MapReduce job).
- Resource Allocation: Verify that the job has sufficient resources allocated (e.g., memory, vCores).

6. Data Skew and Distribution
- Data Skew: Analyze the input data for skew. Uneven data distribution can cause some tasks to take much longer than others.
- Task Distribution: Check whether certain tasks or stages are taking disproportionately longer.

7. Network and I/O Bottlenecks
- Shuffle and Sort Phase: In Hadoop and Spark, the shuffle phase can be a bottleneck. Monitor shuffle performance and look for skew or excessive data transfer.
- HDFS or Storage I/O: Ensure that the underlying storage (HDFS, S3, etc.) is performing optimally and is not a bottleneck.

8. Garbage Collection and JVM Tuning
- GC Logs: If the application is JVM-based, check the garbage collection logs for excessive GC pauses.
- JVM Heap Size: Verify that the JVM heap size is appropriately configured to avoid frequent GC.

9. Configuration Parameters and Tuning
- YARN Configuration: Check for misconfigurations in YARN resource allocation settings.
- Application-specific Tuning: Tune parameters specific to the application framework (e.g., Spark, MapReduce).

10. External Dependencies
- External Services: If the application interacts with external services (e.g., databases, APIs), ensure they are not the bottleneck.
- Dependency Failures: Look for timeouts or failures in external service calls.

Detailed Steps for Specific Frameworks

For Hadoop MapReduce Jobs
- Check the Job History Server: Analyze the job in the Job History Server web UI. Identify slow tasks and investigate their logs.
- Analyze Task Attempts: Look for tasks that have failed and been retried multiple times. Identify tasks with unusually high execution times.

For Apache Spark Jobs
- Spark UI: Use the Spark web UI to analyze stages, tasks, and jobs. Look for stages that have long task durations or high task counts.
- Executor Logs: Check the logs of individual Spark executors for errors and warnings.
- Driver Logs: Examine the driver logs for signs of job bottlenecks or delays.

Conclusion

Systematically troubleshooting a job that is taking longer than usual involves a combination of monitoring resource utilization, examining logs, analyzing job configuration, and investigating data distribution and skew. By following these steps and using the right tools, you can identify and resolve performance bottlenecks effectively. If the issue persists, consider breaking the problem down further, or seek help from more detailed profiling tools or from experts familiar with your specific application framework and environment.
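If you'd rather start from the command line than the web UIs, here is a minimal sketch of pulling the status and logs referenced above with the standard YARN CLI. The application and container IDs below are placeholders you would take from yarn application -list or the ResourceManager UI:

# List running applications to find the slow one
yarn application -list -appStates RUNNING

# Check overall status, progress, and the tracking URL for one application
yarn application -status application_1716000000000_0001

# Pull the aggregated AM and container logs (requires log aggregation to be enabled)
yarn logs -applicationId application_1716000000000_0001 > app.log

# Once a slow task is identified, narrow down to its container's log
yarn logs -applicationId application_1716000000000_0001 -containerId container_1716000000000_0001_01_000002

# For a JVM-based job, a quick scan for long GC pauses in the collected logs
grep -iE 'full gc|gc pause' app.log | tail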
04-23-2024
06:59 AM
If it's a Tez application, the AM logs will show how much memory is currently allocated/consumed by the application and how much free resource is available in the queue at that specific time, e.g.:

2024-04-22 23:27:20,636 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated: <memory:843776, vCores:206> Free: <memory:2048, vCores:306> pendingRequests: 0 delayedContainers: 205 heartbeats: 101 lastPreemptionHeartbeat: 100
2024-04-22 23:27:30,660 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated: <memory:155648, vCores:38> Free: <memory:495616, vCores:356> pendingRequests: 0 delayedContainers: 38 heartbeats: 151 lastPreemptionHeartbeat: 150

These allocation details are logged frequently in the Tez AM logs.
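If you want to pull just those allocation heartbeat lines out of an application's logs, here is a minimal sketch using the YARN CLI. The application ID is a placeholder, log aggregation is assumed to be enabled, and the -am 1 flag (available in Hadoop 2.8+) restricts output to the first AM attempt:

# Fetch the AM container logs and keep only the scheduler's allocation lines
yarn logs -applicationId application_1713760000000_0042 -am 1 2>/dev/null \
  | grep 'YarnTaskSchedulerService.*Allocated:'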
11-21-2023
09:35 AM
It seems like you want to run the Tez example "OrderedWordCount" using the tez-examples*.jar file. The OrderedWordCount example is part of the Tez examples and demonstrates how to perform a word count with ordering. Assuming you have Tez installed on your system, you can follow these steps:

export TEZ_CONF_DIR=/etc/tez/conf/
export TEZ_HOME=/opt/cloudera/parcels/CDH/lib/tez/
export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_HOME}/bin/*:${TEZ_HOME}/*
yarn jar ${TEZ_HOME}/bin/tez-examples-*.jar orderedwordcount /somewhere/input /somewhere/output
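Before running the job, the input directory must exist in HDFS with at least one text file in it, and the output directory must not already exist (the job creates it). A quick sketch, reusing the placeholder paths from above:

hdfs dfs -mkdir -p /somewhere/input
echo "hello world hello tez" > sample.txt
hdfs dfs -put sample.txt /somewhere/input/
# after the job completes, inspect the ordered word counts
hdfs dfs -cat /somewhere/output/part*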
05-15-2023
05:27 AM
@ryu If @steven-matison answered your question, please mark his reply as the solution, as it will make it easier for others to find the answer in the future.
10-25-2022
12:53 PM
@ryu CDP Public Cloud Azure or CDP Private Cloud on Azure VMs? To link a NiFi instance outside of the cluster, you will need to provide that NiFi with the client configuration files from the CDP cluster, for example core-site.xml and hdfs-site.xml. Beyond that configuration, you will need to do some networking to allow access between the systems, and then, last but not least, deal with access/auth and Kerberos. If you are already working on some of these areas, be sure to include screenshots of processors, controller services, configs, etc.
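As a rough sketch of the wiring on the external NiFi side (host names and paths below are placeholders, not from your environment):

# Copy the client configs from a CDP cluster node to the NiFi host
scp cdp-node:/etc/hadoop/conf/core-site.xml /opt/nifi/hadoop-conf/
scp cdp-node:/etc/hadoop/conf/hdfs-site.xml /opt/nifi/hadoop-conf/

Then, on an HDFS processor such as PutHDFS or ListHDFS, point the "Hadoop Configuration Resources" property at "/opt/nifi/hadoop-conf/core-site.xml,/opt/nifi/hadoop-conf/hdfs-site.xml", and handle the Kerberos piece with the processor's Kerberos principal/keytab properties or a Kerberos Credentials controller service.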
01-27-2022
05:22 PM
Hi @ryu
Just from reading the absolute file path you've called out, you are evidently running Log4j version 1.
To back up a bit for the sake of other community members who might be reading this: the chief reason we're even talking about this is that Log4j 1 reached its end of life (EOL) and has not been officially supported by the Apache™ Logging Services™ Project since 5 August 2015, over six years ago now. But largely due to the increased scrutiny of Log4j in general in the wake of the CVE-2021-44228 vulnerability (which impacted Log4j 2), a new, lower-severity vulnerability has come to light, CVE-2021-4104, which does affect Log4j 1. Again, because Log4j 1 has reached EOL, Apache's Logging Services Project isn't providing any more releases for it, even to remediate serious security vulnerabilities. For both of these reasons and others, the best practical approach is to upgrade to a more up-to-date data platform that is being actively supported.
Cloudera's current Enterprise Data Platform, since the Fall of 2019, is Cloudera Data Platform (CDP), which in its on-premises "form factor" is now called CDP Private Cloud. CDP supersedes HDP as Cloudera's Enterprise Data Platform, and as an aside, HDP 2.6.1 reached its end of support date in December 2020 (open that link and then expand the section labeled "Hortonworks Data Platform (HDP)" underneath Current End of Support (EoS) Dates).
As a core part of its business, Cloudera addresses customer needs for vulnerability remediation under a subscription agreement, even when Apache no longer supports an impacted component.
You can read Cloudera's judgement about how concerned you should be about that Log4j 1 vulnerability here: Cloudera response to CVE-2021-4104
The reason upgrading is the best practical approach is that, arguably, the proper way to upgrade Log4j is to go through the source code for all the affected components that use the Log4j version you are trying to avoid, become intimate with the details of how they use the various logging APIs, and then update or even totally rewrite the code that uses those existing, risk-exposed APIs to use the APIs in the new, replacement version of Log4j 2 that is not exposed to known vulnerabilities (presumably 2.15.x or later). Then you would recompile against Log4j 2 exclusively, unit test and release each changed component, and test the entire system as a whole for regressions. And then, finally, you would migrate the completed product, now with only Log4j 2, to production. As you probably understand, that takes a lot of engineering effort, and it's not something a data platform administrator, or even a data platform team at an enterprise that is using HDP for its internal data management needs, should be expected to complete on their own.
Upgrading the platform to a new, more up-to-date release that is actively being maintained is the next best thing, and as a practical matter, it's better. It allows data platform users to take advantage of the fact that the data platform provider/vendor has those substantial engineering resources and can bring them to bear on the necessary API updates on an ongoing basis and in a timely fashion.
If for whatever reason you aren't able or willing to upgrade and don't have a subscription agreement…well, just engaging in a bit of logical deduction from first principles (because I don't have access to an HDP 2.6.1-based cluster at the moment to actually try it), I think the short answer to this portion of your question:
Can we just replace the log4j jar file with an upgraded version?
…is a qualified "No". Some of the critical APIs in Log4j 2 are simply not backward-compatible with Log4j 1, so you should assume that just dropping the Log4j 2 .jar files into an existing HDP installation is not going to work without issues. Other members of the Cloudera Community have reported that even dropping the Log4j 2 .jar file(s) into an installation of CDH 6.3.x, which was built with Log4j 2 specifically, produced less than desirable results.
However, there does exist a Log4j 1.x bridge which reportedly will "forward" all requests for Log4j 1 to Log4j 2, assuming that you have a valid Log4j 2 installation, so you might want to explore that option if you can test it out on a non-production cluster first. It also requires that you do a thorough job of removing any Log4j 1 jars from the application's CLASSPATH for every Hadoop component. It goes without saying that Cloudera doesn't support this, however, and again, I haven't tried it, so you should only proceed down this path if you are desperate to remove a Log4j 1 installation, don't have or can't obtain a subscription agreement, and have a solid plan to roll back the change if it doesn't work out.
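For anyone who does decide to experiment with that bridge on a non-production cluster, the general shape of the swap looks something like the following. The paths and version numbers are illustrative placeholders, not a tested recipe:

# 1. Move the Log4j 1 jar out of the component's classpath (keep a backup for rollback)
mv /path/to/component/lib/log4j-1.2.17.jar /root/log4j-backup/

# 2. Drop in Log4j 2 plus the log4j-1.2-api bridge, which re-implements the
#    Log4j 1 API on top of Log4j 2
cp log4j-api-2.17.1.jar log4j-core-2.17.1.jar log4j-1.2-api-2.17.1.jar /path/to/component/lib/

# 3. Repeat for every affected component, restart, and regression-test; the bridge
#    only covers the common Log4j 1 API surface, so anything that relied on Log4j 1
#    internals (custom appenders, direct PropertyConfigurator use) may still break.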
12-16-2021
12:40 PM
Hi @willx, is there a way to see whether the Hadoop path is a volume or a directory?
11-21-2021
10:07 PM
We have to check from where Ambari is triggering the command. If possible, please share the relevant Ambari Server error logs captured during the operation.
11-15-2021
03:06 AM
I really liked the way you highlighted some important and significant points.
10-16-2021
09:00 AM
1 Kudo
@Faizan_Ali Thanks for the explanation; it makes sense. So while an application is running, it writes its container logs to a local directory, ${yarn.nodemanager.log-dirs}/application_${appid}, and after the application completes, it aggregates those logs into ${yarn.nodemanager.remote-app-log-dir}. Thanks again for clarifying.
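For anyone who wants to verify that flow on their own cluster, a small sketch (the application ID is a placeholder, and the property values shown are common Apache defaults, not CDP-specific):

# yarn-site.xml settings that control the two locations:
#   yarn.nodemanager.log-dirs            local dir while the app runs
#   yarn.log-aggregation-enable          must be true for aggregation to happen
#   yarn.nodemanager.remote-app-log-dir  HDFS dir after completion, e.g. /tmp/logs

# While the app is running: container logs on the worker node's local disk
ls <yarn.nodemanager.log-dirs>/application_1716000000000_0001/

# After completion: aggregated logs commonly land under <remote-app-log-dir>/<user>/logs/
# (the exact layout can vary by Hadoop version)
hdfs dfs -ls /tmp/logs/$(whoami)/logs/application_1716000000000_0001
yarn logs -applicationId application_1716000000000_0001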