Member since: 02-02-2021
Posts: 116
Kudos Received: 2
Solutions: 5
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1312 | 08-13-2021 09:44 AM |
| | 5996 | 04-27-2021 04:23 PM |
| | 2340 | 04-26-2021 10:47 AM |
| | 1533 | 03-29-2021 06:01 PM |
| | 4205 | 03-17-2021 04:53 PM |
05-28-2024
09:22 PM
When an application or job that typically completes in a short time is taking significantly longer than expected, it's essential to systematically troubleshoot the issue to identify and resolve the bottleneck. Here are some steps and areas to focus on when diagnosing performance issues in such scenarios:

1. Understand the Baseline and Gather Information
- Historical Performance Data: Compare the current run with previous runs. Identify what has changed in terms of input size, configuration, environment, etc.
- Logs and Metrics: Gather logs and metrics from the application, YARN ResourceManager, and NodeManager.

2. Monitor Resource Utilization
- CPU, Memory, and Disk Usage: Check the resource usage on the nodes running the application. High CPU, memory, or disk I/O usage can indicate bottlenecks.
- Network Utilization: Check network usage, especially if the job involves significant data transfer between nodes.

3. Examine YARN and Application Logs
- YARN Logs: Access the logs through the YARN ResourceManager web UI. Look for errors, warnings, and unusual delays.
- Application Master (AM) Logs: Review the AM logs for any signs of retries, timeouts, or other issues.
- Container Logs: Check the logs of individual containers for errors and performance issues.

4. Check for Resource Contention
- NodeManager Logs: Look for signs of resource contention, such as high wait times for container allocation.
- Cluster Load: Check if other jobs are running concurrently and consuming significant resources.

5. Investigate Job Configuration
- Parallelism: Ensure the job is correctly configured for parallel execution (e.g., number of mappers and reducers in a MapReduce job).
- Resource Allocation: Verify that the job has sufficient resources allocated (e.g., memory, vCores).

6. Data Skew and Distribution
- Data Skew: Analyze the input data for skew. Uneven data distribution can cause some tasks to take much longer than others.
- Task Distribution: Check if certain tasks or stages are taking disproportionately longer.

7. Network and I/O Bottlenecks
- Shuffle and Sort Phase: In Hadoop and Spark, the shuffle phase can be a bottleneck. Monitor the shuffle performance and look for skew or excessive data transfer.
- HDFS or Storage I/O: Ensure that the underlying storage (HDFS, S3, etc.) is performing optimally and there are no bottlenecks.

8. Garbage Collection and JVM Tuning
- GC Logs: If the application is JVM-based, check the garbage collection logs for excessive GC pauses.
- JVM Heap Size: Verify that the JVM heap size is appropriately configured to avoid frequent GC.

9. Configuration Parameters and Tuning
- YARN Configuration: Check for misconfigurations in YARN resource allocation settings.
- Application-specific Tuning: Tune parameters specific to the application framework (e.g., Spark, MapReduce).

10. External Dependencies
- External Services: If the application interacts with external services (e.g., databases, APIs), ensure they are not the bottleneck.
- Dependency Failures: Look for timeouts or failures in external service calls.

Detailed Steps for Specific Frameworks

For Hadoop MapReduce Jobs
- Check Job History Server: Analyze the job in the Job History Server web UI. Identify slow tasks and investigate their logs.
- Analyze Task Attempts: Look for tasks that have failed and retried multiple times. Identify tasks with unusually high execution times.

For Apache Spark Jobs
- Spark UI: Use the Spark web UI to analyze stages, tasks, and jobs. Look for stages that have long task durations or high task counts.
- Executor Logs: Check the logs of individual Spark executors for errors and warnings.
- Driver Logs: Examine the driver logs for signs of job bottlenecks or delays.

Conclusion
Systematically troubleshooting a job that is taking longer than usual involves a combination of monitoring resource utilization, examining logs, analyzing job configurations, and investigating data distribution and skew. By following these steps and using the right tools, you can identify and resolve the performance bottlenecks effectively. If the issue persists, consider breaking down the problem further or seeking help from more detailed profiling tools or experts familiar with your specific application framework and environment.
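To make the "identify tasks with unusually high execution times" step a bit more concrete, here is a minimal sketch in Python. It assumes you have already copied per-task durations (in seconds) out of the Job History Server or Spark UI into a dictionary; the function name `find_stragglers`, the task IDs and the 3x-median threshold are all illustrative assumptions, not anything built into YARN, MapReduce or Spark.

```python
from statistics import median


def find_stragglers(task_durations, factor=3.0):
    """Return {task_id: seconds} for tasks slower than `factor` x the median.

    `task_durations` maps a task ID to its run time in seconds. The 3x
    default is an arbitrary starting point, not a framework setting.
    """
    if not task_durations:
        return {}
    threshold = factor * median(task_durations.values())
    return {
        task_id: seconds
        for task_id, seconds in task_durations.items()
        if seconds > threshold
    }


# Hypothetical durations copied by hand from a job's task list.
durations = {"task_0": 42, "task_1": 40, "task_2": 45, "task_3": 410}
print(find_stragglers(durations))
```

A handful of flagged tasks in an otherwise uniform stage usually points toward data skew or a slow node, while uniformly slow tasks point more toward resource contention or configuration.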
04-23-2024
06:59 AM
If it's a Tez application, the AM logs show how much memory is currently allocated to/consumed by the application and how much free resource is available in the queue at that specific time, e.g.:
2024-04-22 23:27:20,636 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated: <memory:843776, vCores:206> Free: <memory:2048, vCores:306> pendingRequests: 0 delayedContainers: 205 heartbeats: 101 lastPreemptionHeartbeat: 100
2024-04-22 23:27:30,660 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated: <memory:155648, vCores:38> Free: <memory:495616, vCores:356> pendingRequests: 0 delayedContainers: 38 heartbeats: 151 lastPreemptionHeartbeat: 150
These allocation details are logged frequently in the Tez AM logs.
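If you want to track these numbers over time instead of reading them by eye, a small parser along these lines may help. This is only a sketch: the regular expression is written against the two sample lines in this post, and the names `ALLOCATION_RE` and `parse_allocation` are my own, not part of Tez.

```python
import re

# Pattern written against the sample YarnTaskSchedulerService lines above;
# adjust it if your Tez version formats the line differently.
ALLOCATION_RE = re.compile(
    r"Allocated: <memory:(?P<alloc_mem>\d+), vCores:(?P<alloc_vcores>\d+)> "
    r"Free: <memory:(?P<free_mem>\d+), vCores:(?P<free_vcores>\d+)> "
    r"pendingRequests: (?P<pending>\d+)"
)


def parse_allocation(line):
    """Return the allocation numbers from one AM log line, or None if absent."""
    match = ALLOCATION_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


SAMPLE = (
    "2024-04-22 23:27:20,636 [INFO] [AMRM Callback Handler Thread] "
    "|rm.YarnTaskSchedulerService|: Allocated: <memory:843776, vCores:206> "
    "Free: <memory:2048, vCores:306> pendingRequests: 0 delayedContainers: 205 "
    "heartbeats: 101 lastPreemptionHeartbeat: 100"
)
print(parse_allocation(SAMPLE))
```

Running it over every line of the AM log and keeping the non-None results gives a simple time series of allocated versus free resources.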
01-27-2022
05:22 PM
Hi @ryu
Just from reading the absolute file path you've called out, evidently you are running Log4j version 1.
To back up a bit for the sake of other community members who might be reading this: the chief reason we're even talking about this is that Log4j 1 reached its end of life (EOL) on 5 August 2015, over six years ago now, and is no longer officially supported by the Apache™ Logging Services™ Project. Largely because of the increased scrutiny of Log4j in general in the wake of the CVE-2021-44228 vulnerability (which impacted Log4j 2), a new vulnerability that was rated lower in severity has come to light, CVE-2021-4104, which does affect Log4j 1. But again, Log4j 1 has reached EOL, and Apache's Logging Services™ Project isn't providing any more releases for Log4j 1, even to remediate serious security vulnerabilities. For both of these reasons, and for others, the best practical approach is to upgrade to a more up-to-date data platform that is being actively supported.
Cloudera's current Enterprise Data Platform, since the fall of 2019, is Cloudera Data Platform (CDP), which in its on-premises "form factor" is now called CDP Private Cloud. CDP supersedes HDP as Cloudera's Enterprise Data Platform, and as an aside, HDP 2.6.1 reached its end of support date in December 2020 (open that link and then expand the section labeled "Hortonworks Data Platform (HDP)" underneath Current End of Support (EoS) Dates).
As a core part of its business, Cloudera addresses customer needs for vulnerability remediation as part of the benefits of a subscription agreement even when Apache no longer supports an impacted component.
You can read Cloudera's judgement about how concerned you should be about that Log4j 1 vulnerability here: Cloudera response to CVE-2021-4104
The reason upgrading is the best practical approach is that, arguably, the proper way to upgrade Log4j is to go through the source code for all the affected components that use the Log4j version you are trying to avoid, become intimate with the details of how they use the various logging APIs, and then update or even totally rewrite the code that uses those existing, risk-exposed APIs to use the APIs of a Log4j 2 release that is not exposed to known vulnerabilities (presumably 2.17.1 or later, since the 2.15.0, 2.16.0 and 2.17.0 releases were themselves affected by follow-up CVEs). Then recompile against Log4j 2 exclusively, unit test and release each changed component, and test the entire system as a whole for regressions. Finally, migrate the completed product, with only Log4j 2, to production. As you probably understand, that takes a lot of engineering effort, and it's not something a data platform administrator, or even a data platform team at an enterprise that is using HDP for its internal data management needs, should be expected to complete on their own.
Upgrading the platform to a new, more up-to-date release that is actively being maintained is the next best thing, and as a practical matter, it's better. It allows data platform users to take advantage of the fact that the data platform provider/vendor is going to have those substantial engineering resources and be able to bring them to bear on the necessary API updates on an ongoing basis and in a timely fashion.
If for whatever reason you aren't able to or are unwilling to upgrade and don't have a subscription agreement…well, just engaging in a bit of logical deduction from first principles (because I don't have access to an HDP 2.6.1-based cluster at the moment to actually try it) I think the short answer to this portion of your question:
Can we just replace the log4j jar file with an upgraded version?
…is a qualified "No". Some of the critical APIs for Log4j 2 are simply not backward-compatible with Log4j 1, so you should assume that just dropping the Log4j 2 .jar files into an existing HDP installation is not going to work without issues. Other members of the Cloudera Community have reported that even dropping the Log4j 2 .jar file(s) into an installation of CDH 6.3.x, which was built with Log4j 2 specifically, produced less than desirable results.
However, there does exist a Log4j 1.x bridge which reportedly will "forward" all requests for Log4j 1 to Log4j 2, assuming that you have a valid Log4j 2 installation, so you might want to explore that option if you can test it out on a non-production cluster first. It also requires that you do a thorough job of removing any Log4j 1 jars from the application's CLASSPATH for every Hadoop component. It goes without saying that Cloudera doesn't support this, and again, I haven't tried it, so you should only proceed down this path if you are desperate to remove a Log4j 1 installation, don't have or can't obtain a subscription agreement, and have a solid plan to roll back the change if it doesn't work out.
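For the "remove any Log4j 1 jars from the CLASSPATH" part, a rough way to see what you are dealing with is to scan a classpath string for Log4j 1 jar names. The sketch below is untested against a real HDP cluster; the function name `find_log4j1_jars`, the paths and the file-name pattern are my own assumptions. The pattern deliberately does not match the bridge jar itself (published as `log4j-1.2-api`), and it does not expand wildcard entries such as `lib/*` or look inside shaded/uber jars.

```python
import os
import re

# Matches names like log4j-1.2.17.jar, but not the bridge jar
# (log4j-1.2-api-2.x.jar) or Log4j 2 jars such as log4j-core-2.x.jar.
LOG4J1_JAR = re.compile(r"^log4j-1\.\d+\.\d+.*\.jar$")


def find_log4j1_jars(classpath, sep=os.pathsep):
    """Return the classpath entries whose file name looks like a Log4j 1 jar."""
    hits = []
    for entry in classpath.split(sep):
        if LOG4J1_JAR.match(os.path.basename(entry)):
            hits.append(entry)
    return hits


# Hypothetical classpath for illustration only.
example = (
    "/opt/app/lib/log4j-1.2.17.jar:"
    "/opt/app/lib/log4j-1.2-api-2.17.1.jar:"
    "/opt/app/lib/log4j-core-2.17.1.jar"
)
print(find_log4j1_jars(example, sep=":"))
```

Treat an empty result as "nothing obvious found" rather than proof that Log4j 1 classes are gone.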
12-16-2021
12:40 PM
Hi @willx, is there a way to see if the Hadoop path is a volume or a directory?
10-16-2021
09:00 AM
1 Kudo
@Faizan_Ali Thanks for the explanation. Makes sense. So while an application is running, it writes the container logs to a local directory, "${yarn.nodemanager.log-dirs}/application_${appid}", and after the application has completed, the logs are aggregated into yarn.nodemanager.remote-app-log-dir. OK, thanks for the explanation.
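As a small illustration of where to look while the application is still running, here is a sketch that expands a yarn.nodemanager.log-dirs value (which can be a comma-separated list) into the candidate local directories for one application. The function name, the example directories and the application ID are made up, and I'm assuming the full ID string (including the "application_" prefix) is passed in.

```python
import posixpath


def local_app_log_dirs(log_dirs_value, app_id):
    """Expand a yarn.nodemanager.log-dirs value into per-application log dirs.

    `log_dirs_value` is the raw property value (comma-separated directories);
    `app_id` is the full application ID, e.g. "application_<cluster-ts>_<seq>".
    """
    dirs = [d.strip() for d in log_dirs_value.split(",") if d.strip()]
    return [posixpath.join(d, app_id) for d in dirs]


# Hypothetical values for illustration only.
for path in local_app_log_dirs(
    "/data1/yarn/log, /data2/yarn/log", "application_1634389200000_0007"
):
    print(path)
```

These directories only exist on the NodeManager hosts that actually ran containers for the application.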
09-09-2021
09:07 AM
DistCp and import/export are not supported for ACID tables, so you have two approaches:
Approach 1
=============
1. This assumes that you have ACID tables in both the source and target clusters.
2. Create an external table in both the source and target clusters.
3. In the source cluster, copy the data from the ACID table to the external table: INSERT into external select * from acid.
4. Perform DistCp from source to target for the external table.
5. In the target cluster, copy the data from the external table to the ACID table: INSERT into acid select * from external.
Approach 2
=========
Use DLM.
Reference: https://community.cloudera.com/t5/Support-Questions/HIVE-ACID-table-Not-enough-history-available-for-0-x-Oldest/td-p/204551
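To keep the order of operations in Approach 1 straight, the steps can be written down as a small plan that records which cluster each command runs on. This is a sketch only: it does not execute anything, and the function name, table names and HDFS paths are placeholders I made up. It assumes the external table already exists on both sides (step 2).

```python
def acid_copy_plan(acid_table, external_table, src_path, dst_path):
    """Return (cluster, command) pairs for steps 3-5 of Approach 1."""
    return [
        # Step 3: stage the ACID data into the external table on the source.
        ("source", f"INSERT INTO {external_table} SELECT * FROM {acid_table};"),
        # Step 4: copy the external table's files between clusters.
        ("source", f"hadoop distcp {src_path} {dst_path}"),
        # Step 5: load the copied data into the ACID table on the target.
        ("target", f"INSERT INTO {acid_table} SELECT * FROM {external_table};"),
    ]


# Placeholder names and paths for illustration only.
plan = acid_copy_plan(
    "sales_acid",
    "sales_ext",
    "hdfs://source-nn:8020/warehouse/ext/sales_ext",
    "hdfs://target-nn:8020/warehouse/ext/sales_ext",
)
for cluster, command in plan:
    print(f"[{cluster}] {command}")
```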
08-13-2021
09:44 AM
OK, never mind, it was a firewall issue. Everything is working now. Thanks,
08-12-2021
06:23 AM
Thanks, it worked.
08-09-2021
10:48 AM
2 Kudos
Hi @ryu, I recently copied Hive tables from our production cluster to a non-production cluster by running distcp on the Hive warehouse directory location from prod to non-prod. After running distcp, we created the table schema on non-prod to match prod using 'create table'. If a table has partitions, apply 'alter table' to add them. We also use Hive replication to copy tables from our prod cluster to our DR cluster. If this has helped you, please mark the answer as the solution.
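For the 'alter table' step, here is a rough sketch of generating the ADD PARTITION statements from the partition directory names that distcp copied over. The function name, table name and paths are hypothetical, every partition value is quoted as a string, and specially encoded characters in directory names are not handled, so check the output before running it in Hive.

```python
def add_partition_statement(table, partition_path):
    """Build an ALTER TABLE ... ADD PARTITION statement from a relative path.

    `partition_path` is the partition directory relative to the table
    location, e.g. "dt=2021-08-01/region=us".
    """
    specs = []
    for segment in partition_path.strip("/").split("/"):
        key, _, value = segment.partition("=")
        if not key or not value:
            raise ValueError(f"not a key=value partition directory: {segment!r}")
        specs.append(f"{key}='{value}'")
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({', '.join(specs)});"


# Hypothetical table and partition directories for illustration only.
for path in ["dt=2021-08-01/region=us", "dt=2021-08-01/region=eu"]:
    print(add_partition_statement("sales", path))
```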
07-23-2021
03:41 AM
Hello @ryu
As mentioned by @arunek95, we assume Phoenix is enabled for the cluster. If not, kindly enable Phoenix and try the command again. The logging indicates HDP v2.6.1.0 with Phoenix v4.7. The directory "/usr/lib/phoenix/" has the Phoenix client, and you mentioned the same directory has the Phoenix server JAR as well. Kindly verify that the permissions on the JAR are correct, and confirm via "jar -tvf" on the Phoenix server JAR that the class "MetaDataEndpointImpl" is included in it.
The error indicates that Phoenix is encountering the error while creating the SYSTEM tables (upon the first connection to Phoenix). In our internal setup, we see the Phoenix server JAR present in the HBase lib path as well, as a symlink pointing to the Phoenix server JAR in the Phoenix lib path:
/usr/hdp/<Version>/hbase/lib/phoenix-server.jar -> /usr/hdp/<Version>/phoenix/phoenix-server.jar
Kindly ensure the Phoenix server JAR is present in the HBase lib directory as well. Additionally, review the Master logs to check for the error message at the HBase level.
- Smarak
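If the "jar" tool isn't handy on the node, the same class check can be done with a few lines of Python, since a JAR is just a zip archive. This is a sketch: the function name `jar_contains_class` is my own, and it matches on the simple class name only, so it does not assume a particular package path.

```python
import zipfile


def jar_contains_class(jar_path, simple_class_name):
    """Return True if any entry in the JAR is <package>/<simple_class_name>.class."""
    suffix = "/" + simple_class_name + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return any(name.endswith(suffix) for name in jar.namelist())
```

For example, calling `jar_contains_class(path_to_phoenix_server_jar, "MetaDataEndpointImpl")` against the JAR that HBase actually loads should return True; if the path is a symlink, Python follows it to the real file.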