Member since: 09-29-2015
Posts: 123
Kudos Received: 216
Solutions: 47
12-30-2016
07:02 PM
@Joshua Adeleke , there is a statement in the article that "block deletion activity in HDFS is asynchronous". This statement also applies when finalizing an upgrade. Since this processing happens asynchronously, it's difficult to put an accurate wall clock estimate on it. In my experience, I've generally seen the physical on-disk block deletions start happening 2-5 minutes after finalizing an upgrade.
06-16-2016
11:41 PM
9 Kudos
Summary

HDFS Rolling Upgrade facilitates software upgrades of individual components in an HDFS cluster. During the upgrade window, HDFS will not physically delete blocks. Normal block deletion resumes after the administrator finalizes the upgrade. A common source of operational problems is forgetting to finalize an upgrade. If left unaddressed, HDFS will run out of storage capacity, and attempts to delete files will not free space. To avoid this problem, always finalize HDFS rolling upgrades in a timely fashion. This information applies to both Ambari Rolling and Express Upgrades.

Rolling Upgrade Block Handling

The high-level workflow of a rolling upgrade for the administrator is:

1. Initiate the rolling upgrade.
2. Perform the software upgrade on individual nodes.
3. Run typical workloads and validate that the new software works.
4. If validation is successful, finalize the upgrade.
5. If validation discovers a problem, revert to the prior software via one of 2 options:
   - Rollback - Restore the prior software and restore cluster data to its pre-upgrade state.
   - Downgrade - Restore the prior software, but preserve data changes that occurred during the upgrade window.

The Apache Hadoop documentation on HDFS Rolling Upgrade covers the specific commands in more detail.

To satisfy the requirements of Rollback, HDFS will not delete blocks during a rolling upgrade window, which is the time between initiating the rolling upgrade and finalizing it. During this window, DataNodes handle block deletions by moving the blocks to a special directory named "trash" instead of physically deleting them. While the blocks reside in trash, they are not visible to clients performing reads. Thus, the files are logically deleted, but the blocks still consume physical space on the DataNode volumes. If the administrator chooses to rollback, the DataNodes restore these blocks from the trash directory to return the cluster's data to its pre-upgrade state.

After the upgrade is finalized, normal block deletion processing resumes. Blocks previously saved to trash will be physically deleted, and new deletion activity will result in a physical delete instead of a move to trash. Block deletion is asynchronous, so there may be a propagation delay between the user deleting a file and the space being freed as reported by tools like "hdfs dfsadmin -report".

Impact on HDFS Space Utilization

An important consequence of this behavior is that during a rolling upgrade window, HDFS space utilization can only rise, never fall. Attempting to free space by deleting files will be ineffective, because the blocks will be moved to the trash directory instead of physically deleted. Note that this behavior applies not only to files that existed before the upgrade, but also to new files created during the upgrade window. All deletes are handled by moving the blocks to trash.
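To make the bookkeeping concrete, here is a toy model (plain Python, not Hadoop code) of the delete behavior described above: during the upgrade window a delete only moves a block to trash, so volume usage does not drop until the upgrade is finalized.

```python
# Illustrative model of DataNode block deletion during a rolling upgrade
# window. All names here are made up for the sketch; this is not HDFS code.

class DataNodeVolumeModel:
    def __init__(self, blocks):
        self.finalized = dict(blocks)   # block id -> size in bytes
        self.trash = {}
        self.upgrade_in_progress = False

    def used_bytes(self):
        # Blocks in trash still consume physical space on the volume.
        return sum(self.finalized.values()) + sum(self.trash.values())

    def delete(self, block_id):
        size = self.finalized.pop(block_id)
        if self.upgrade_in_progress:
            self.trash[block_id] = size   # logical delete only
        # else: physical delete frees the space immediately

    def finalize_upgrade(self):
        self.upgrade_in_progress = False
        self.trash.clear()                # trash is physically deleted

vol = DataNodeVolumeModel({"blk_1": 128, "blk_2": 64})
vol.upgrade_in_progress = True
vol.delete("blk_1")
print(vol.used_bytes())   # 192: space is not freed during the window
vol.finalize_upgrade()
print(vol.used_bytes())   # 64
```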
An administrator might notice that even after deleting a large number of files, various tools continue to report high space consumption. This includes "hdfs dfsadmin -report", JMX metrics (which are consumed by Apache Ambari) and the NameNode web UI. If a cluster shows these symptoms, check whether a rolling upgrade has been left unfinalized. There are multiple ways to check this. The "hdfs dfsadmin -rollingUpgrade query" command will report "Proceed with rolling upgrade", and the "Finalize Time" will be unspecified.

> hdfs dfsadmin -rollingUpgrade query
QUERY rolling upgrade ...
Proceed with rolling upgrade:
Block Pool ID: BP-1273075337-10.22.2.98-1466102062415
Start Time: Thu Jun 16 14:55:09 PDT 2016 (=1466114109053)
Finalize Time: <NOT FINALIZED>

The NameNode web UI will display a banner at the top stating "Rolling upgrade started". JMX metrics also expose "RollingUpgradeStatus", which will have a "finalizeTime" of 0 if the upgrade has not been finalized.

> curl 'http://10.22.2.98:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'
...
"RollingUpgradeStatus" : {
"blockPoolId" : "BP-1273075337-10.22.2.98-1466102062415",
"createdRollbackImages" : true,
"finalizeTime" : 0,
"startTime" : 1466114109053
},
...

DataNode Disk Layout

This section explores the on-disk layout of DataNodes that have logically deleted blocks during a rolling upgrade window. The following discussion uses a small testing cluster containing only one file. This is a typical disk layout on a DataNode volume hosting exactly one block replica:

data/dfs/data/current
├── BP-1273075337-10.22.2.98-1466102062415
│ ├── current
│ │ ├── VERSION
│ │ ├── dfsUsed
│ │ ├── finalized
│ │ │ └── subdir0
│ │ │ └── subdir0
│ │ │ ├── blk_1073741825
│ │ │ └── blk_1073741825_1001.meta
│ │ └── rbw
│ ├── scanner.cursor
│ └── tmp
└── VERSION

The block file and its corresponding metadata file are in the "finalized" directory. If this file were deleted during a rolling upgrade window, then the block file and its metadata file would move to the trash directory:

data/dfs/data/current
├── BP-1273075337-10.22.2.98-1466102062415
│ ├── RollingUpgradeInProgress
│ ├── current
│ │ ├── VERSION
│ │ ├── dfsUsed
│ │ ├── finalized
│ │ │ └── subdir0
│ │ │ └── subdir0
│ │ └── rbw
│ ├── scanner.cursor
│ ├── tmp
│ └── trash
│ └── finalized
│ └── subdir0
│ └── subdir0
│ ├── blk_1073741825
│ └── blk_1073741825_1001.meta
└── VERSION

As a reminder, block deletion activity in HDFS is asynchronous. It may take several minutes after running the "hdfs dfs -rm" command before the block moves from finalized to trash. One way to measure the extra space consumed by logically deleted files is to run a "du" command on the trash directory.

> du -hs data/dfs/data/current/BP-1273075337-10.22.2.98-1466102062415/trash
8.0K data/dfs/data/current/BP-1273075337-10.22.2.98-1466102062415/trash

Assuming relatively even data distribution across the nodes in the cluster, if this shows that a significant proportion of the volume's capacity is consumed by the trash directory, then that is a sign that an unfinalized rolling upgrade is the source of the space consumption.

Conclusion

Finalize those upgrades!
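As a postscript, the JMX check described earlier is easy to automate. A minimal sketch, assuming the "RollingUpgradeStatus" entry has already been extracted from the NameNodeInfo bean (the real /jmx response wraps beans in a "beans" array, which is omitted here for brevity):

```python
# Sketch: decide from NameNodeInfo JMX data whether a rolling upgrade is
# still unfinalized. The JSON shape follows the sample curl output above;
# fetching it over HTTP is left out.
import json

def upgrade_unfinalized(namenode_info_json):
    """Return True if RollingUpgradeStatus is present with finalizeTime == 0."""
    bean = json.loads(namenode_info_json)
    status = bean.get("RollingUpgradeStatus")
    if not status:
        return False          # no rolling upgrade in progress
    return status.get("finalizeTime", 0) == 0

sample = '''{"RollingUpgradeStatus": {
    "blockPoolId": "BP-1273075337-10.22.2.98-1466102062415",
    "createdRollbackImages": true,
    "finalizeTime": 0,
    "startTime": 1466114109053}}'''
print(upgrade_unfinalized(sample))   # True
```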
06-11-2016
08:09 PM
@Robert Levas, thanks for the great article! May I also suggest adding information about the "hadoop kerbname" or "hadoop org.apache.hadoop.security.HadoopKerberosName" shell command? This is a helpful debugging tool that prints the current principal's short name after Hadoop applies the currently configured auth_to_local rules. If you'd like, feel free to copy-paste my text from this answer: https://community.hortonworks.com/questions/38573/pig-view-hdfs-test-failing-service-hdfs-check-fail.html .
06-08-2016
10:09 PM
2 Kudos
Hello @Mingliang Liu. Nice article! I'd like to add that in step 7, when doing a distro build, I often like to speed it up a little more by passing the argument -Dmaven.javadoc.skip=true. As long as I don't need to inspect JavaDoc changes, this can make the build complete faster.
06-08-2016
06:51 PM
13 Kudos
LDAP Usage
Hadoop may be configured to use LDAP as the source for resolving an authenticated user's group memberships. A common example of where Hadoop needs to resolve group memberships is the permission checking performed by HDFS at the NameNode. The Apache documentation's HDFS Permissions Guide contains further discussion of how the group mapping works: the NameNode calls a configurable plugin to get the user's group memberships before checking permissions. Despite that document's focus on group resolution at the NameNode, many other Hadoop processes also call the group mapping, so the information in this document applies to the entire ecosystem of Hadoop-related components.

As described in that document, the exact implementation of the group mapping is configurable. Here is the documentation of the configuration property from core-default.xml and its default value.

<property>
<name>hadoop.security.group.mapping</name>
<value>org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback</value>
<description>
Class for user to group mapping (get groups for a given user) for ACL.
The default implementation,
org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback,
will determine if the Java Native Interface (JNI) is available. If JNI is
available the implementation will use the API within hadoop to resolve a
list of groups for a user. If JNI is not available then the shell
implementation, ShellBasedUnixGroupsMapping, is used. This implementation
shells out to the Linux/Unix environment with the
<code>bash -c groups</code> command to resolve a list of groups for a user.
</description>
</property>

LDAP integration arises from several possible configuration scenarios:

- hadoop.security.group.mapping=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback, and the host OS integrates directly with LDAP, such as via pam_ldap. A Hadoop process will look up group memberships via standard syscalls, and those syscalls will delegate to pam_ldap.
- hadoop.security.group.mapping=org.apache.hadoop.security.LdapGroupsMapping. A Hadoop process will call the LDAP server directly. This can be useful if the host OS cannot integrate with LDAP for some reason. As a side effect, it is possible that Hadoop will see a different list of group memberships for a user compared to what the host OS reports, such as by running the "groups" command at the shell.
- Since group mapping is pluggable, it is possible (though rare) that a deployment has configured hadoop.security.group.mapping as a custom implementation of the org.apache.hadoop.security.GroupMappingServiceProvider interface. In that case, the integration pattern will vary depending on the implementation details.

Troubleshooting Group Membership

If there is any doubt about how Hadoop is resolving a user's group memberships, then a helpful troubleshooting step is to run the following command while logged in as that user. It prints authentication information for the current user, including group memberships, exactly as the Hadoop code sees them.

> hadoop org.apache.hadoop.security.UserGroupInformation
Getting UGI for current user
User: chris
Group Ids:
Groups: staff everyone localaccounts _appserverusr admin _appserveradm _lpadmin _appstore _lpoperator _developer com.apple.access_screensharing com.apple.access_ssh
UGI: chris (auth:SIMPLE)
Auth method SIMPLE
Keytab false
============================================================

However, in the case of HDFS file permissions, recall that group resolution really occurs at the NameNode before it checks authorization for the user. If configuration differs between the NameNode and the client host, then it's possible that the NameNode will see different results for the group memberships. To see the NameNode's view of the user's group memberships, run the following command.

> hdfs groups
chris : staff everyone localaccounts _appserverusr admin _appserveradm _lpadmin _appstore _lpoperator _developer com.apple.access_screensharing com.apple.access_ssh

Load Patterns

Because Hadoop is a distributed system running across hundreds or thousands of nodes, all independently resolving users' group memberships, this usage pattern may generate unexpectedly high call volume against the LDAP infrastructure. Typical symptoms are slow responses from the LDAP server, perhaps resulting in timeouts. If group resolution takes too long, then the Hadoop process may log a message like this:

2016-06-07 13:07:00,831 WARN security.Groups (Groups.java:getGroups(181)) - Potential performance problem: getGroups(user=chris) took 13018 milliseconds.

The threshold for this warning is configurable, with a default value of 5 seconds.

<property>
<name>hadoop.security.groups.cache.warn.after.ms</name>
<value>5000</value>
<description>
If looking up a single user to group takes longer than this amount of
milliseconds, we will log a warning message.
</description>
</property>

Impacts

The exact impact on the Hadoop process varies. In many cases, such as execution of a YARN container running a map task, the delay simply increases the total latency of that container's execution. A more harmful case is slow lookup at the HDFS JournalNode. The NameNode must be able to log edits to a quorum of JournalNodes (i.e. 2 out of 3 JournalNodes). If multiple JournalNodes simultaneously experience a long delay in group resolution, then it's possible to exceed the NameNode's timeout for JournalNode calls. If the calls time out to 2 or more JournalNodes, then that is a fatal condition: the NameNode must be able to log transactions successfully, and if it cannot, it aborts intentionally. This condition would trigger an unwanted HA failover, and the problem might recur after failover, resulting in flapping. If this happens, the JournalNode logs will show the "performance problem" warning mentioned above, and the NameNode logs will show a message about "Timed out waiting for a quorum of nodes to respond" before a FATAL shutdown error.

Tuning

If your cluster is encountering problems due to high load on the LDAP infrastructure, then there are several possible ways to mitigate this by tuning the Hadoop deployment.

In-Process Caching

Hadoop supports in-process caching of group membership resolution data. Several configuration properties control the behavior of the cache, and tuning them may help mitigate LDAP load issues.

<property>
<name>hadoop.security.groups.cache.secs</name>
<value>300</value>
<description>
This is the config controlling the validity of the entries in the cache
containing the user->group mapping. When this duration has expired,
then the implementation of the group mapping provider is invoked to get
the groups of the user and then cached back.
</description>
</property>

<property>
<name>hadoop.security.groups.negative-cache.secs</name>
<value>30</value>
<description>
Expiration time for entries in the the negative user-to-group mapping
caching, in seconds. This is useful when invalid users are retrying
frequently. It is suggested to set a small value for this expiration, since
a transient error in group lookup could temporarily lock out a legitimate
user.
Set this to zero or negative value to disable negative user-to-group caching.
</description>
</property>

The NameNode and ResourceManager provide administrative commands for forcing invalidation of the in-process group cache. This can be useful for propagating group membership changes without requiring a restart of the NameNode or ResourceManager process.

> hdfs dfsadmin -refreshUserToGroupsMappings
Refresh user to groups mapping successful

> yarn rmadmin -refreshUserToGroupsMappings
16/06/08 11:38:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033

External Caching with Name Service Cache Daemon

If the host OS integrates with LDAP (e.g. hadoop.security.group.mapping=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback and the host OS uses pam_ldap), then the Name Service Cache Daemon (nscd) is an effective approach for caching group memberships at the OS layer. This approach is superior to Hadoop's in-process caching, because nscd allows multiple Hadoop processes running on the same host to share a common cache, avoiding repeated lookups across different processes. However, nscd is unlikely to be beneficial if hadoop.security.group.mapping=org.apache.hadoop.security.LdapGroupsMapping, because Hadoop processes will issue their own LDAP calls directly instead of delegating to the host OS.

Static Group Mapping

Hadoop also supports specifying a static mapping of users to their group memberships in core-site.xml.

<property>
<name>hadoop.user.group.static.mapping.overrides</name>
<value>dr.who=;</value>
<description>
Static mapping of user to groups. This will override the groups if
available in the system for the specified user. In otherwords, groups
look-up will not happen for these users, instead groups mapped in this
configuration will be used.
Mapping should be in this format.
user1=group1,group2;user2=;user3=group2;
Default, "dr.who=;" will consider "dr.who" as user without groups.
</description>
</property>

This approach completely bypasses LDAP (or any other group lookup mechanism) for the specified users. A drawback is that administrators lose centralized management of group memberships through LDAP for those users. In practice, this is not a significant drawback for the HDP service principals, which generally don't change their group memberships. For example:

<property>
<name>hadoop.user.group.static.mapping.overrides</name>
<value>hive=hadoop,hive;hdfs=hadoop,hdfs;oozie=users,hadoop,oozie;knox=hadoop;mapred=hadoop,mapred;zookeeper=hadoop;falcon=hadoop;sqoop=hadoop;yarn=hadoop;hcat=hadoop;ams=hadoop;root=hadoop;ranger=hadoop;rangerlogger=hadoop;rangeradmin=hadoop;ambari-qa=hadoop,users;</value>
</property>

Static mapping is particularly effective at mitigating the problem of slow group lookups at the JournalNode discussed earlier. JournalNode calls are almost exclusively performed by the hdfs service principal, so specifying it in the static mapping removes the need for the JournalNode to call LDAP. Note that any of this configuration tuning requires a restart of the relevant Hadoop process (such as the NameNode or JournalNode) to take effect.
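For illustration, the override format documented above ("user1=group1,group2;user2=;user3=group2;") can be parsed as follows. This is a sketch of the documented syntax only, not Hadoop's own parser:

```python
# Parse hadoop.user.group.static.mapping.overrides into a dict of
# user -> list of groups, following the format quoted in the property
# description above.

def parse_static_mapping(value):
    mapping = {}
    for entry in value.strip().strip(";").split(";"):
        if not entry:
            continue
        user, _, groups = entry.partition("=")
        # "user2=" means the user has no groups.
        mapping[user] = [g for g in groups.split(",") if g]
    return mapping

print(parse_static_mapping("dr.who=;"))
# {'dr.who': []}
print(parse_static_mapping("hive=hadoop,hive;hdfs=hadoop,hdfs"))
# {'hive': ['hadoop', 'hive'], 'hdfs': ['hadoop', 'hdfs']}
```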
02-24-2016
07:36 PM
3 Kudos
Apache JIRA HDFS-9552 is a documentation patch that clarifies exactly what kinds of permission checks are enforced by HDFS for each kind of user operation on a file system path. This documentation is not yet live on hadoop.apache.org, so I am posting the same content here in the interim until HDFS-9552 ships in an official Apache release.

Permission Checks

Each HDFS operation demands that the user has specific permissions (some combination of READ, WRITE and EXECUTE), granted through file ownership, group membership or the other permissions. An operation may perform permission checks at multiple components of the path, not only the final component. Additionally, some operations depend on a check of the owner of a path.

All operations require traversal access. Traversal access demands the EXECUTE permission on all existing components of the path, except for the final path component. For example, for any operation accessing /foo/bar/baz, the caller must have EXECUTE permission on /, /foo and /foo/bar.

The following table describes the permission checks performed by HDFS for each component of the path.
- Ownership: Whether or not to check if the caller is the owner of the path. Typically, operations that change the ownership or permission metadata demand that the caller is the owner.
- Parent: The parent directory of the requested path. For example, for the path /foo/bar/baz, the parent is /foo/bar.
- Ancestor: The last existing component of the requested path. For example, for the path /foo/bar/baz, the ancestor path is /foo/bar if /foo/bar exists. The ancestor path is /foo if /foo exists but /foo/bar does not exist.
- Final: The final component of the requested path. For example, for the path /foo/bar/baz, the final path component is /foo/bar/baz.
- Sub-tree: For a path that is a directory, the directory itself and all of its child sub-directories, recursively. For example, for the path /foo/bar/baz, which has 2 sub-directories named buz and boo, the sub-tree is /foo/bar/baz, /foo/bar/baz/buz and /foo/bar/baz/boo.

| Operation | Ownership | Parent | Ancestor | Final | Sub-tree |
|---|---|---|---|---|---|
| append | NO | N/A | N/A | WRITE | N/A |
| concat | NO [2] | WRITE (sources) | N/A | READ (sources), WRITE (destination) | N/A |
| create | NO | N/A | WRITE | WRITE [1] | N/A |
| createSnapshot | YES | N/A | N/A | N/A | N/A |
| delete | NO [2] | WRITE | N/A | N/A | READ, WRITE, EXECUTE |
| deleteSnapshot | YES | N/A | N/A | N/A | N/A |
| getAclStatus | NO | N/A | N/A | N/A | N/A |
| getBlockLocations | NO | N/A | N/A | READ | N/A |
| getContentSummary | NO | N/A | N/A | N/A | READ, EXECUTE |
| getFileInfo | NO | N/A | N/A | N/A | N/A |
| getFileLinkInfo | NO | N/A | N/A | N/A | N/A |
| getLinkTarget | NO | N/A | N/A | N/A | N/A |
| getListing | NO | N/A | N/A | READ, EXECUTE | N/A |
| getSnapshotDiffReport | NO | N/A | N/A | READ | READ |
| getStoragePolicy | NO | N/A | N/A | READ | N/A |
| getXAttrs | NO | N/A | N/A | READ | N/A |
| listXAttrs | NO | EXECUTE | N/A | N/A | N/A |
| mkdirs | NO | N/A | WRITE | N/A | N/A |
| modifyAclEntries | YES | N/A | N/A | N/A | N/A |
| removeAcl | YES | N/A | N/A | N/A | N/A |
| removeAclEntries | YES | N/A | N/A | N/A | N/A |
| removeDefaultAcl | YES | N/A | N/A | N/A | N/A |
| removeXAttr | NO [2] | N/A | N/A | WRITE | N/A |
| rename | NO [2] | WRITE (source) | WRITE (destination) | N/A | N/A |
| renameSnapshot | YES | N/A | N/A | N/A | N/A |
| setAcl | YES | N/A | N/A | N/A | N/A |
| setOwner | YES [3] | N/A | N/A | N/A | N/A |
| setPermission | YES | N/A | N/A | N/A | N/A |
| setReplication | NO | N/A | N/A | WRITE | N/A |
| setStoragePolicy | NO | N/A | N/A | WRITE | N/A |
| setTimes | NO | N/A | N/A | WRITE | N/A |
| setXAttr | NO [2] | N/A | N/A | WRITE | N/A |
| truncate | NO | N/A | N/A | WRITE | N/A |

[1] WRITE access on the final path component during create is only required if the call uses the overwrite option and there is an existing file at the path.
[2] Any operation that checks WRITE permission on the parent directory also checks ownership if the sticky bit is set.
[3] Calling setOwner to change the user that owns a file requires HDFS super-user access. HDFS super-user access is not required to change the group, but the caller must be a member of the specified group.
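As a worked example of the traversal rule, this small sketch (illustrative, not HDFS code) computes which path components receive the EXECUTE check, assuming all components exist:

```python
# Compute the components that require EXECUTE for traversal access:
# every existing component of the path except the final one.

def traversal_components(path):
    parts = [p for p in path.split("/") if p]
    comps = ["/"]                         # the root is always traversed
    for i in range(len(parts) - 1):       # stop before the final component
        comps.append("/" + "/".join(parts[:i + 1]))
    return comps

print(traversal_components("/foo/bar/baz"))
# ['/', '/foo', '/foo/bar']
```

This matches the example in the text: an operation on /foo/bar/baz needs EXECUTE on /, /foo and /foo/bar, but not on /foo/bar/baz itself.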
02-03-2016
07:53 PM
29 Kudos
Garbage Collection Best Practice

GC Configuration

This is an example implementation of our current recommendation for best practice GC tuning, driven by the requirements of the NameNode:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -Xms1G -Xmx1G -XX:NewSize=128M -XX:MaxNewSize=128M -XX:PermSize=128M -XX:MaxPermSize=256M -verbose:gc -Xloggc:/Users/chris/hadoop-deploy-trunk/hadoop-3.0.0-SNAPSHOT/logs/gc.log-`date +'%Y%m%d%H%M'` -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:ErrorFile=/Users/chris/hadoop-deploy-trunk/hadoop-3.0.0-SNAPSHOT/logs/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError $HADOOP_NAMENODE_OPTS"

This is from a small development environment, so the heap is much smaller (1 GB) than what you'll see in a real production cluster. In the past, we've shared with you what to use for GC settings, but have we ever explained why?

GC Tuning Rationale

Let's address each setting individually and explain why we recommend it.

-XX:+UseConcMarkSweepGC

This flag enables the Concurrent Mark Sweep garbage collector. Why do we use this garbage collector? The JVM offers multiple garbage collector implementations, each making different engineering trade-offs to serve different kinds of applications. Literature on garbage collection discusses a trade-off between responsiveness (how quickly an application responds) and throughput (maximizing the total amount of work done by an application in a period of time). Quoting Oracle's garbage collection tuning guide:

The concurrent collector is designed for applications that prefer shorter garbage collection pauses and that can afford to share processor resources with the garbage collector while the application is running.
Typically applications which have a relatively large set of long-lived data (a large tenured generation), and run on machines with two or more processors tend to benefit from the use of this collector.

These characteristics should sound familiar for our Hadoop daemons. Taking the NameNode as an example:

- We prefer shorter pauses, because we don't want a client's RPC call blocked waiting for a garbage collection, which would harm latency.
- It's fine to share CPU resources between the garbage collector and our application logic. Realistic Hadoop deployments run on multi-core/multi-processor machines where the full resources are dedicated to running Hadoop and no other applications.
- We have a large set of long-lived data. Every file is represented in memory as an INode object, and it's alive for the duration of the process unless someone deletes the file.

CMS suits our usage patterns well. Is there anything wrong with CMS? If poorly configured, fragmentation can accumulate in the tenured generation over time. CMS collections do not perform periodic compaction (rearranging existing object allocations into contiguous space) to fix the fragmentation. In the degenerate case, CMS is forced to "stop the world" to do a compaction, which is not a concurrent operation. We do not expect this to happen with our current recommendations for GC configuration.

Personal horror story: I once worked on an application (not Hadoop) with a unique memory allocation pattern that drove CMS into a 2-hour stop-the-world collection and compaction. I've been meaning to contact Guinness to find out if I've set the world record for longest GC pause.

-XX:ParallelGCThreads=8

Sets the number of threads to use for garbage collection (the "concurrency" in "Concurrent Mark Sweep"). Running jstack on the JVM process will show this number of threads named like:
Running jstack on the JVM process will show this number of threads named like: "Gang worker#0 (Parallel GC Threads)" We expect tuning this correctly to shorten the amount of time required to complete a garbage collection. In practice, we’ve seen that a value of 8 works well on typical hardware. -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 By default, CMS tracks internal heuristics of the application’s usage pattern to determine when to trigger a garbage collection. Typically, this tends to delay collection until the old generation is nearly full. In the worst case, this can degrade to a stop-the-world full garbage collection. By setting these properties, we tell CMS to initiate garbage collection based solely on fraction of heap usage instead of its internal heuristics. The fractional value must be chosen carefully. If it is too large, then it still might not prevent costly full garbage collections. If it is too small, then garbage collection will run too frequently, reducing the application’s overall throughput. In practice, we have found that a value of 70 works well, though it’s possible that individual deployments might need tuning. -Xms1G -Xmx1G -Xms sets the minimum heap size, and -Xmx sets the maximum heap size. When a JVM process starts, it initially allocates the minimum heap size. If the need for memory arises at runtime, it can go back to the OS and allocate more, up to the maximum heap size. Why do we recommend setting these to the same value? Those incremental memory allocations can cause unpredictable runtime performance. Additionally, starting with a low minimum heap size could cause instability as the need for a larger heap grows over time. If it needs to grow to the maximum, but the OS doesn’t actually have that amount of memory available to hand out, then the process could be subject to OutOfMemoryError conditions, or possibly even the OOM killer stopping the whole process on Linux. 
Setting minimum and maximum to the same value eliminates these potential sources of instability.

-XX:NewSize=128M -XX:MaxNewSize=128M

These control the minimum and maximum size of the new generation in the generational garbage collector architecture. We've found that for large heaps, the JVM defaults are too small. Our current recommendation is to set this to 1/8 to 1/6 of the total heap size. We recommend setting both to the same value for stability reasons, similar to the discussion of total heap size above.

-XX:PermSize=128M -XX:MaxPermSize=256M

These control the minimum and maximum size of the permanent generation, which is a portion of the heap dedicated to application metadata, such as Java class definitions and interned strings. If this is too small, then you'll see errors like:

java.lang.OutOfMemoryError: PermGen space

These settings are our current recommendation for what works well for our usage patterns.

GC Troubleshooting

Additionally, some of the settings are helpful for troubleshooting.

-verbose:gc

This enables verbose garbage collection logging. For example:

2014-10-31T10:37:54.313+0800: 2.211: [GC2014-10-31T10:37:54.313+0800: 2.211: [ParNew: 104960K->9207K(118016K), 0.0099670 secs] 104960K->9207K(1035520K), 0.0100520 secs] [Times: user=0.04 sys=0.01, real=0.01 secs]
2014-10-31T10:37:55.307+0800: 3.204: [GC2014-10-31T10:37:55.307+0800: 3.204: [ParNew: 114167K->13056K(118016K), 0.0388680 secs] 114167K->26862K(1035520K), 0.0389190 secs] [Times: user=0.18 sys=0.03, real=0.04 secs]
2014-10-31T11:44:31.748+0800: 3999.646: [Full GC2014-10-31T11:44:31.748+0800: 3999.646: [CMS: 13806K->22721K(917504K), 0.0854000 secs] 79787K->22721K(1035520K), [CMS Perm : 26935K->26918K(131072K)], 0.0854930 secs] [Times: user=0.08 sys=0.01, real=0.09 secs]

Each log line shows:

- Date/time of the garbage collection.
- Type of garbage collection (i.e. GC vs. Full GC).
- Which garbage collector ran (i.e. ParNew on the new generation vs. CMS for the full GC).
- <used heap size before> -> <used heap size after>.
- Timing information. Generally "real" is the most useful metric, because it's actual wall clock time.

Possible warning signs that heap or GC configuration is incorrect:

- Very frequent garbage collections. In particular, we expect full GC to be rare.
- Individual garbage collections take a long "real" time.
- Full GC happens frequently and does not significantly reduce the amount of used heap. This is often accompanied by OutOfMemoryError in the main logs, and likely means that the process needs a larger heap. For example, the NameNode may be trying to store more inodes than will fit into its current heap size.

-Xloggc:/Users/chris/hadoop-deploy-trunk/hadoop-3.0.0-SNAPSHOT/logs/gc.log-`date +'%Y%m%d%H%M'`

This controls the path to the garbage collection log file, with a date/timestamp at the end of the file name corresponding to process start time.

-XX:+PrintGCDetails

Logs even more details about garbage collection.

-XX:+PrintGCTimeStamps

Adds a timestamp to each line in the garbage collection log.

-XX:+PrintGCDateStamps

Adds the date to each line in the garbage collection log.

-XX:ErrorFile=/Users/chris/hadoop-deploy-trunk/hadoop-3.0.0-SNAPSHOT/logs/hs_err_pid%p.log

If the JVM process crashes, it logs a crash report to this path. This includes useful information like a list of the threads that were running at the time, and possibly a native code backtrace. If not specified, the default is hs_err_pid<pid>.log in the process working directory.
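If you want to mine these logs for the warning signs listed above, the minor-GC lines can be parsed mechanically. A rough sketch, with a regex tuned only to the sample format shown here (other JVM versions and GC flags format these lines differently):

```python
# Extract real pause time and heap before/after from ParNew minor-GC
# lines in the format of the sample log above.
import re

LINE = re.compile(
    r"\[ParNew: (\d+)K->(\d+)K\(\d+K\), [\d.]+ secs\] "
    r"(\d+)K->(\d+)K\(\d+K\), [\d.]+ secs\] "
    r"\[Times: user=[\d.]+ sys=[\d.]+, real=([\d.]+) secs\]")

def parse_minor_gc(line):
    m = LINE.search(line)
    if not m:
        return None
    _young_before, _young_after, heap_before, heap_after, real = m.groups()
    return {"heap_before_k": int(heap_before),
            "heap_after_k": int(heap_after),
            "real_secs": float(real)}

sample = ("2014-10-31T10:37:54.313+0800: 2.211: [GC2014-10-31T10:37:54.313"
          "+0800: 2.211: [ParNew: 104960K->9207K(118016K), 0.0099670 secs] "
          "104960K->9207K(1035520K), 0.0100520 secs] "
          "[Times: user=0.04 sys=0.01, real=0.01 secs]")
print(parse_minor_gc(sample))
# {'heap_before_k': 104960, 'heap_after_k': 9207, 'real_secs': 0.01}
```

A long "real_secs" value, or a heap_after that stays close to heap_before across many collections, would correspond to the warning signs described above.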
-XX:+HeapDumpOnOutOfMemoryError

Dumps the state of the heap to a file when OutOfMemoryError is thrown. This can be useful for post-mortem analysis to see exactly what kinds of objects were consuming a lot of the heap.

-XX:HeapDumpPath=<path>

If specified, this is the path to the file where the heap dump will be written. The default is java_pid<pid>.hprof in the process working directory.

Tools

JConsole

JConsole is a general-purpose monitoring tool for displaying JMX metrics. It offers a few helpful things for watching garbage collection, particularly if a currently running process was started without GC logging enabled and you want to get a glimpse of things without restarting the process. The Memory tab charts information on memory usage and can break it down by the different GC pools. There is also a "Perform GC" button here. That will force a stop-the-world full GC, so don't click it unless you really mean it. The VM Summary tab has a section that states the total amount of time spent in each kind of garbage collector. Keep in mind that this refers to the whole lifetime of the JVM process, so the time spent is ever-growing, and you'll see larger numbers for longer-lived processes.

JVisualVM

JVisualVM is a Java Virtual Machine monitoring, troubleshooting, and profiling tool. It offers multiple plugins, including the Visual GC plugin, which lets you interactively watch activity move across the various generations of the garbage collector.

Future Directions: The G1 Garbage Collector

The G1 Garbage Collector is a revamped garbage collection architecture that aims to replace CMS. I do not know of anyone who has done extensive testing of Hadoop with G1, and we do not test and certify with G1. However, we would like to investigate G1 further in the future and perform testing. One of its most significant benefits over CMS is that it is a compacting collector. (See the personal horror story earlier.)
Quoting the documentation:

The Garbage-First (G1) collector is a server-style garbage collector, targeted for multi-processor machines with large memories. It meets garbage collection (GC) pause time goals with a high probability, while achieving high throughput. The G1 garbage collector is fully supported in Oracle JDK 7 update 4 and later releases. The G1 collector is designed for applications that:

- Can operate concurrently with application threads like the CMS collector.
- Compact free space without lengthy GC induced pause times.
- Need more predictable GC pause durations.
- Do not want to sacrifice a lot of throughput performance.
- Do not require a much larger Java heap.

G1 is planned as the long term replacement for the Concurrent Mark-Sweep Collector (CMS). Comparing G1 with CMS, there are differences that make G1 a better solution. One difference is that G1 is a compacting collector. G1 compacts sufficiently to completely avoid the use of fine-grained free lists for allocation, and instead relies on regions. This considerably simplifies parts of the collector, and mostly eliminates potential fragmentation issues. Also, G1 offers more predictable garbage collection pauses than the CMS collector, and allows users to specify desired pause targets.

Personal horror story, part 2: After the 2-hour GC pause I discussed earlier, we investigated use of G1. It was considered beta at the time, but it was designed specifically to address our kind of allocation pattern. What we saw was that G1 out-performed every other garbage collection configuration we tried, including CMS. However, it also caused the JVM process to segfault approximately every 2 days. Based on stability concerns, we couldn't put this configuration in production. That was several years ago, so perhaps G1 has stabilized by now.
References

- Java Garbage Collection Basics: http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html
- Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
- Java Memory Management: http://javabook.compuware.com/content/memory/how-garbage-collection-works.aspx
- JVM performance optimization, Part 3: Garbage collection: http://www.javaworld.com/article/2078645/java-se/jvm-performance-optimization-part-3-garbage-collection.html
- OOM Killer: http://linux-mm.org/OOM_Killer
- Java HotSpot VM Options: http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
- Using JConsole: http://docs.oracle.com/javase/6/docs/technotes/guides/management/jconsole.html
- jvisualvm - Java Virtual Machine Monitoring, Troubleshooting, and Profiling Tool: http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
- Getting Started with the G1 Garbage Collector: http://www.oracle.com/technetwork/tutorials/tutorials-1876574.html