Member since: 09-29-2015
Posts: 123
Kudos Received: 215
Solutions: 47
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3265 | 06-23-2016 06:29 PM
 | 768 | 06-22-2016 09:16 PM
 | 1581 | 06-17-2016 06:07 PM
 | 713 | 06-16-2016 08:27 PM
 | 1523 | 06-15-2016 06:44 PM
01-13-2017
05:20 PM
@Rajit Saha, that's a nice bug fix! @Artem Ervits, I tried looking in the repo link that you gave, but I couldn't find a branch or tag that I was certain matched the version number mentioned in the question. In theory, I'd expect the HDP-2.4.0.0-tag: https://github.com/hortonworks/hadoop-lzo_previous_to_gerrit/tree/HDP-2.4.0.0-tag However, the jar version is listed as 0.4.19 in the pom.xml file, so I'm not sure why the jar mentioned in the question shows 0.6.0. Maybe the jar version is being overridden at HDP build time. To be absolutely certain we're working with the correct code version, it would be best to check with a Hortonworks release engineer. @Raja Aluri, what are your thoughts?
12-30-2016
07:02 PM
@Joshua Adeleke , there is a statement in the article that "block deletion activity in HDFS is asynchronous". This statement also applies when finalizing an upgrade. Since this processing happens asynchronously, it's difficult to put an accurate wall clock estimate on it. In my experience, I've generally seen the physical on-disk block deletions start happening 2-5 minutes after finalizing an upgrade.
10-24-2016
10:19 PM
@Peter Coates, once the fs.s3a.proxy.host and fs.s3a.proxy.port settings are configured, S3A will configure the S3 client in the AWS SDK to route all HTTP requests through the proxy server. This is not set up in a way that differentiates write traffic from read traffic, so I can't think of any explanation for seeing different behavior for different kinds of operations. Is it possible that the proxy configuration hasn't been propagated to all relevant processes (Hive CLI, metastore, YARN containers, etc.), and it's just a coincidence that some of these processes do the writing and some do the reading?
08-28-2016
08:04 PM
@Davide Ferrari, correct, the command must be run as the same user that ran the target process. Sorry for my delayed response. I'm glad to see you worked through this anyway.
06-29-2016
03:22 PM
@Payel Datta, have you tried using NetCat to test whether it can bind and listen on these ports, as I suggested in an earlier comment? "nc -l 2888" or "nc -l 3888". If a similar failure happens with NetCat, then that would confirm there is some kind of networking problem on the hosts that needs further investigation, though I can't tell exactly what that networking problem would be from the information here.
06-23-2016
06:29 PM
3 Kudos
Hello @Xiaobing Zhou, This may indicate that either a NameNode or JournalNodes were unresponsive for a period of time. This can lead to a cascading failure, whereby a NameNode HA failover occurs, the other NameNode becomes active, the previous NameNode thinks it is still active, and then QJM rejects that NameNode for not operating within the same "epoch" (logical period of time). This is by design, as QJM is intended to prevent 2 NameNodes from mistakenly acting as active in a split-brain scenario. There are multiple potential reasons for unresponsiveness in the NameNode/JournalNode interaction. Reviewing logs from the NameNodes and JournalNodes would likely reveal more details. There are several common causes to watch for:
- A long stop-the-world garbage collection pause may surpass the timeout threshold for the call. Garbage collection logging would show what kind of garbage collection activity the process is doing. You might also see log messages about the "JvmPauseMonitor". Consider reviewing the article NameNode Garbage Collection Configuration: Best Practices and Rationale to make sure your cluster's heap and garbage collection settings match best practices.
- In environments that integrate with LDAP for resolution of users' group memberships, load problems on the LDAP infrastructure can cause delays. In extreme cases, we have seen such timeouts at the JournalNodes cause edit logging calls to fail, which causes a NameNode abort and an HA failover. See Hadoop and LDAP: Usage, Load Patterns and Tuning for a more detailed description and potential mitigation steps.
- There may be a failure in network connectivity between the NameNode and the JournalNodes. This tends to be rare, because NameNodes and JournalNodes tend to be colocated on the same host or placed relatively close to one another in the network topology. Still, it is worth verifying that basic network connectivity between all NameNode hosts and all JournalNode hosts is working.
06-23-2016
06:23 PM
1 Kudo
@roy p, that error refers to Log4J custom hooks that are deployed onto the standard Hadoop classpath inside HDInsight clusters to integrate with Azure telemetry. I expect you can resolve this by editing log4j.properties to remove all references to the EtwAppender, and there will be no harm done. Alternatively, you could look for the relevant jars within the HDInsight cluster and copy them onto the classpath of the client machine. The names of those jar files are microsoft-log4j-*.jar, WindowsAzureETWSink-*.jar, and WindowsAzureTableSink-*.jar. I am not aware of anyone running these jars outside of an HDInsight cluster though. I don't know if they are tightly coupled to running within the HDInsight cluster, so it's unclear to me if external usage will work. (I mentioned that this relates to Azure telemetry. I don't know if an external client machine running outside the HDInsight cluster will have a routable network path to those telemetry services.)
Overall, I recommend resolving this by editing log4j.properties instead of trying to copy the jars.
06-23-2016
06:15 PM
3 Kudos
@Bruce Perez, HDFS does not allocate capacity separately per user. However, it is possible to use HDFS Quotas to enforce limits on metadata consumption and space consumption by specific directories. A common setup is to create sub-directories dedicated to different users, apply HDFS Permissions on each directory to guarantee that only that user can write to it, and then set an appropriate quota on each directory. The permissions guarantee that each user can only write to their own directory, and the quotas limit metadata and space consumption by each user. The overall effect in a multi-tenant cluster is that no single user can consume all of the cluster's space and harm other users' workloads. A sketch of that setup is shown below.
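As a sketch of that setup (the user name, paths, and limits below are hypothetical and would need to be adapted to your cluster):
hdfs dfs -mkdir -p /user/alice/data
hdfs dfs -chown alice:alice /user/alice/data
hdfs dfs -chmod 700 /user/alice/data
hdfs dfsadmin -setQuota 1000000 /user/alice/data      # limit on total files and directories
hdfs dfsadmin -setSpaceQuota 10t /user/alice/data     # limit on raw space, which counts replication
hdfs dfs -count -q -h /user/alice/data                # report the quotas and current usage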
06-23-2016
04:12 PM
@roy p, if you want to route the data from Flume to WASB instead of HDFS, then I expect you can achieve that by changing the "hdfs:" URI to a "wasb:" URI. The full WASB URI will have an authority component that references an Azure Storage account and a container within that account. You can get the WASB URI by looking at configuration property fs.defaultFS in core-site.xml. If that doesn't work, then I recommend creating a new question specifically asking how to configure Flume to write to a file system different from HDFS. Please also apply the "flume" tag to the question. That will help get attention from Flume experts.
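For illustration, a hedged sketch of what the sink definition might look like in the Flume agent configuration; the agent, sink, channel, storage account, and container names are all hypothetical, and only the sink portion of the configuration is shown:
# flume.conf fragment -- Flume's HDFS sink writes to whatever file system the URI names
agent1.sinks.wasbSink.type = hdfs
agent1.sinks.wasbSink.hdfs.path = wasb://mycontainer@myaccount.blob.core.windows.net/flume/events
agent1.sinks.wasbSink.channel = memChannel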
06-22-2016
09:16 PM
2 Kudos
@roy p, in an HDInsight cluster, the default file system is WASB, which is a Hadoop-compatible file system backed by Azure Storage. The default file system is defined by property fs.defaultFS in core-site.xml. In an HDInsight cluster, you'll see this property set to a "wasb:" URI. When running Hadoop FileSystem Shell commands, if the path is not a qualified URI naming the scheme of the file system, then it assumes that you want the default file system. Thus, running "hadoop fs -ls /" shows results from the WASB file system as persisted in Azure Storage. HDInsight clusters also run a local instance of HDFS as a supplementary, non-default file system. For a file system that is not the default, the shell commands may reference paths in that file system by qualifying the URI with the scheme. Thus, running "hadoop fs -ls hdfs://mycluster/" shows results from the local HDFS file system, even though WASB is the default file system in an HDInsight cluster. Since the two commands reference paths on two different file systems, each containing its own set of files, the final results displayed are different.
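A quick way to see the distinction from the shell; the hdfs://mycluster/ authority comes from the example above, while the WASB account and container names are placeholders for whatever your fs.defaultFS actually contains:
hadoop fs -ls /                                                    # default file system (WASB on HDInsight)
hadoop fs -ls wasb://mycontainer@myaccount.blob.core.windows.net/  # same storage, with the URI fully qualified
hadoop fs -ls hdfs://mycluster/                                    # the cluster-local HDFS instance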
06-21-2016
08:26 PM
1 Kudo
@Jacek Dobrowolski, Apache Hadoop currently uses Apache Curator version 2.7.1, and HDP follows the same model. The fix for CURATOR-209 went into version 2.10.0. If you'd like to request that Hadoop use Curator 2.10.0 (or later), then I recommend filing an Apache JIRA to report the problem and request the upgrade. Please keep in mind that a dependency upgrade of Curator requires a lot of due diligence and testing to make sure it will remain backward-compatible for the broader Hadoop ecosystem. See HADOOP-11492 for an example of how a Curator upgrade was handled in the past.
06-21-2016
12:21 AM
@subacini balakrishnan, DistCp works by distributing the work of copying files across all of the nodes in a cluster. Have you noticed if these failures occur on specific nodes? If so, then it might indicate a misconfiguration or a network connectivity problem on those particular nodes.
06-19-2016
08:08 PM
@Saurabh Kumar, thank you for the clarification. I misinterpreted your original question. I have edited my answer to add more details about another form of authentication that is supported, named AltKerberos, and also information on how you might go about injecting your own custom authentication filter if completely custom logic is required.
06-17-2016
07:24 PM
@manichinnari555, I'm glad to hear this helped. I believe setting at table creation time should be sufficient.
06-17-2016
06:07 PM
2 Kudos
@manichinnari555, I noticed that the table was stored as Parquet. HIVE-11401 tracks a known bug in Hive when filtering on a partition column of a Parquet table. There is no immediate plan to bring this patch into HDP, but a known workaround is to disable predicate pushdown by setting property hive.optimize.index.filter to false.
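For example, a hedged sketch of applying the workaround at the session level; the table and column names here are hypothetical:
hive> SET hive.optimize.index.filter=false;
hive> SELECT COUNT(*) FROM sales_parquet WHERE dt = '2016-06-17';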
06-17-2016
04:55 PM
@Saurabh Kumar, Hadoop's HTTP servers can be configured to require Kerberos authentication via SPNEGO. After enabling that, users would be required to run kinit before curl, and they would have to use the curl options that enable SPNEGO and that save and reuse session cookies, e.g.:

curl --negotiate -u : -b cookies.txt -c cookies.txt http://namenode:50070/webhdfs/v1/?op=LISTSTATUS

For details on how to configure this, please refer to the Apache documentation on HTTP Authentication. Please note that enabling HTTP authentication is a separate configuration step from enabling Kerberos security in a cluster, as discussed in the documentation on Secure Mode. This means that even after enabling Kerberos security in a cluster, the HTTP servers will not demand user authentication by default. It would still be necessary to follow the separate steps for enabling HTTP authentication.

If an alternate form of authentication is required for browser clients, different from Kerberos via SPNEGO, then it's possible to write a custom plugin that implements any arbitrary logic that you need. Quoting the HTTP Authentication guide: "If a custom authentication mechanism is required for the HTTP web-consoles, it is possible to implement a plugin to support the alternate authentication mechanism (refer to Hadoop hadoop-auth for details on writing an AuthenticatorHandler)." The Hadoop Auth documentation and its linked pages provide more details. The Configuration page is particularly relevant for its discussion of AltKerberos Configuration. This shows how you can require Kerberos authentication for some clients, but delegate to an alternate authentication mechanism for others (typically browsers). AltKerberos still assumes that Kerberos is enabled. If you need completely custom logic, without any Kerberos dependency, then it probably requires looking at the AuthenticationFilterInitializer in Hadoop and using it as inspiration to write your own FilterInitializer, which injects its own custom filter.

Another potential option, depending on your exact requirements, is to enforce perimeter security via firewall rules that block access to the HTTP port unless the packets originate from a particular set of hosts, and then limit login access to that set of hosts to a specific set of authorized users.
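As a sketch of what the core-site.xml side of SPNEGO for the web UIs can look like: the property names come from the Apache HTTP Authentication documentation, while the realm, keytab path, and secret file path below are placeholders for illustration only.
<property>
  <name>hadoop.http.filter.initializers</name>
  <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
<property>
  <name>hadoop.http.authentication.signature.secret.file</name>
  <value>/etc/security/http_secret</value>
</property>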
06-17-2016
04:29 PM
@Payel Datta, I would not expect those permissions on the myid file to be a problem. Even though root is the owner, the permissions still allow read access to everyone. The ZooKeeper process only needs read access to that file. Have you tried any network troubleshooting with tools like NetCat, like I suggested in the last comment?
06-17-2016
06:24 AM
6 Kudos
@Steve Steyic, if the alert states that the percentage of heap used is too high, then it's possible that this is just a natural consequence of the way the JVM manages the heap and reports its usage, rather than a real operational problem. 0.5 M blocks on a DataNode is not necessarily too many.

As objects are allocated, they fill the heap. Many objects turn out to be short-lived and get garbage collected relatively quickly. Longer-lived objects might not get cleared until the heap is very highly consumed and the JVM decides to trigger a full garbage collection. If you imagine a plot of heap usage over time, the measured usage can look like a sawtooth pattern. Sometimes a JVM will simply hover while reporting a high proportion of its heap used. This can happen if there were a large number of long-lived objects that couldn't be collected quickly, but there isn't quite enough new allocation activity happening right now to trigger the full garbage collection that would bring the reported usage back down.

One way to test this theory is to trigger a full garbage collection manually on one of the DataNodes, such as by running this command against the DataNode's PID:

jcmd <PID> GC.run

If the reported heap usage drops significantly after running this command, then that validates the theory above.

A possible way to "stay ahead" of this problem and do full garbage collections more frequently is to add the following garbage collection options to HADOOP_DATANODE_OPTS:

-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=<percent>

The typical value for <percent> is 70. In some cases, this may need tuning to find the optimal value, but in practice, I have seen 70 work well almost always. If you're interested in more background on garbage collection, you might want to read NameNode Garbage Collection Configuration: Best Practices and Rationale. That article is admittedly focused on the NameNode rather than the DataNode, but much of the background information on garbage collection and configuration applies to any JVM process.
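A minimal sketch of where those flags could live, assuming you manage DataNode JVM options through hadoop-env.sh and that the DataNode is already running the CMS collector; the 70 is the typical starting value mentioned above:
# hadoop-env.sh -- illustrative; these flags only apply when the CMS collector is in use
export HADOOP_DATANODE_OPTS="-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 ${HADOOP_DATANODE_OPTS}"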
06-17-2016
05:54 AM
@Eric Periard, no, you are not going crazy. 🙂 You're correct that JIRA issue AMBARI-15235 is related. That's a change that helps on the display side. AMBARI-17603 is another patch that gets more at the root cause of the problem by optimizing the JMX query.
06-16-2016
11:41 PM
9 Kudos
Summary

HDFS Rolling Upgrade facilitates upgrading the software of individual components in an HDFS cluster independently. During the upgrade window, HDFS will not physically delete blocks. Normal block deletion resumes after the administrator finalizes the upgrade. A common source of operational problems is forgetting to finalize an upgrade. If left unaddressed, HDFS will run out of storage capacity, and attempts to delete files will not free space. To avoid this problem, always finalize HDFS rolling upgrades in a timely fashion. This information applies to both Ambari Rolling and Express Upgrades.

Rolling Upgrade Block Handling

The high-level workflow of a rolling upgrade for the administrator is:
1. Initiate the rolling upgrade.
2. Perform the software upgrade on individual nodes.
3. Run typical workloads and validate that the new software works.
4. If validation is successful, finalize the upgrade.
5. If validation discovers a problem, revert to the prior software via one of two options:
   - Rollback - Restore the prior software and restore the cluster's data to its pre-upgrade state.
   - Downgrade - Restore the prior software, but preserve data changes that occurred during the upgrade window.

The Apache Hadoop documentation on HDFS Rolling Upgrade covers the specific commands in more detail. To satisfy the requirements of Rollback, HDFS will not delete blocks during a rolling upgrade window, which is the time between initiating the rolling upgrade and finalizing it. During this window, DataNodes handle block deletions by moving the blocks to a special directory named "trash" instead of physically deleting them. While the blocks reside in trash, they are not visible to clients performing reads. Thus, the files are logically deleted, but the blocks still consume physical space on the DataNode volumes. If the administrator chooses to roll back, the DataNodes restore these blocks from the trash directory to return the cluster's data to its pre-upgrade state.

After the upgrade is finalized, normal block deletion processing resumes. Blocks previously saved to trash will be physically deleted, and new deletion activity will result in a physical delete rather than a move to trash. Block deletion is asynchronous, so there may be a propagation delay between the user deleting a file and the space being freed as reported by tools like "hdfs dfsadmin -report".

Impact on HDFS Space Utilization

An important consequence of this behavior is that during a rolling upgrade window, HDFS space utilization will rise continuously. Attempting to free space by deleting files will be ineffective, because the blocks are moved to the trash directory instead of physically deleted. Note that this behavior applies not only to files that existed before the upgrade, but also to new files created during the upgrade window: all deletes are handled by moving the blocks to trash.

An administrator might notice that even after deleting a large number of files, various tools continue to report high space consumption. This includes "hdfs dfsadmin -report", JMX metrics (which are consumed by Apache Ambari) and the NameNode web UI. If a cluster shows these symptoms, check whether a rolling upgrade has not been finalized. There are multiple ways to check this. The "hdfs dfsadmin -rollingUpgrade query" command will report "Proceed with rolling upgrade", and the "Finalize Time" will be unspecified.

> hdfs dfsadmin -rollingUpgrade query
QUERY rolling upgrade ...
Proceed with rolling upgrade:
Block Pool ID: BP-1273075337-10.22.2.98-1466102062415
Start Time: Thu Jun 16 14:55:09 PDT 2016 (=1466114109053)
Finalize Time: <NOT FINALIZED>

The NameNode web UI will display a banner at the top stating "Rolling upgrade started". JMX metrics also expose "RollingUpgradeStatus", which will have a "finalizeTime" of 0 if the upgrade has not been finalized.

> curl 'http://10.22.2.98:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'
...
"RollingUpgradeStatus" : {
"blockPoolId" : "BP-1273075337-10.22.2.98-1466102062415",
"createdRollbackImages" : true,
"finalizeTime" : 0,
"startTime" : 1466114109053
},
...

DataNode Disk Layout

This section explores the layout on disk for DataNodes that have logically deleted blocks during a rolling upgrade window. The following discussion uses a small testing cluster containing only one file. This shows a typical disk layout on a DataNode volume hosting exactly one block replica:

data/dfs/data/current
├── BP-1273075337-10.22.2.98-1466102062415
│ ├── current
│ │ ├── VERSION
│ │ ├── dfsUsed
│ │ ├── finalized
│ │ │ └── subdir0
│ │ │ └── subdir0
│ │ │ ├── blk_1073741825
│ │ │ └── blk_1073741825_1001.meta
│ │ └── rbw
│ ├── scanner.cursor
│ └── tmp
└── VERSION

The block file and its corresponding metadata file are in the "finalized" directory. If this file were deleted during a rolling upgrade window, then the block file and its corresponding metadata file would move to the trash directory:

data/dfs/data/current
├── BP-1273075337-10.22.2.98-1466102062415
│ ├── RollingUpgradeInProgress
│ ├── current
│ │ ├── VERSION
│ │ ├── dfsUsed
│ │ ├── finalized
│ │ │ └── subdir0
│ │ │ └── subdir0
│ │ └── rbw
│ ├── scanner.cursor
│ ├── tmp
│ └── trash
│ └── finalized
│ └── subdir0
│ └── subdir0
│ ├── blk_1073741825
│ └── blk_1073741825_1001.meta
└── VERSION

As a reminder, block deletion activity in HDFS is asynchronous. It may take several minutes after running the "hdfs dfs -rm" command before the block moves from finalized to trash. One way to determine extra space consumption by logically deleted files is to run a "du" command on the trash directory.

> du -hs data/dfs/data/current/BP-1273075337-10.22.2.98-1466102062415/trash
8.0K data/dfs/data/current/BP-1273075337-10.22.2.98-1466102062415/trash

Assuming relatively even data distribution across nodes in the cluster, if this shows that a significant proportion of the volume's capacity is consumed by the trash directory, then that is a sign that the unfinalized rolling upgrade is the source of the space consumption.

Conclusion

Finalize those upgrades!
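For completeness, the finalization step itself is a single dfsadmin command from the same family as the query command shown above:

> hdfs dfsadmin -rollingUpgrade finalize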
06-16-2016
08:27 PM
1 Kudo
@Eric Periard, there is a known issue right now in the way Ambari determines HA status for the NameNodes. Ambari uses a JMX query to each NameNode. The current implementation of that query fetches more data than is strictly necessary for checking HA status, and this can cause delays in processing the query. The symptom is that the Ambari UI will misreport the active/standby status of the NameNodes as you described. The problem is intermittent, so a browser refresh is likely to show correct behavior. There is a fix in development now for Ambari to use a lighter-weight JMX query that won't be as prone to this problem. This does not indicate a problem with the health of HDFS. As you noted, users are still able to read and write files. The problem is limited to the reporting of HA status displayed in Ambari.
06-16-2016
08:03 PM
@Zack Riesland, the S3N file system will buffer to a local disk area first before flushing data to the S3 bucket. I suspect that depending on the amount of concurrent copy activity happening on the node (number of DistCp mapper tasks actively copying to S3N concurrently), you might hit the limit of available disk space for that buffering. The directory used by S3N for this buffering is configurable via property fs.s3.buffer.dir in core-site.xml. See below for the full specification of that property and its default value. I recommend reviewing this in your cluster to make sure that it's configured to point to a large enough volume to support the workload. You can specify a comma-separated list of multiple paths too if you want to use multiple disks.

<property>
<name>fs.s3.buffer.dir</name>
<value>${hadoop.tmp.dir}/s3</value>
<description>Determines where on the local filesystem the s3:/s3n: filesystem
should store files before sending them to S3
(or after retrieving them from S3).
</description>
</property>
06-16-2016
07:14 PM
@Paul Tader, the NameNode continues to track state on the decommissioned (and now dead) node for reporting purposes. As you noticed, it continues to show in the output of "hdfs dfsadmin -report", in JMX metrics and in the NameNode web UI. Ambari consumes the JMX metrics. This is by design to support certain use cases. One example is decommission, followed by a lengthy maintenance window such as an OS upgrade, followed by recommission. By continuing to report the decommissioned node, we can give the administrator a list of nodes that might still need further maintenance work and subsequent recommission. I believe the only way to clear the NameNode's state tracking for the decommissioned node completely is to restart the NameNode process. In an HA cluster, this could be achieved with zero downtime by failing over to the other NameNode during the restart. In a non-HA cluster, it would likely have to be part of a planned maintenance window with known downtime. Perhaps a good feature enhancement for HDFS would be something like an "hdfs dfsadmin -clearDecommissionedNodes" command to clear the list of dead decommissioned nodes while the NameNode process remains running. We could consider that.
06-16-2016
06:25 PM
@Girish Chaudhari, I think you have a viable theory: "It looks to me presto nodes doesn't have access to datanodes as my hadoop cluster is different than presto cluster." From the stack trace, I can tell that if the HDFS client reached this point, then it must have successfully connected to the NameNode. However, the client (the Presto node in this case) also must be able to connect to any DataNode in the HDFS cluster to be able to read or write block data. The DataNode connection is made on the data transfer port, configured as dfs.datanode.address in hdfs-site.xml, with a default value of 50010. I recommend some basic network troubleshooting: log in to the Presto nodes and try NetCat or Telnet to the data transfer port on each of the DataNodes. If that fails, then this is a network connectivity problem specific to your environment. Check that there is a routable path from the Presto nodes to the DataNodes, and check for firewall rules that might block access to the data transfer port.
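For example, from one of the Presto nodes (the DataNode hostname below is a placeholder; 50010 is the default dfs.datanode.address port):

nc -vz datanode1.example.com 50010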
06-16-2016
04:13 PM
@Payel Datta, thank you for sharing the full stack trace. I expect this will turn out to be some kind of misconfiguration, either of the host network settings or of ZooKeeper's connection configuration. On the ZooKeeper side, I recommend reviewing the zoo.cfg files and the myid files. On each host, the myid file must match up correctly with the address settings in zoo.cfg. For example, on the node with myid=1, look in zoo.cfg for the server.1 settings. Make sure those settings have the correct matching host or IP address. Perhaps the addresses in zoo.cfg do not match correctly with the network interface on the host. If the settings in zoo.cfg refer to a hostname/IP address for which the host does not have a listening network interface, then the bind won't be able to succeed. On the networking side, you might try using basic tools like NetCat to see if it's possible to set up a listening server bound to port 3888. If that succeeds, then it's likely not a host OS networking problem. If it fails though, then that's worth further investigation on the networking side, independent of ZooKeeper.
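To make that cross-check concrete, here is a minimal sketch with hypothetical hostnames and a commonly used dataDir location (adjust both to your environment):
# zoo.cfg (the server.N lines should be identical on all three nodes)
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

# On zk1.example.com, the myid file under dataDir must contain only the digit 1,
# and zk1.example.com must resolve to an interface that host actually listens on.
cat /var/lib/zookeeper/myid
1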
06-16-2016
02:11 PM
1 Kudo
@Kaliyug Antagonist, this sounds like after the reboot, something caused the DataNodes to fail sending block reports to the NameNode. The NameNode usually waits until sufficient block reports have arrived before leaving safe mode. By running "hdfs dfsadmin -safemode leave", it forced an exit from safe mode without receipt of the necessary block reports. For next steps, I recommend reviewing the logs from the NameNode and a sampling of the DataNodes. If there is an error in block reporting, then I expect you'll find corresponding log messages that explain the problem in more detail. Perhaps the DataNodes are failing to connect to the NameNode's RPC server port.
06-15-2016
09:30 PM
@Zack Riesland, yes, there is a -bandwidth option. For full documentation of the available command line options, refer to the Apache documentation on DistCp.
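A hedged example of combining -bandwidth with the -m option that controls the number of maps; the paths, bucket name, and numbers are illustrative only:

hadoop distcp -m 20 -bandwidth 10 hdfs://nn1:8020/data/source s3n://my-bucket/data/source   # roughly 10 MB/s per map across 20 maps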
06-15-2016
07:11 PM
@Zack Riesland, your understanding of DistCp is correct. It performs a raw byte-by-byte copy from the source to the destination. If that data is compressed ORC at the source, then that's what it will be at the destination too.
According to AWS blog posts, Elastic MapReduce does support use of ORC. This is not a scenario I have tested myself though. I'd recommend a quick prototyping end-to-end test to make sure it meets your requirements: DistCp a small ORC data set to S3, and then see if you can query it successfully from EMR.
06-15-2016
06:49 PM
@Payel Datta, you won't need to declare the leader explicitly. The ZooKeeper ensemble negotiates a leader node automatically by itself. Do you have more details on that bind exception? Is it possible that something else on the host is already using that port?
06-15-2016
06:44 PM
2 Kudos
@Zack Riesland, have you considered trying DistCp to copy the raw files from a source hdfs: URI to a destination s3n: or s3a: URI? It's possible this would be able to move the data more quickly than the Hive insert into/select from. If it's still important to have Hive metadata referencing the table at the s3n: or s3a: location, then you could handle that by creating an external table after completion of the DistCp.
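A rough sketch of that two-step flow, with a hypothetical table name, schema, bucket, and paths:

hadoop distcp hdfs://nn1:8020/apps/hive/warehouse/events s3a://my-bucket/warehouse/events

hive> CREATE EXTERNAL TABLE events_s3 (id BIGINT, payload STRING)
    >   STORED AS ORC
    >   LOCATION 's3a://my-bucket/warehouse/events';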