Member since 02-09-2016 · 40 Posts · 14 Kudos Received · 0 Solutions
07-18-2016 02:56 PM
Hi, would it be possible to provide HDFS privileges to users defined on the cluster? For example, my-env-hdfs is a user we have on our cluster. Can I grant hdfs user-level privileges to this user? If so, how do I do it? Likewise, what about other service accounts such as yarn, hive, ambari-qa, etc.? Thanks
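Not an authoritative answer, but one common approach is to grant a specific user access to HDFS paths through HDFS ACLs rather than making it a superuser. A sketch, assuming ACLs are enabled (`dfs.namenode.acls.enabled=true` in hdfs-site.xml) and that `/data/shared` is just an example path:

```shell
# Sketch: grant the example user my-env-hdfs rwx on an example path,
# running the command as the hdfs superuser.
sudo -u hdfs hdfs dfs -setfacl -m user:my-env-hdfs:rwx /data/shared
# Verify the resulting ACL:
sudo -u hdfs hdfs dfs -getfacl /data/shared
```

If the cluster runs Apache Ranger, the same effect is usually achieved through Ranger policies instead of raw ACLs.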
Labels: Apache Hadoop, Apache YARN
06-29-2016 11:14 AM
Thanks @Benjamin Leonhardi. Further to my question, what is the best strategy for removing old log files? Can I simply remove all the logs apart from the "current" ones without any issues? Are there any best practices around log management? Thanks
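One simple pattern (a sketch, not a vetted policy) is to delete only the rotated log files older than a retention window, leaving the live `.log`/`.out` files the daemons are still writing to. The directory and 30-day retention below are assumptions; adjust per environment:

```shell
# Sketch: remove rotated Hadoop logs older than a retention window.
# Rotated files carry a numeric suffix, e.g. hadoop-hdfs-namenode-host.log.3,
# so the live .log and .out files are left untouched.
cleanup_rotated_logs() {
  local log_dir="${1:-/var/log/hadoop/hdfs}"
  local retention_days="${2:-30}"
  find "$log_dir" -type f \
    \( -name '*.log.[0-9]*' -o -name '*.out.[0-9]*' \) \
    -mtime +"$retention_days" -print -delete
}

# Example (needs permissions on the log directory):
# cleanup_rotated_logs /var/log/hadoop/hdfs 30
```

Longer term, tuning log4j's RollingFileAppender (max file size and max backup index) keeps growth bounded without cron cleanup.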
06-29-2016 10:28 AM
Any idea what the various log files typically created under the /var/log/hadoop/* folders are? Is there a defined naming convention mapping files to Hadoop daemons? I ask because I see many files listed under the /var/log/hadoop/hdfs folder, but I can't find documentation on the purpose of each log file. Any help, please.
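For what it's worth, daemon logs under /var/log/hadoop/&lt;user&gt; generally follow the standard hadoop-daemon.sh pattern: hadoop-&lt;user&gt;-&lt;daemon&gt;-&lt;hostname&gt;.log is the main log4j log, the matching .out file captures the daemon's stdout/stderr from startup, rotated copies get a numeric suffix (.log.1, .log.2, ...), and there are separate files such as hdfs-audit.log and GC logs. A small illustrative helper (my own sketch, not an official tool) that pulls the daemon name out of such a filename:

```shell
# Illustrative only: extract the daemon name (namenode, datanode,
# secondarynamenode, ...) from a hadoop-<user>-<daemon>-<host>.log name.
daemon_from_logname() {
  basename "$1" | sed -E 's/^hadoop-[^-]+-([^-]+)-.*/\1/'
}

# e.g. daemon_from_logname hadoop-hdfs-namenode-host1.example.com.log
```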
06-29-2016 10:14 AM
1 Kudo
Hi, being a novice, I am trying to understand the answers to the questions below.
1. What is the difference between having configuration defined in hadoop-env.sh versus defining it in hdfs-site.xml or yarn-site.xml?
2. My presumption is that the *-default.xml files hold the standard Apache-defined configuration values, and that any custom values for those standard properties (whether vendor-specific, like Hortonworks or Cloudera, or implementation-specific at a project level) are defined in the *-site.xml files. Am I correct in my understanding?
3. What is the difference between the /usr/hdp/current and /usr/hdp/2.4.0.0.169 folders on the Sandbox? What is the importance/significance of each of these folders? Are they both required even on production deployments?
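On question 1: hadoop-env.sh is sourced by the daemon start-up scripts and sets process environment variables (JAVA_HOME, heap sizes, JVM options), while the *-site.xml files set Hadoop configuration properties, overriding the defaults shipped as *-default.xml inside the Hadoop jars. A minimal hdfs-site.xml override as an example (the property name is real; the value here is only illustrative):

```xml
<!-- hdfs-site.xml: site-specific override of a value that
     hdfs-default.xml (bundled in the Hadoop jars) sets to 3. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```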
05-25-2016 01:43 PM
4 Kudos
I have some questions around HDFS snapshots, which can be used for backup and DR purposes.
1. How do snapshots help with Disaster Recovery? What are the best practices around using snapshots for DR purposes, especially for data stored directly on HDFS versus Hive data and HBase data?
2. Can a directory be deleted using hdfs dfs -rmr -skipTrash /data/snapshot-dir? Or do all the snapshots have to be deleted first, and snapshotting disabled, before the directory can be deleted?
3. As I understand it, no data is copied for snapshots; only metadata is maintained for the blocks added/modified/deleted. If that's the case, I'm wondering what happens when the command hdfs dfs -rm /data/snapshot-dir/file1 is run. Will the file be moved to the trash? If so, will the snapshot maintain a reference to the entry in the trash? Will trash eviction have any impact in this case?
4. What happens when one of the sub-directories under the snapshot directory is deleted, for example with hdfs dfs -rmr -skipTrash /data/sub-dir? Can the data be recovered from snapshots?
5. Can snapshots be deleted/archived automatically based on policies, for example time-based ones? In the above example, how long will the sub-dir data be maintained in the snapshot?
6. How do snapshots work along with HDFS quotas? For example, assume a directory with a quota of 1 GB and snapshotting enabled. Assume the directory is close to its full quota and a user deletes a large file to store some other dataset. Will the new data be allowed into the directory, or will the operation be stopped because the quota limits have been exceeded?
Apologies if some of the questions don't make sense. I am still trying to understand these concepts at a ground level.
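For reference while exploring these questions, the basic snapshot lifecycle looks like this (paths are examples; this is a sketch, not a full answer to the quota and trash questions above). Note that a directory must be made snapshottable by an admin before snapshots can be taken, and all snapshots must be deleted before snapshotting can be disallowed, and hence before the directory itself can be removed:

```shell
# Make the directory snapshottable (admin command), then snapshot it.
hdfs dfsadmin -allowSnapshot /data/snapshot-dir
hdfs dfs -createSnapshot /data/snapshot-dir s1

# Files deleted afterwards remain readable under the read-only
# .snapshot path, and can be restored by copying them back out:
hdfs dfs -rm /data/snapshot-dir/file1
hdfs dfs -cp /data/snapshot-dir/.snapshot/s1/file1 /data/snapshot-dir/

# Tear-down order: delete snapshots first, then disallow snapshotting.
hdfs dfs -deleteSnapshot /data/snapshot-dir s1
hdfs dfsadmin -disallowSnapshot /data/snapshot-dir
```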
Labels: Apache Hadoop
04-27-2016 09:15 PM
@Abdelkrim Hadjidj great explanation, thanks! With the above understanding, I presume that, if need be, there can be a mix of both physical and virtual machines in the same cluster without any additional overhead or performance impact beyond the ones mentioned above?
04-27-2016 08:56 PM
2 Kudos
Hello, we are having an internal argument about whether it's a good idea to have the cluster running mainly on VMs, or whether it is better to have it on physical servers. What are the pros and cons of each hardware configuration? Also, is it a good idea to mix both physical and virtual machines in a single cluster, if need be?
Labels: Apache Hadoop
02-09-2016 11:14 PM
@Neeraj Sabharwal Thanks, it does help improve my understanding. For DR, apart from storage, how do we ensure all the job history information, workflow schedules, configuration information, etc. are recovered on the DR cluster when my active cluster is down? For configuration info, backups would probably help, but there would still be some missing pieces to recover if the active cluster is not fully covered from a DR perspective.