Member since 02-09-2016 · 40 Posts · 14 Kudos Received · 0 Solutions
07-18-2016 02:56 PM
Hi, would it be possible to provide HDFS privileges to users defined on the cluster? For example, my-env-hdfs is a user we have on our cluster. Can I grant hdfs user-level privileges to this user? If so, how do I do it? Likewise, what about other service accounts such as yarn, hive, ambari-qa, etc.? Thanks
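Not an authoritative answer, but one common approach is to grant a specific user access to HDFS paths through HDFS ACLs rather than making it a superuser. A sketch, assuming ACLs are enabled (`dfs.namenode.acls.enabled=true` in hdfs-site.xml) and that `/data/shared` is just an example path:

```shell
# Sketch: grant the example user my-env-hdfs rwx on an example path,
# running the command as the hdfs superuser.
sudo -u hdfs hdfs dfs -setfacl -m user:my-env-hdfs:rwx /data/shared
# Verify the resulting ACL:
sudo -u hdfs hdfs dfs -getfacl /data/shared
```

If the cluster runs Apache Ranger, the same effect is usually achieved through Ranger policies instead of raw ACLs.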
Labels: Apache Hadoop, Apache YARN
06-29-2016 11:14 AM
Thanks @Benjamin Leonhardi. Further to my question, what is the best strategy for removing old log files? Can I simply remove all the logs apart from the "current" ones without any issues? Are there any best practices around log management? Thanks
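One simple pattern (a sketch, not a vetted policy) is to delete only the rotated log files older than a retention window, leaving the live `.log`/`.out` files the daemons are still writing to. The directory and 30-day retention below are assumptions; adjust per environment:

```shell
# Sketch: remove rotated Hadoop logs older than a retention window.
# Rotated files carry a numeric suffix, e.g. hadoop-hdfs-namenode-host.log.3,
# so the live .log and .out files are left untouched.
cleanup_rotated_logs() {
  local log_dir="${1:-/var/log/hadoop/hdfs}"
  local retention_days="${2:-30}"
  find "$log_dir" -type f \
    \( -name '*.log.[0-9]*' -o -name '*.out.[0-9]*' \) \
    -mtime +"$retention_days" -print -delete
}

# Example (needs permissions on the log directory):
# cleanup_rotated_logs /var/log/hadoop/hdfs 30
```

Longer term, tuning log4j's RollingFileAppender (max file size and max backup index) keeps growth bounded without cron cleanup.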
06-29-2016 10:28 AM
Any idea what the various log files typically created under the /var/log/hadoop/* folders are? Is there a defined naming convention mapping files to Hadoop daemons? I ask because I see many files listed under the /var/log/hadoop/hdfs folder, but I can't find documentation on the purpose of each log file. Any help, please.
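For what it's worth, daemon logs under /var/log/hadoop/&lt;user&gt; generally follow the standard hadoop-daemon.sh pattern: hadoop-&lt;user&gt;-&lt;daemon&gt;-&lt;hostname&gt;.log is the main log4j log, the matching .out file captures the daemon's stdout/stderr from startup, rotated copies get a numeric suffix (.log.1, .log.2, ...), and there are separate files such as hdfs-audit.log and GC logs. A small illustrative helper (my own sketch, not an official tool) that pulls the daemon name out of such a filename:

```shell
# Illustrative only: extract the daemon name (namenode, datanode,
# secondarynamenode, ...) from a hadoop-<user>-<daemon>-<host>.log name.
daemon_from_logname() {
  basename "$1" | sed -E 's/^hadoop-[^-]+-([^-]+)-.*/\1/'
}

# e.g. daemon_from_logname hadoop-hdfs-namenode-host1.example.com.log
```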
06-29-2016 10:14 AM
1 Kudo
Hi, being a novice, I am trying to understand the answers to the questions below.
1. What is the difference between having configuration defined in hadoop-env.sh versus defining it in hdfs-site.xml or yarn-site.xml?
2. My presumption is that the *-default.xml files hold the standard Apache-defined configuration values, and that any custom values for those standard properties (whether vendor-specific, like Hortonworks or Cloudera, or implementation-specific at a project level) are defined in the *-site.xml files. Am I correct in my understanding?
3. What is the difference between the /usr/hdp/current and /usr/hdp/2.4.0.0.169 folders on the Sandbox? What is the importance/significance of each of these folders? Are they both required even on production deployments?
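On question 1: hadoop-env.sh is sourced by the daemon start-up scripts and sets process environment variables (JAVA_HOME, heap sizes, JVM options), while the *-site.xml files set Hadoop configuration properties, overriding the defaults shipped as *-default.xml inside the Hadoop jars. A minimal hdfs-site.xml override as an example (the property name is real; the value here is only illustrative):

```xml
<!-- hdfs-site.xml: site-specific override of a value that
     hdfs-default.xml (bundled in the Hadoop jars) sets to 3. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```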
05-25-2016 01:43 PM
4 Kudos
I have some questions around HDFS snapshots, which can be used for backup and DR purposes.
1. How do snapshots help with Disaster Recovery? What are the best practices around using snapshots for DR purposes, especially for data stored directly on HDFS versus Hive data and HBase data?
2. Can a directory be deleted using hdfs dfs -rmr -skipTrash /data/snapshot-dir? Or do all the snapshots have to be deleted first, and snapshotting disabled, before the directory can be deleted?
3. As I understand it, no data is copied for snapshots; only metadata is maintained for the blocks added/modified/deleted. If that's the case, I'm wondering what happens when the command hdfs dfs -rm /data/snapshot-dir/file1 is run. Will the file be moved to the trash? If so, will the snapshot maintain a reference to the entry in the trash? Will trash eviction have any impact in this case?
4. What happens when one of the sub-directories under the snapshot directory is deleted, for example with hdfs dfs -rmr -skipTrash /data/sub-dir? Can the data be recovered from snapshots?
5. Can snapshots be deleted/archived automatically based on policies, for example time-based ones? In the above example, how long will the sub-dir data be maintained in the snapshot?
6. How do snapshots work along with HDFS quotas? For example, assume a directory with a quota of 1 GB and snapshotting enabled. Assume the directory is close to its full quota and a user deletes a large file to store some other dataset. Will the new data be allowed into the directory, or will the operation be stopped because the quota limits have been exceeded?
Apologies if some of the questions don't make sense. I am still trying to understand these concepts at a ground level.
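For reference while exploring these questions, the basic snapshot lifecycle looks like this (paths are examples; this is a sketch, not a full answer to the quota and trash questions above). Note that a directory must be made snapshottable by an admin before snapshots can be taken, and all snapshots must be deleted before snapshotting can be disallowed, and hence before the directory itself can be removed:

```shell
# Make the directory snapshottable (admin command), then snapshot it.
hdfs dfsadmin -allowSnapshot /data/snapshot-dir
hdfs dfs -createSnapshot /data/snapshot-dir s1

# Files deleted afterwards remain readable under the read-only
# .snapshot path, and can be restored by copying them back out:
hdfs dfs -rm /data/snapshot-dir/file1
hdfs dfs -cp /data/snapshot-dir/.snapshot/s1/file1 /data/snapshot-dir/

# Tear-down order: delete snapshots first, then disallow snapshotting.
hdfs dfs -deleteSnapshot /data/snapshot-dir s1
hdfs dfsadmin -disallowSnapshot /data/snapshot-dir
```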
Labels: Apache Hadoop
04-27-2016 09:15 PM
@Abdelkrim Hadjidj great explanation, thanks! With the above understanding, I presume that, if need be, there can be a mix of both physical and virtual machines in the same cluster without any additional overhead or performance impact beyond the ones mentioned above?
04-27-2016 08:56 PM
2 Kudos
Hello, we are having an internal argument about whether it's a good idea to have the cluster running mainly on VMs, or whether it is better to have it on physical servers. What are the pros and cons of each hardware configuration? Also, is it a good idea to mix both physical and virtual machines in a single cluster, if need be?
Labels: Apache Hadoop
02-09-2016 11:14 PM
@Neeraj Sabharwal Thanks, it does help improve my understanding. For DR, apart from storage, how do we ensure all the job history information, workflow schedules, configuration information, etc. are recovered on the DR cluster when my active cluster is down? For configuration info, backups would probably help, but there would still be some missing pieces to recover if the active cluster is not fully covered from a DR perspective.