Member since: 01-09-2019

401 Posts · 163 Kudos Received · 80 Solutions

        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2596 | 06-21-2017 03:53 PM |
| | 4294 | 03-14-2017 01:24 PM |
| | 2388 | 01-25-2017 03:36 PM |
| | 3840 | 12-20-2016 06:19 PM |
| | 2101 | 12-14-2016 05:24 PM |
			
    
	
		
		
05-30-2016 07:16 PM

Please take a look at this post: https://community.hortonworks.com/questions/2349/tip-when-you-get-a-message-in-job-log-user-dr-who.html
There are different ways to fix this issue; one of them is to set hadoop.http.staticuser.user=yarn in core-site.xml. More details are in the linked thread.
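For reference, the property goes into core-site.xml in the standard Hadoop property format; a minimal sketch:

```xml
<!-- core-site.xml: static user the web UIs report when no user is authenticated
     (the default is dr.who, which is where the "dr.who" jobs come from) -->
<property>
  <name>hadoop.http.staticuser.user</name>
  <value>yarn</value>
</property>
```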
			
    
	
		
		
05-30-2016 07:13 PM

It looks like you have issues with your ResourceManager 1, and your app is trying to fail over to ResourceManager 2. Take a look at the RM1 logs to see what the error is. It is either down or not responding on port 8032 in time.
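A few quick checks you could run first; the RM ids, hostname, and log path below are assumptions based on typical HDP defaults (your yarn.resourcemanager.ha.rm-ids and log locations may differ):

```sh
# Which RM is active? (ids come from yarn.resourcemanager.ha.rm-ids)
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2

# Is RM1 reachable on the client port at all? (hostname is a placeholder)
nc -vz rm1.example.com 8032

# Look for the underlying error in the RM1 log
tail -n 200 /var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager-*.log
```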
			
    
	
		
		
05-30-2016 07:07 PM

Check whether you can passwordless-ssh to the same host (hadoop1) using the key. I believe you set this up from hadoop1 to connect to hadoop2, hadoop3, and hadoop4, but not to hadoop1 itself. If this does not work, you can try manual registration of the Ambari agents, following http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/ch_amb_ref_installing_ambari_agents_manually.html
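A quick way to test and, if needed, fix this; the key path and user are assumptions (use whatever key and account you registered with Ambari):

```sh
# From hadoop1, this should succeed without a password prompt
ssh -i ~/.ssh/id_rsa root@hadoop1 hostname

# If it prompts, authorize the key on hadoop1 itself, just as on the other hosts
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```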
			
    
	
		
		
05-29-2016 07:54 PM

1) You can use base permissions on HDFS and grant any additional permissions using Ranger. So, in the case of /data, you can start with 750, and if anyone in the group needs write permission, you can add it using a Ranger policy.

3) The user will have access. As I said in 1), you can put minimum permissions on HDFS and add additional permissions using Ranger.

4) You can still access this directly if Hive has doAs enabled and you are accessing from HiveServer2. This is the reason why you may have to duplicate access restrictions on both HDFS and Hive columns if you have access from both the Hive CLI and HiveServer2. It is almost the same case with HBase.

5) As in 1), you can put minimal permissions on HDFS and then add additional permissions using Ranger. That means you could go with 700 too, but that will add more overhead in creating policies.
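A minimal sketch of the base-permission pattern from 1); the owner and group names are assumptions for illustration:

```sh
# Base permissions on HDFS: owner full, group read/list, others nothing.
# Anything beyond this (e.g. group write) is granted via a Ranger policy.
hdfs dfs -chown hdfs:analysts /data   # owner and group are placeholders
hdfs dfs -chmod 750 /data
```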
			
    
	
		
		
05-29-2016 07:42 PM

Please post the errors that you are seeing, both in the Ambari UI and in ambari-server.log. The ambari-agent logs might also provide some insight into what is going on. Have you also executed 'ambari-server reset' before the reinstall?
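For reference, the default HDP log locations and the reset sequence; note that 'ambari-server reset' wipes the Ambari server database:

```sh
# Default log locations to check
tail -n 100 /var/log/ambari-server/ambari-server.log
tail -n 100 /var/log/ambari-agent/ambari-agent.log

# Caution: destructive; drops the Ambari database before a reinstall
ambari-server stop
ambari-server reset
```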
			
    
	
		
		
05-27-2016 04:20 PM

Whenever you are evaluating Hive on Tez against any other tool for data analytics (with row-level update/access patterns), my suggestion is to start with Hive: apply all the right tunings at the OS, cluster, and Hive levels, use ORC and bloom filters, organize your data, and see if query times hit your SLA. We have seen at a lot of places that once they tune Hive correctly and move away from text files, they hit their SLAs. You can then look at other tools if and when your SLAs are not met.
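As a hedged illustration of the ORC and bloom-filter advice, a hypothetical table definition (the table, columns, and property values are made up for the example):

```sql
-- Store as ORC instead of text, with a bloom filter on the common lookup column
CREATE TABLE events (
  event_id BIGINT,
  user_id  STRING,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'ZLIB',
  'orc.bloom.filter.columns' = 'user_id'
);
```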
			
    
	
		
		
05-26-2016 04:26 PM

Yes, that's correct.
			
    
	
		
		
05-26-2016 03:07 PM

Try setting this on the SparkContext as shown below. This works for file loads, and I believe it should work for Hive table loads as well:

    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
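A minimal usage sketch, assuming a hypothetical nested layout such as /data/logs/2016/05/part-00000:

```scala
// With the recursive flag set, textFile also picks up files in subdirectories
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

val logs = sc.textFile("/data/logs")   // path is a placeholder
println(logs.count())
```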
			
    
	
		
		
05-26-2016 02:46 PM

Q: Teeing vs. copying: which one is preferred over the other? I understand it is scenario-dependent, but which has better adaptability and is more widely used in the industry? Copying?
A: With teeing, you can split primary tasks between the two clusters and use the other cluster as DR for that task. As an example, if you have clusters C1 and C2, you can use C1 as the primary cluster and C2 as DR for some teams/tasks, and C2 as the primary cluster and C1 as DR for other users/tasks.

Q: Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what are the things to consider if the same version is not possible?
A: It is convenient to have them both on the same version. This is especially the case if you want to use DR with almost no code changes when the primary cluster is down.

Q: Should the topology be like-for-like between the clusters in terms of component placement, including gateway nodes and ZooKeeper services?
A: This is not required.

Q: How does security play out for DR? Should both clusters' nodes be part of the same Kerberos realm, or can they be part of different realms?
A: For DR, the same realm is a lot easier to manage than cross-realm, but cross-realm is possible.

Q: Can the replication factor be lower, or is it recommended to keep it the same as on the primary cluster?
A: I have seen replication factor 2 used on DR clusters, but if the DR cluster becomes the primary after a disaster, you may have to change the replication factor to 3 on all data sets.

Q: Are there any specific network requirements in terms of latency, speed, etc. between the clusters?
A: For distcp, each node on one cluster should be able to communicate with each of the nodes on the second cluster.

Q: Is there a need to run the balancer on the DR cluster periodically?
A: Yes. It is always good to run the balancer to keep a similar number of blocks across nodes.

Q: How does encryption play out between the primary and DR clusters? If encryption at rest is enabled on the primary one, how is it handled on the DR cluster? What are the implications of wire encryption while transferring data between the clusters?
A: Wire encryption will slow down transfers a little.

Q: When HDFS snapshots are enabled on the primary cluster, how does it work when data is being synced to the DR cluster? Can snapshots be exported to another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but is still available in a snapshot, will that be synced to the snapshot directory on the DR cluster?
A: If you are using snapshots, you can simply run distcp against the snapshots instead of the actual data set (see the sketch after this list).

Q: For services which involve databases (Hive, Oozie, Ambari), instead of backing up periodically from the primary cluster to the DR cluster, is it recommended to set up one HA master in the DR cluster directly?
A: I don't think automating Ambari is a good idea. Configs don't change that much, so a simple process of duplicating them might be better. Backing up would mean you need the same hostnames and the same topology. For Hive, instead of a complete backup, Falcon can take care of table-level replication.

Q: For configurations and application data, instead of backing up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?
A: Not sure where your application data resides, but for configuration, since everything is managed by Ambari, you just need to keep the Ambari configuration in sync.
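A hedged sketch of the snapshot-based distcp approach; the paths, hostnames, and snapshot names are placeholders, and the -diff form additionally assumes the target still holds the starting snapshot unchanged:

```sh
# One-time: make the source directory snapshottable (run as the HDFS admin)
hdfs dfsadmin -allowSnapshot /data/warehouse

# Take a snapshot at each sync point
hdfs dfs -createSnapshot /data/warehouse s1
# ... data changes on the primary ...
hdfs dfs -createSnapshot /data/warehouse s2

# Copy only the delta between s1 and s2 to the DR cluster
hadoop distcp -update -diff s1 s2 \
  hdfs://primary-nn.example.com:8020/data/warehouse \
  hdfs://dr-nn.example.com:8020/data/warehouse
```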
			
    
	
		
		
05-26-2016 02:01 PM

Take a look at https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html, which gives details on how to access S3 from Spark.
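For quick reference, this typically comes down to putting the S3 credentials on the Hadoop configuration; a minimal sketch using the s3a keys (the bucket and credential values are placeholders, and the linked article covers the HDP-specific jar/classpath details):

```scala
// Never hard-code real credentials in production code
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val data = sc.textFile("s3a://your-bucket/path/to/data")
println(data.count())
```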