Member since 05-09-2016
421 Posts
54 Kudos Received
32 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2607 | 04-22-2022 11:31 AM |
 | 2411 | 01-20-2022 11:24 AM |
 | 2247 | 11-23-2021 12:53 PM |
 | 2954 | 02-07-2018 12:18 AM |
 | 4831 | 06-08-2017 09:13 AM |
06-08-2016
01:43 AM
2 Kudos
I am sharing this again with the configurations and prerequisites that were missing from the example above. This was done using HDP version 2.3.4 on both the source and target clusters.
Prerequisites:
Source and target clusters should be up and running with the required services. The HDP stack services required on the clusters are HDFS, YARN, Falcon, Hive, Oozie, Pig, Tez and ZooKeeper.
Staging and working directories should be present on HDFS for the falcon user. To create them, run the commands below on both the source and target clusters as the falcon user.
[falcon@src1 ~]$ hdfs dfs -mkdir /apps/falcon/staging /apps/falcon/working
[falcon@src1 ~]$ hdfs dfs -chmod 777 /apps/falcon/staging
Create the source and target cluster entities using the Falcon CLI as below:
For source cluster:
falcon entity -type cluster -submit -file source-cluster.xml
source-cluster.xml :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="source" description="primary" colo="primary" xmlns="uri:falcon:cluster:0.1">
<tags>EntityType=Cluster</tags>
<interfaces>
<interface type="readonly" endpoint="hdfs://src-nameNode:8020" version="2.2.0"/>
<interface type="write" endpoint="hdfs://src-nameNode:8020" version="2.2.0"/>
<interface type="execute" endpoint="src-resourceManager:8050" version="2.2.0"/>
<interface type="workflow" endpoint="http://src-oozieServer:11000/oozie/" version="4.0.0"/>
<interface type="messaging" endpoint="tcp://src-falconServer:61616?daemon=true" version="5.1.6"/>
<interface type="registry" endpoint="thrift://src-hiveMetaServer:9083" version="1.2.1" />
</interfaces>
<locations>
<location name="staging" path="/apps/falcon/staging"/>
<location name="temp" path="/tmp"/>
<location name="working" path="/apps/falcon/working"/>
</locations>
<ACL owner="ambari-qa" group="users" permission="0x755"/>
</cluster>
For target cluster:
falcon entity -type cluster -submit -file target-cluster.xml
target-cluster.xml :
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="target" description="target" colo="backup" xmlns="uri:falcon:cluster:0.1">
<tags>EntityType=Cluster</tags>
<interfaces>
<interface type="readonly" endpoint="hdfs://tgt-nameNode:8020" version="2.2.0"/>
<interface type="write" endpoint="hdfs://tgt-nameNode:8020" version="2.2.0"/>
<interface type="execute" endpoint="tgt-resourceManager:8050" version="2.2.0"/>
<interface type="workflow" endpoint="http://tgt-oozieServer:11000/oozie/" version="4.0.0"/>
<interface type="messaging" endpoint="tcp://tgt-falconServer:61616?daemon=true" version="5.1.6"/>
<interface type="registry" endpoint="thrift://tgt-hiveMetaServer:9083" version="1.2.1" />
</interfaces>
<locations>
<location name="staging" path="/apps/falcon/staging"/>
<location name="temp" path="/tmp"/>
<location name="working" path="/apps/falcon/working"/>
</locations>
<ACL owner="ambari-qa" group="users" permission="0x755"/>
</cluster>
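Before submitting, it can help to check that a cluster entity file actually declares every interface and location used in the two definitions above. Below is a minimal sketch in Python, not part of Falcon itself. The function name `missing_parts`, the `SAMPLE` string and the two "expected" sets are my own assumptions, taken from this post rather than from the Falcon schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by the cluster entities in this post.
NS = {"f": "uri:falcon:cluster:0.1"}

# Assumption: these are simply the interfaces/locations that appear in the
# example entities above, not an authoritative list from the Falcon XSD.
EXPECTED_INTERFACES = {"readonly", "write", "execute",
                       "workflow", "messaging", "registry"}
EXPECTED_LOCATIONS = {"staging", "temp", "working"}


def missing_parts(cluster_xml):
    """Return the expected interface types and location names that are absent."""
    root = ET.fromstring(cluster_xml)
    interfaces = {i.get("type")
                  for i in root.findall("f:interfaces/f:interface", NS)}
    locations = {loc.get("name")
                 for loc in root.findall("f:locations/f:location", NS)}
    return {
        "interfaces": sorted(EXPECTED_INTERFACES - interfaces),
        "locations": sorted(EXPECTED_LOCATIONS - locations),
    }


# Hypothetical, shortened entity with the registry interface left out.
SAMPLE = """
<cluster name="source" description="primary" colo="primary" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly" endpoint="hdfs://src-nameNode:8020" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://src-nameNode:8020" version="2.2.0"/>
    <interface type="execute" endpoint="src-resourceManager:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://src-oozieServer:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://src-falconServer:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/working"/>
  </locations>
</cluster>
"""

if __name__ == "__main__":
    print(missing_parts(SAMPLE))
```

For Hive replication the registry interface is the one that matters most, since it points Falcon at the Hive metastore, so a check like this catches that omission early.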
Create the source database and table and insert some data. Run on the source cluster's Hive:
create database landing_db;
use landing_db;
CREATE TABLE summary_table(id int, value string) PARTITIONED BY (ds string);
----------------------------------------------------------------------------------------
ALTER TABLE summary_table ADD PARTITION (ds = '2014-01');
ALTER TABLE summary_table ADD PARTITION (ds = '2014-02');
ALTER TABLE summary_table ADD PARTITION (ds = '2014-03');
----------------------------------------------------------------------------------------
insert into summary_table PARTITION(ds) values (1,'abc1',"2014-01");
insert into summary_table PARTITION(ds) values (2,'abc2',"2014-02");
insert into summary_table PARTITION(ds) values (3,'abc3',"2014-03");
Create the target database and table. Run on the target cluster's Hive:
create database archive_db;
use archive_db;
CREATE TABLE summary_archive_table(id int, value string) PARTITIONED BY (ds string);
Submit the feed entity (do not schedule it yet):
falcon entity -type feed -submit -file replication-feed.xml
replication-feed.xml :
<?xml version="1.0" encoding="UTF-8"?>
<feed description="Monthly Analytics Summary" name="replication-feed" xmlns="uri:falcon:feed:0.1">
<tags>EntityType=Feed</tags>
<frequency>months(1)</frequency>
<clusters>
<cluster name="source" type="source">
<validity start="2014-01-01T00:00Z" end="2015-03-31T00:00Z"/>
<retention limit="months(36)" action="delete"/>
</cluster>
<cluster name="target" type="target">
<validity start="2014-01-01T00:00Z" end="2015-03-31T00:00Z"/>
<retention limit="months(180)" action="delete"/>
<table uri="catalog:archive_db:summary_archive_table#ds=${YEAR}-${MONTH}" />
</cluster>
</clusters>
<table uri="catalog:landing_db:summary_table#ds=${YEAR}-${MONTH}" />
<ACL owner="falcon" />
<schema location="hcat" provider="hcat"/>
</feed>
The example feed entity above demonstrates the following:
- Cross-cluster replication of a data set
- The native use of a Hive/HCatalog table in Falcon
- The definition of a separate retention policy for the source and target tables in replication
Make sure all Oozie servers that Falcon talks to have the Hadoop configs configured in oozie-site.xml. For example, in my case I have added the following to my target cluster's Oozie:
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/etc/hadoop/conf,src-namenode:8020=/etc/src_hadoop/conf,src-resourceManager:8050=/etc/src_hadoop/conf</value>
<description>Comma separated AUTHORITY=HADOOP_CONF_DIR, where AUTHORITY is the HOST:PORT of the Hadoop service (JobTracker, HDFS). The wildcard '*' configuration is used when there is no exact match for an authority. The HADOOP_CONF_DIR contains the relevant Hadoop *-site.xml files. If the path is relative is looked within the Oozie configuration directory; though the path can be absolute (i.e. to point to Hadoop client conf/ directories in the local filesystem.</description>
</property>
Here /etc/src_hadoop/conf holds the configuration files (/etc/hadoop/conf) copied from the source cluster to the target cluster's Oozie server.
Also ensure that Oozie on your target cluster can submit jobs to the source cluster. This can be done by setting below property:
Finally, schedule the feed as below, which will submit an Oozie coordinator on the target cluster.
falcon entity -type feed -schedule -name replication-feed
06-06-2016
03:41 AM
@Michael Dennis "MD" Uanang Which version of Ambari are you using? Are you running Ambari on bare metal or on a VM? Can you share your ambari.properties file? Also, please paste the output of netstat -plant|grep <pid_ambari_server>
05-31-2016
09:43 AM
1 Kudo
Hi All, I discovered that the issue was with my configuration below:
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/etc/hadoop/conf,nr1.hwxblr.com:8020=/etc/primary_conf/conf,nr3.hwxblr.com:8030=/etc/primary_conf/conf,nr21.hwxblr.com:8020=/etc/hadoop/conf,nr23.hwxblr.com:8030=/etc/hadoop/conf</value>
Instead of port 8030 it should be 8050. Thanks @Kuldeep Kulkarni for finding this.
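To see why the wrong port matters, here is a rough sketch in Python of how an AUTHORITY=HADOOP_CONF_DIR list resolves, based only on the property description (exact HOST:PORT match, otherwise the '*' wildcard). It is not Oozie's actual code. The function names are made up for illustration, any host-name normalisation Oozie may do is not modelled, and using nr3.hwxblr.com:8050 as the source ResourceManager address is my assumption.

```python
def parse_hadoop_configurations(value):
    """Split 'AUTHORITY=CONF_DIR,AUTHORITY=CONF_DIR,...' into a dict."""
    mapping = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        authority, _, conf_dir = entry.partition("=")
        mapping[authority.strip()] = conf_dir.strip()
    return mapping


def conf_dir_for(mapping, authority):
    """Exact match on HOST:PORT, falling back to the '*' wildcard entry."""
    return mapping.get(authority, mapping.get("*"))


# The value from the post, with the ResourceManager entries on port 8030.
BROKEN = ("*=/etc/hadoop/conf,"
          "nr1.hwxblr.com:8020=/etc/primary_conf/conf,"
          "nr3.hwxblr.com:8030=/etc/primary_conf/conf,"
          "nr21.hwxblr.com:8020=/etc/hadoop/conf,"
          "nr23.hwxblr.com:8030=/etc/hadoop/conf")

if __name__ == "__main__":
    broken = parse_hadoop_configurations(BROKEN)
    fixed = parse_hadoop_configurations(BROKEN.replace(":8030", ":8050"))
    # Assumed source ResourceManager address used when submitting the job.
    rm = "nr3.hwxblr.com:8050"
    print("with 8030:", conf_dir_for(broken, rm))
    print("with 8050:", conf_dir_for(fixed, rm))
```

Under these assumptions, the 8030 entry never matches the address the job is submitted to, so the lookup falls through to the wildcard and picks up the target cluster's /etc/hadoop/conf. That would be consistent with the AM trying to reach the target RM.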
05-31-2016
06:57 AM
1 Kudo
Hi,
I am trying https://falcon.apache.org/HiveIntegration.html with HDP 2.3.4 on both the source and target clusters. There is a known limitation mentioned in the link, as follows:
Oozie 4.x with Hadoop-2.x
Replication jobs are submitted to oozie on the destination cluster. Oozie runs a table export job on RM on source cluster. Oozie server on the target cluster must be configured with source hadoop configs else jobs fail with errors on secure and non-secure clusters as below:
org.apache.hadoop.security.token.SecretManager$InvalidToken: Password not found for ApplicationAttempt appattempt_1395965672651_0010_000002
So I have configured my Oozie as given below:
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/etc/hadoop/conf,nr1.hwxblr.com:8020=/etc/primary_conf/conf,nr3.hwxblr.com:8030=/etc/primary_conf/conf,nr21.hwxblr.com:8020=/etc/hadoop/conf,nr23.hwxblr.com:8030=/etc/hadoop/conf</value>
I am still facing the error below.
2016-05-31 06:31:34,170 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 33
2016-05-31 06:31:34,204 INFO [main] org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at nr23.hwxblr.com/10.0.1.25:8030
2016-05-31 06:31:34,235 WARN [main] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): appattempt_1464674342681_0001_000002 not found in AMRMTokenSecretManager.
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:375)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:558)
at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:373)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:723)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
However, note that this job is running on the source cluster's RM, submitted from the target cluster's Oozie, and as per the log it looks like the AM is trying to connect to the target RM instead of the source RM. Can anyone share details on what could be wrong?
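A quick way to confirm which ResourceManager the AM is really talking to is to pull the address out of the "Connecting to ResourceManager at ..." lines in the container log. A small sketch in Python follows. The regular expression is written against the single log line quoted above, and the helper names and the expected source RM host (nr3.hwxblr.com) are assumptions for illustration.

```python
import re

# Matches e.g. "Connecting to ResourceManager at nr23.hwxblr.com/10.0.1.25:8030"
RM_LINE = re.compile(
    r"Connecting to ResourceManager at "
    r"(?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def rm_targets(log_text):
    """Return (host, port) for every ResourceManager connection in the log."""
    return [(m["host"], int(m["port"])) for m in RM_LINE.finditer(log_text)]


def unexpected_rm_hosts(log_text, expected_host):
    """List hosts the AM connected to that are not the expected RM host."""
    return sorted({host for host, _ in rm_targets(log_text)
                   if host != expected_host})


LOG = (
    "2016-05-31 06:31:34,204 INFO [main] org.apache.hadoop.yarn.client.RMProxy: "
    "Connecting to ResourceManager at nr23.hwxblr.com/10.0.1.25:8030\n"
)

if __name__ == "__main__":
    print(rm_targets(LOG))
    # Assumption: nr3.hwxblr.com is the source cluster's ResourceManager.
    print(unexpected_rm_hosts(LOG, "nr3.hwxblr.com"))
```

If the second list is not empty, the job picked up the wrong cluster's Hadoop configuration somewhere along the way.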
Labels: Apache Falcon
05-30-2016
04:51 PM
2 Kudos
Hi @Sagar Shimpi
This is what I got from Hadoop: The Definitive Guide, 4th Edition:
All the YARN schedulers try to honor locality requests. On a busy cluster, if an application requests a particular node, there is a good chance that other containers are running on it at the time of the request. The obvious course of action is to immediately loosen the locality requirement and allocate a container on the same rack. However, it has been observed in practice that waiting a short time (no more than a few seconds) can dramatically increase the chances of being allocated a container on the requested node, and therefore increase the efficiency of the cluster. This feature is called delay scheduling, and it is supported by both the Capacity Scheduler and the Fair Scheduler.
Every node manager in a YARN cluster periodically sends a heartbeat request to the resource manager—by default, one per second. Heartbeats carry information about the node manager’s running containers and the resources available for new containers, so each heartbeat is a potential scheduling opportunity for an application to run a container.
When using delay scheduling, the scheduler doesn’t simply use the first scheduling opportunity it receives, but waits for up to a given maximum number of scheduling opportunities to occur before loosening the locality constraint and taking the next scheduling opportunity.
For the Capacity Scheduler, delay scheduling is configured by setting yarn.scheduler.capacity.node-locality-delay to a positive integer representing the number of scheduling opportunities that it is prepared to miss before loosening the node constraint to match any node in the same rack.
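To make the idea concrete, here is a toy model of delay scheduling in Python. It is only a sketch of the behaviour described in the quoted passage, not the Capacity Scheduler's real logic. The function name, the node and rack names, and the representation of heartbeats as a simple list are all made up for illustration.

```python
def schedule_with_delay(heartbeats, preferred_node, rack_of, node_locality_delay):
    """Pick a node for one container request.

    heartbeats: node names in the order their heartbeats arrive; each one
        is a scheduling opportunity.
    node_locality_delay: number of opportunities the scheduler is prepared
        to miss before accepting any node in the preferred node's rack.
    """
    missed = 0
    preferred_rack = rack_of[preferred_node]
    for node in heartbeats:
        if node == preferred_node:
            return node, "node-local"
        # Enough opportunities missed: loosen the constraint to the rack.
        if missed >= node_locality_delay and rack_of.get(node) == preferred_rack:
            return node, "rack-local"
        missed += 1
    return None, "unscheduled"


# Hypothetical three-node cluster: n1 and n2 share a rack, n3 does not.
RACKS = {"n1": "rack1", "n2": "rack1", "n3": "rack2"}
HEARTBEATS = ["n2", "n3", "n2", "n1"]

if __name__ == "__main__":
    # Short delay: gives up on n1 and takes n2 on the same rack.
    print(schedule_with_delay(HEARTBEATS, "n1", RACKS, node_locality_delay=2))
    # Longer delay: waits long enough for n1's own heartbeat.
    print(schedule_with_delay(HEARTBEATS, "n1", RACKS, node_locality_delay=5))
```

With a delay of 2 the request ends up rack-local on n2, while a delay of 5 lets it wait for n1 itself. That is the trade-off yarn.scheduler.capacity.node-locality-delay controls.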
05-30-2016
02:23 PM
2 Kudos
Subramanian Santhanam, check your hdfs-site.xml for dfs.data.dir. This is a comma-delimited list of directories; remove what you do not need. If the cluster is managed by Ambari, change this from Ambari under HDFS -> Config -> DataNode directories, and ensure that it is configured correctly.
05-30-2016
09:38 AM
@mayki wogno
You can check the staging location of the cluster.
In my case it is something like /apps/falcon/<clustername>/staging/workflows/feed/<feed-name>/logs
You can find the staging location in your source and target cluster definitions.
For example:
$ falcon entity -type cluster -definition -name backup
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="backup" description="backup" colo="backup" xmlns="uri:falcon:cluster:0.1">
<tags>EntityType=Cluster</tags>
<interfaces>
<interface type="readonly" endpoint="hdfs://nr21.hwxblr.com:50070" version="2.7.1"/>
<interface type="write" endpoint="hdfs://nr21.hwxblr.com:8020" version="2.7.1"/>
<interface type="execute" endpoint="nr23.hwxblr.com:8050" version="2.7.1"/>
<interface type="workflow" endpoint="http://nr22.hwxblr.com:11000/oozie/" version="4.2.0"/>
<interface type="messaging" endpoint="tcp://nr22.hwxblr.com:61616?daemon=true" version="5.1.6"/>
<interface type="registry" endpoint="thrift://nr22.hwxblr.com:9083" version="1.2.1"/>
</interfaces>
<locations>
<location name="staging" path="/apps/falcon/backup/staging"/>
<location name="temp" path="/tmp"/>
<location name="working" path="/apps/falcon/backup/working"/>
</locations>
<ACL owner="falcon" group="users" permission="0x755"/>
</cluster>
Do let us know if it was helpful. RahulP
05-30-2016
09:21 AM
Hi @mayki wogno, Can you give more details about your feed?
05-30-2016
06:29 AM
Hi @Arulanand Dayalan Can you try setting a proxy for yum? Refer to this link.