12-15-2016
01:40 PM
9 Kudos
The Problem
Traditional 'distcp' from one directory to another, or from cluster to cluster, is quite useful for moving massive amounts of data, once. But what happens when you need to "update" a target directory or cluster with only the changes made since the last 'distcp' run? That becomes a very tricky scenario. 'distcp' offers an '-update' flag, which is supposed to move only the files that have changed. In this case 'distcp' pulls a list of files and directories from the source and the target, compares them, and then builds a migration plan.
The problem: it's an expensive and time-consuming task, and the process is not atomic. First, gathering a list of files and directories, along with their metadata, is expensive when the source contains millions of file and directory objects. This cost is incurred on both the source and target NameNodes, putting quite a bit of pressure on those systems. It's then up to 'distcp' to reconcile the differences between the source and target, which is also expensive. Only when that's finally complete does the process start to move data. And if data changes while the process is running, those changes can impact the transfer and lead to failure and partial migration.
The Solution
The process needs to be atomic, and it needs to be efficient. With Hadoop 2.0, HDFS introduced "snapshots." An HDFS snapshot is a point-in-time copy of a directory's metadata. The copy is stored in a hidden location and maintains references to all of the immutable filesystem objects. Creating a snapshot is atomic, and because HDFS data is immutable, an image of a directory's metadata doesn't require an additional copy of the underlying data.
Another feature of snapshots is the ability to efficiently calculate the changes between 'any' two snapshots on the same directory. Using 'hdfs snapshotDiff', you can build a list of "changes" between these two point-in-time references.
For Example
[hdfs@m3 ~]$ hdfs snapshotDiff /user/dstreev/stats s1 s2
Difference between snapshot s1 and snapshot s2 under directory /user/dstreev/stats:
M .
+ ./attempt
M ./namenode/fs_state/2016-12.txt
M ./namenode/nn_info/2016-12.txt
M ./namenode/top_user_ops/2016-12.txt
M ./scheduler/queue_paths/2016-12.txt
M ./scheduler/queue_usage/2016-12.txt
M ./scheduler/queues/2016-12.txt
Let's take the 'distcp' update concept and supercharge it with the efficiency of snapshots. Now you have a solution that scales far beyond the original 'distcp -update', and in the process removes the burden and load previously placed on the NameNodes.
Pre-Requisites and Requirements
The source must support 'snapshots':
hdfs dfsadmin -allowSnapshot <path>
The target is "read-only".
The target, after the initial baseline 'distcp' sync, needs to support snapshots.
Process
Identify the source and target 'parent' directories. Do not initially create the destination directory; allow the first distcp to do that. For example: if I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the 'a_target' directory.
Allow snapshots on the source directory:
hdfs dfsadmin -allowSnapshot /data/a
Create a snapshot of /data/a:
hdfs dfs -createSnapshot /data/a s1
Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to the following command.
hadoop distcp /data/a/.snapshot/s1 /data/a_target
Allow snapshots on the newly created target directory:
hdfs dfsadmin -allowSnapshot /data/a_target
At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
Create a snapshot in /data/a_target that matches the name of the snapshot used to build the baseline:
hdfs dfs -createSnapshot /data/a_target s1
Add some content to the source directory /data/a. Make the changes, adds, deletes, etc. that need to be replicated to /data/a_target. Then take a new snapshot of /data/a:
hdfs dfs -createSnapshot /data/a s2
Just for fun, check on what's changed between the two snapshots:
hdfs snapshotDiff /data/a s1 s2
Ok, now let's migrate the changes to /data/a_target:
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
When that's completed, finish the cycle by creating a matching snapshot on /data/a_target:
hdfs dfs -createSnapshot /data/a_target s2
That's it. You've completed the cycle. Rinse and repeat.
A Few Hints
Remember, snapshots need to be managed manually. They will stay around forever unless you clean them up with:
hdfs dfs -deleteSnapshot <snapshotDir> <snapshotName>
As long as a snapshot exists, the data exists. Deleting data, even with -skipTrash, from a directory that has a snapshot doesn't free up space. Only when all "references" to that data are gone can the space be reclaimed.
Initial migrations of data between systems are very expensive in terms of network I/O, and you probably don't want to do that again, ever. I recommend keeping a snapshot of the original copy on each system, OR some major checkpoint you can go back to, in case the process is compromised.
If 'distcp' can't validate that the snapshot (by name) is the same on the source and the target, and that the data at the target hasn't changed since that snapshot, the process will fail. If the failure is because the target directory has been updated, you'll need to use the baseline snapshots above to restore it without migrating all that data again, and then start the process up again.
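Once the baseline is in place, the repeatable part of the cycle is small enough to script. Here's a minimal sketch, assuming the paths used above and that the caller tracks snapshot names; SRC, TGT, PREV, and NEXT are placeholders, not part of the original write-up:
#!/usr/bin/env bash
# Incremental snapshot-based sync: sketch only. Assumes the baseline
# snapshot $PREV already exists on BOTH the source and the target.
set -e

SRC=/data/a
TGT=/data/a_target
PREV=$1   # e.g. s1 - last snapshot already synced to the target
NEXT=$2   # e.g. s2 - new snapshot name for this cycle

# Capture the current state of the source atomically.
hdfs dfs -createSnapshot ${SRC} ${NEXT}

# Replay only the changes between the two snapshots onto the target.
hadoop distcp -diff ${PREV} ${NEXT} -update ${SRC} ${TGT}

# Mark the target as caught up, so the next cycle can diff from ${NEXT}.
hdfs dfs -createSnapshot ${TGT} ${NEXT}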
11-17-2015
02:22 PM
21 Kudos
There are five ways to connect to HS2 with JDBC:
1. Direct - Binary Transport mode (Non-Secure|Secure)
2. Direct - HTTP Transport mode (Non-Secure|Secure)
3. ZooKeeper - Binary Transport mode (Non-Secure|Secure)
4. ZooKeeper - HTTP Transport mode (Non-Secure|Secure)
5. via Knox - HTTP Transport mode
Connecting to HS2 via ZooKeeper (3-4) (and Knox, if backed by ZooKeeper) provides a level of failover that you can't get by connecting directly. When connecting through ZooKeeper, the client is handed server connection information from a list of available servers. This list is managed on the backend, and the client isn't aware of it before the connection. This allows administrators to add additional servers to the list without reconfiguring the clients.
NOTE: HS2 in this configuration provides failover, not transparent high availability; failover is not automatic once a connection has been established. JDBC connections are stateful. The data and session information kept on HS2 for a connection is LOST when the server goes down. Jobs currently in progress will be affected; you will need to "reconnect" to continue, at which point you can resubmit your job. Once an HS2 instance goes down, ZooKeeper will not forward connection requests to that server, so by reconnecting after an HS2 failure you will connect to a working HS2 instance.
URL Syntax
jdbc:hive2://zookeeper_quorum|hs2_host:port/[db][;principal=<hs2_principal>/<hs2_host>|_HOST@<KDC_REALM>][;transportMode=binary|http][;httpPath=<http_path>][;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=<zk_namespace>][;ssl=true|false][;sslKeyStore=<key_store_path>][;keyStorePassword=<key_store_password>][;sslTrustStore=<trust_store_path>][;trustStorePassword=<trust_store_password>][;twoWay=true|false]
Assumptions:
HS2 Host(s): m1.hdp.local and m2.hdp.local
HS2 Binary Port: 10010
HS2 HTTP Port: 10011
ZooKeeper Quorum: m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181
HttpPath: cliservice
HS2 ZooKeeper Namespace: hiveserver2
User: barney
Password: bedrock
NOTE: <db> is the database in the examples below and is optional. The leading slash '/' is required.
WARNING: When using 'beeline' and specifying the connection url (-u) at the command line, be sure to quote the url.
Non-Secure Environments
Direct - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10010/<db>"
Direct - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10011/<db>;transportMode=http;httpPath=cliservice"
ZooKeeper - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=hiveserver2"
ZooKeeper - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=hiveserver2;transportMode=http;httpPath=cliservice"
Alternate Connectivity Thru Knox
jdbc:hive2://<knox_host>:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=<password>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/<CLUSTER>/hive
Secure Environments
Additional Assumptions
KDC Realm: HDP.LOCAL
HS2 Principal: hive
The 'principal' used in the examples below can use either the FQDN of the HS2 host or '_HOST'. '_HOST' is globally replaced based on your Kerberos configuration, provided you haven't altered the default Kerberos regex patterns in ...
NOTE: The client is required to 'kinit' before connecting through JDBC. The -n and -p (user / password) options aren't necessary; they are handled by the Kerberos ticket principal.
Direct - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10010/<db>;principal=hive/_HOST@HDP.LOCAL"
Direct - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10011/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"
ZooKeeper - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;principal=hive/_HOST@HDP.LOCAL;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=hiveserver2"
ZooKeeper - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"
10-13-2015
02:42 PM
9 Kudos
It's coming up more often now: the need to "add" a host to an existing cluster without using the Ambari UI. I once thought Blueprints were only useful to "initially" provision a cluster, but they're quite helpful for extending your cluster as well.
Provision the Initial Cluster using Auto-Discovery
In the following example, we'll use a new feature in Ambari 2.1, called "Auto-Discovery", to provision the first node in our cluster.
Do NOT manually register the host with the Ambari Server yet. We're going to demonstrate "Auto-Discovery" to initialize new hosts.
Check that NO hosts have been registered with the cluster:
curl -i -H "X-Requested-By: ambari" -u admin:xxxx -X GET http://<ambari-server:port>/api/v1/hosts
The Blueprint must be registered with Ambari:
{
"Blueprints": {
"stack_name": "HDP",
"stack_version": "2.3"
},
"configurations": [],
"host_groups": [
{
"cardinality": "1",
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "ZOOKEEPER_CLIENT"
}
],
"configurations": [],
"name": "zookeeper_host_group"
},
{
"cardinality": "1",
"components": [
{
"name": "KAFKA_BROKER"
}
],
"configurations": [],
"name": "kafka_host_group"
}
]
}
Save the above to a file and register the Blueprint with your cluster:
curl -i -H "X-Requested-By: ambari" -u admin:admin -X POST -d @myblueprint.json http://ambari-server:8080/api/v1/blueprints/basic1
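To confirm the Blueprint registered as expected, you can read it back through the same API; a quick check (host, port, and credentials are placeholders, as above):
curl -i -H "X-Requested-By: ambari" -u admin:admin -X GET http://ambari-server:8080/api/v1/blueprints/basic1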
Create your cluster
Now let's create our cluster and use the Auto-Discovery feature to provision the host(s). This is the 'template' file we use to provision the cluster:
{
"blueprint" : "basic1",
"host_groups" :[
{
"name" : "zookeeper_host_group",
"host_count" : "1",
"host_predicate" : "Hosts/cpu_count>0"
}
]
}
Save this template to a file and submit it to Ambari to create the cluster (a sketch of the call follows below). At this point the cluster has been created in Ambari, but you should NOT have any hosts. The Ambari Server is sitting around waiting for a host to be registered with the cluster that matches the "predicate" above. The "host_count" is also an important part of the criteria: provisioning will not start until the number of hosts in "host_count" is available. For our demonstration here, we've set it to 1. In the real world, you would need 3 to create a valid ZooKeeper quorum.
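The submission itself is a POST of the template against the cluster resource; a minimal sketch, assuming a file name of mycluster_template.json and a cluster name of 'mycluster' (both placeholders):
curl -i -H "X-Requested-By: ambari" -u admin:xxxx -X POST -d @mycluster_template.json http://<ambari-server:port>/api/v1/clusters/mycluster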
Manually Register a Host (just one, for now)
Install an Ambari Agent on your host and configure it to talk to Ambari. The host should match the predicate above, "cpu_count>0", which would match just about anything; but you get the point. Start the ambari-agent and return to the Ambari Server UI. In short order, you should see operations kicking off to provision the node and add the service. Now you have a cluster with 1 node and a running ZooKeeper service.
Let's Expand our Cluster
With a second host, let's manually register the agent with Ambari. Check via the API that the host has been registered with the Ambari Server:
curl -i -H "X-Requested-By: ambari" -u admin:xxxx -X GET http://<ambari-server:port>/api/v1/hosts
Notice that the previously registered Blueprint "basic1" has a host group for Kafka Brokers. Let's add this newly registered host to our cluster as a Kafka Broker. Create a file to contain the message body and save it:
{
"blueprint" : "basic1",
"host_group" : "kafka_host_group"
}
Add and provision the host with the services specified in the Blueprint:
curl -i -H "X-Requested-By: ambari" -u admin:xxxx -X POST -d @kafka_host.json http://<ambari-server:port>/api/v1/clusters/<yourcluster>/hosts/<new_host_id>
The <new_host_id> is the name the host registered with Ambari under, which should be the host's FQDN. Double-check the name returned by the hosts query above. This will add the host to your cluster and provision it with the services configured in the Blueprint.
Summary
In the past, Blueprints were used to initialize clusters. Now you can also use them to "extend" your cluster. Avoid trying to reverse engineer the process of installing and configuring services through the REST API; use "Blueprints". You'll save yourself a lot of effort!!
References
Ambari Blueprints
10-02-2015
09:08 PM
8 Kudos
Goal
Create and set up IPA keytabs for HDP. Tested on CentOS 7 with IPA 4, HDP 2.3.0, and Ambari 2.1.1.
Assumptions
Ambari does NOT currently (2.1.x) support the automatic generation of keytabs against IPA.
The IPA Server is already installed, and IPA clients have been installed/configured on all HDP cluster nodes.
If you are using Accumulo, you need to create the user 'tracer' in IPA. A keytab for this user will be requested for the Accumulo Tracer.
We'll use Ambari's 'resources' directory to simplify the distribution of scripts and keys.
Enable Kerberos using the wizard
In Ambari, start the security wizard by clicking Admin -> Kerberos and clicking Enable Kerberos. Then select the "Manage Kerberos principals and keytabs manually" option and enter your realm.
(Optional, read on before doing this...) Remove the cluster name from the smoke/hdfs principals, i.e. drop the -${cluster_name} references so they look like the list below:
Smoke user principal: ${cluster-env/smokeuser}@${realm}
HDFS user principal: ${hadoop-env/hdfs_user}@${realm}
HBase user principal: ${hbase-env/hbase_user}@${realm}
Spark user principal: ${spark-env/spark_user}@${realm}
Accumulo user principal: ${accumulo-env/accumulo_user}@${realm}
Accumulo Tracer user: tracer@${realm}
If you don't remove the cluster name from the above, Ambari will generate/use principal names that are specific to your cluster. This could be very important if you are supporting multiple clusters with the same IPA implementation.
On the next page, download the csv file but DO NOT click Next yet!
If you are deploying Storm, the storm user may be missing from the storm USER row. If you see something like the below:
storm@HORTONWORKS.COM,USER,,/etc
replace the ,, with ,storm, so it reads:
storm@HORTONWORKS.COM,USER,storm,/etc
Copy the csv to the Ambari Server and place it in the /var/lib/ambari-server/resources directory, making sure to remove the header row and any empty lines at the end (a small cleanup sketch follows below).
vi kerberos.csv
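If you prefer not to edit the file by hand, a sed one-liner can do the same cleanup; a minimal sketch, assuming the file has a single header row and lives at the path above:
# Drop the first (header) line and any blank lines, in place.
sed -i -e '1d' -e '/^[[:space:]]*$/d' /var/lib/ambari-server/resources/kerberos.csv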
From any IPA client node, create the principals and service accounts using the csv file.
## authenticate
kinit admin
AMBARI_HOST=<ambari_host>
# Get the kerberos.csv file from Ambari
wget http://${AMBARI_HOST}:8080/resources/kerberos.csv -O /tmp/kerberos.csv
# Create IPA service entries.
awk -F"," '/SERVICE/ {print "ipa service-add --force "$3}' /tmp/kerberos.csv | sort -u > ipa-add-spn.sh
sh ipa-add-spn.sh
# Create IPA User accounts
awk -F"," '/USER/ {print "ipa user-add "$5" --first="$5" --last=Hadoop --shell=/sbin/nologin"}' /tmp/kerberos.csv | sort | uniq > ipa-add-upn.sh
sh ipa-add-upn.sh
On the Ambari node, authenticate and create the keytabs for the headless user accounts, and initialize the service keytabs.
## authenticate
sudo echo '<kerberos_password>' | kinit --password-file=STDIN admin
## or (IPA 4)
sudo echo '<kerberos_password>' | kinit -X password-file=STDIN admin
This should be run as root (adjust for your Ambari host/port).
AMBARI_HOST_PORT=<ambari_host>:<port>
wget http://${AMBARI_HOST_PORT}/resources/kerberos.csv -O /tmp/kerberos.csv
ipa_server=$(grep server /etc/ipa/default.conf | awk -F= '{print $2}')
if [ "${ipa_server}X" == "X" ]; then
ipa_server=$(grep host /etc/ipa/default.conf | awk -F= '{print $2}')
fi
if [ -d /etc/security/keytabs ]; then
mv -f /etc/security/keytabs /etc/security/keytabs.`date +%Y%m%d%H%M%S`
fi
mkdir -p /etc/security/keytabs
chown root:hadoop /etc/security/keytabs/
if [ ! -d /var/lib/ambari-server/resources/etc/security/keytabs ]; then
mkdir -p /var/lib/ambari-server/resources/etc/security/keytabs
fi
grep USER /tmp/kerberos.csv | awk -F"," '{print "ipa-getkeytab -s '${ipa_server}' -p "$3" -k "$6";chown "$7":"$9,$6";chmod "$11,$6}' | sort -u > gen_keytabs.sh
# Copy the 'user' keytabs to the Ambari Resources directory for distribution.
echo "cp -f /etc/security/keytabs/*.* /var/lib/ambari-server/resources/etc/security/keytabs/" >> gen_keytabs.sh
# ReGenerate Keytabs for all the required Service Account, EXCEPT for the HTTP service account on the IPA Server host.
grep SERVICE /tmp/kerberos.csv | awk -F"," '{print "ipa-getkeytab -s '${ipa_server}' -p "$3" -k "$6";chown "$7":"$9,$6";chmod "$11,$6}' | sort -u | grep -v HTTP\/${ipa_server} >> gen_keytabs.sh
# Allow the 'admins' group to retrieve the keytabs.
grep SERVICE /tmp/kerberos.csv | awk -F"," '{print "ipa service-allow-retrieve-keytab "$3" --group=admins"}' | sort -u >> gen_keytabs.sh
bash ./gen_keytabs.sh
# Now remove the keytabs, they'll be replaced by the distribution phase.
mv -f /etc/security/keytabs /etc/security/genedkeytabs.`date +%Y%m%d%H%M%S`
mkdir /etc/security/keytabs
chown root:hadoop /etc/security/keytabs
Build a distribution script used to create host-specific keytabs (adjust for your Ambari host/port).
vi retrieve_keytabs.sh
# Set the location of Ambari
AMBARI_HOST_PORT=<ambari_host>:<port>
# Retrieve the kerberos.csv file from the wizard
wget http://${AMBARI_HOST_PORT}/resources/kerberos.csv -O /tmp/kerberos.csv
ipa_server=$(grep server /etc/ipa/default.conf | awk -F= '{print $2}')
if [ "${ipa_server}X" == "X" ]; then
ipa_server=$(grep host /etc/ipa/default.conf | awk -F= '{print $2}')
fi
if [ ! -d /etc/security/keytabs ]; then
mkdir -p /etc/security/keytabs
fi
chown root:hadoop /etc/security/keytabs/
# Retrieve WITHOUT recreating the existing keytabs for each account.
grep USER /tmp/kerberos.csv | awk -F"," '/'$(hostname -f)'/ {print "wget http://'$( echo ${AMBARI_HOST_PORT})'/resources"$6" -O "$6";chown "$7":"$9,$6";chmod "$11,$6}' | sort -u > get_host_keytabs.sh
grep SERVICE /tmp/kerberos.csv | awk -F"," '/'$(hostname -f)'/ {print "ipa-getkeytab -s '$(echo $ipa_server)' -r -p "$3" -k "$6";chown "$7":"$9,$6";chmod "$11,$6}' | sort -u >> get_host_keytabs.sh
bash ./get_host_keytabs.sh
Copy the file to the Ambari Server's resources directory for distribution.
scp retrieve_keytabs.sh root@<ambari_host>:/var/lib/ambari-server/resources
On each HDP node, via 'pdsh', authenticate and create the keytabs.
# Should be logged in as 'root'
pdsh -g <host_group> -l root
> ## authenticate as the KDC Admin
> echo '<kerberos_password>' | kinit --password-file=STDIN admin
> # or (for IPA 4)
> echo '<kerberos_password>' | kinit -X password-file=STDIN admin
> wget http://<ambari_host>:8080/resources/retrieve_keytabs.sh -O /tmp/retrieve_keytabs.sh
> bash /tmp/retrieve_keytabs.sh
> ## Verify kinit works before proceeding (should not give errors)
> # Service Account Check (replace REALM with yours)
> sudo -u hdfs kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@HORTONWORKS.COM
> # Headless Check (check on multiple hosts)
> sudo -u ambari-qa kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari-qa@HORTONWORKS.COM
> sudo -u hdfs kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@HORTONWORKS.COM
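If any of the kinit checks above fail, it helps to confirm which principals a generated keytab actually contains; klist can read a keytab directly. A minimal check from any node, assuming the standard keytab locations used above:
# List the principals (and key timestamps) stored in a keytab
klist -kt /etc/security/keytabs/hdfs.headless.keytab
klist -kt /etc/security/keytabs/nn.service.keytab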
WARNING: The IPA UI may be broken after the above procedure.
When the process to build keytabs for services is run on the same host that IPA lives on, it will invalidate the keytab used by Apache HTTPD to authenticate. I've added a step that should eliminate the re-creation of that keytab, but just in case, replace /etc/httpd/conf/ipa.keytab with /etc/security/keytabs/spnego.service.keytab:
cd /etc/httpd/conf
mv ipa.keytab ipa.keytab.orig
cp /etc/security/keytabs/spnego.service.keytab ipa.keytab
chown apache:apache ipa.keytab
service httpd restart
Remove the headless.keytabs.tgz file from /var/lib/ambari-server/resources on the Ambari Server.
Press Next in the security wizard and proceed to stop the services. Proceed with the next steps of the wizard by clicking Next. Once completed, click Complete; the cluster is now kerberized.
Using your Kerberized cluster
Try to run commands without authenticating to Kerberos:
$ hadoop fs -ls /
15/07/15 14:32:05 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
$ curl -u someuser -skL "http://$(hostname -f):50070/webhdfs/v1/user/?op=LISTSTATUS"
<title>Error 401 Authentication required</title>
Get a Kerberos ticket.
## for the current user
sudo su - gooduser
kinit
## for any other user
kinit someuser
Use the cluster
Hadoop Commands
$ hadoop fs -ls /
Found 8 items
[...]
WebHDFS
## note the addition of `--negotiate -u : `
curl -skL --negotiate -u : "http://$(hostname -f):50070/webhdfs/v1/user/?op=LISTSTATUS"
Hive (using Beeline or another Hive JDBC client)
Hive in Binary mode (the default)
beeline -u "jdbc:hive2://localhost:10000/default;principal=hive/$(hostname -f)@HORTONWORKS.COM"
Hive in HTTP mode
## note the update to use HTTP and the need to provide the kerberos principal.
beeline -u "jdbc:hive2://localhost:10001/default;transportMode=http;httpPath=cliservice;principal=HTTP/$(hostname -f)@HORTONWORKS.COM"
Thank you to @abajwa@hortonworks.com (Ali Bajwa) for his original workshop, which this article is intended to extend: https://github.com/abajwa-hw/security-workshops
10-02-2015
08:51 PM
5 Kudos
The Phoenix documentation leaves out a few pieces needed to make a successful connection to HBase through the Phoenix driver. It assumes that the connection is made from 'localhost'. That may work great, but it's unlikely in the real world.
Required Jars
phoenix-client.jar
hbase-client.jar (not mentioned in the Phoenix docs)
URL Details
The full JDBC connection URL syntax for Phoenix is:
Basic Connections
jdbc:phoenix[:zk_quorum][:zk_port][:zk_hbase_path]
The "zk_quorum" is a comma-separated list of the ZooKeeper servers. The "zk_port" is the ZooKeeper port. The "zk_hbase_path" is the path used by HBase to store information about the instance. On a non-kerberized cluster, the default zk_hbase_path for HDP is '/hbase-unsecure'.
Sample JDBC URL
jdbc:phoenix:m1.hdp.local,m2.hdp.local,d1.hdp.local:2181:/hbase-unsecure
For Kerberos Cluster Connections
jdbc:phoenix[:zk_quorum][:zk_port][:zk_hbase_path][:headless_keytab_file:principal]
Sample JDBC URL
jdbc:phoenix:m1.hdp.local,m2.hdp.local,d1.hdp.local:2181:/hbase-secure:/Users/dstreev/keytabs/myuser.headless.keytab:dstreev@HDP.LOCAL
The docs say that the items in the URL beyond 'phoenix' are optional if the cluster's 'hbase-site.xml' file is on the classpath. I found that when I provided a copy of the cluster's 'hbase-site.xml' and left off the optional elements, I got errors referencing 'Failed to create local dir'. When I connected successfully from DBVisualizer to HBase with the Phoenix JDBC driver, it took about 10-20 seconds to connect.
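Before wiring up an external JDBC client, it can be worth sanity-checking the quorum, port, and znode values with the sqlline shell that ships with the Phoenix client; a minimal sketch, assuming the standard HDP phoenix-client location and the non-secure znode from the sample above:
# Connect with the Phoenix sqlline shell using the same quorum:port:znode string
/usr/hdp/current/phoenix-client/bin/sqlline.py m1.hdp.local,m2.hdp.local,d1.hdp.local:2181:/hbase-unsecure
# Once connected, "!tables" is a quick smoke test.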
10-02-2015
08:48 PM
7 Kudos
httpfs is needed to support a centralized WebHDFS interface to an HA-enabled NameNode cluster. It can be used by Hue or any other WebHDFS-enabled client that needs to use a cluster configured with a High-Availability NameNode.
The installation is a piece of cake:
yum install hadoop-httpfs
But that's where the fun ends!!! Configuring it is a whole other thing. It's not hard, if you know the right buttons to push. Unfortunately, the buttons and directions for doing this can be quite hard to find. The httpfs service is a Tomcat application that relies on having the Hadoop libraries and configuration available, so it can resolve your HDP installation. When you do the installation (above), a few items are installed:
/usr/hdp/2.2.x.x-x/hadoop-httpfs
/etc/hadoop-httpfs/conf
/etc/hadoop-httpfs/tomcat-deployment
Configuring - The Short Version
Set the version for 'current' with hdp-select:
hdp-select set hadoop-httpfs 2.2.x.x-x
From this point on, many of our changes are designed to "fix" the "hardcoded" implementations in the deployed scripts. Adjust the /usr/hdp/current/hadoop-httpfs/sbin/httpfs.sh script:
#!/bin/bash
# Autodetect JAVA_HOME if not defined
if [ -e /usr/libexec/bigtop-detect-javahome ]; then
. /usr/libexec/bigtop-detect-javahome
elif [ -e /usr/lib/bigtop-utils/bigtop-detect-javahome ]; then
. /usr/lib/bigtop-utils/bigtop-detect-javahome
fi
### Added to assist with locating the right configuration directory
export HTTPFS_CONFIG=/etc/hadoop-httpfs/conf
### Remove the original HARD CODED Version reference... I mean, really???
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_LIBEXEC_DIR=${HADOOP_HOME}/libexec
exec /usr/hdp/current/hadoop-httpfs/sbin/httpfs.sh.distro "$@"
Now let's create a few symlinks to connect the pieces together:
cd /usr/hdp/current/hadoop-httpfs
ln -s /etc/hadoop-httpfs/tomcat-deployment/conf conf
ln -s ../hadoop/libexec libexec
Like all the other Hadoop components, httpfs uses *-env.sh files to control the startup environment. Above, in the httpfs.sh script, we set the location of the configuration directory. That is used to find and load the httpfs-env.sh file we'll modify below.
# Add these to control and set the Catalina directories for starting and finding the httpfs application
export CATALINA_BASE=/usr/hdp/current/hadoop-httpfs
export HTTPFS_CATALINA_HOME=/etc/hadoop-httpfs/tomcat-deployment
# Set a log directory that matches your standards
export HTTPFS_LOG=/var/log/hadoop/httpfs
# Set a tmp directory for httpfs to store interim files
export HTTPFS_TEMP=/tmp/httpfs
That's it!! Now run it!
cd /usr/hdp/current/hadoop-httpfs/sbin
./httpfs.sh start
# To Stop
./httpfs.sh stop
Try it out!!
http://m1.hdp.local:14000/webhdfs/v1/user?user.name=hdfs&op=LISTSTATUS
Obviously, swap in your target host. The default port is 14000. If you want to change that, add the following to httpfs-env.sh:
export HTTPFS_HTTP_PORT=<new_port>
Want to Add httpfs as a Service (auto-start)?
The HDP installation puts a set of init.d files in the specific version's directory:
cd /usr/hdp/<hdp.version>/etc/rc.d/init.d
Create a symlink to this in /etc/init.d:
ln -s /usr/hdp/<hdp.version>/etc/rc.d/init.d/hadoop-httpfs /etc/init.d/hadoop-httpfs
Then set up the service to run on restart.
# As Root User
chkconfig --add hadoop-httpfs
# Start Service
service hadoop-httpfs start
# Stop Service
service hadoop-httpfs stop
This method will run the service as the 'httpfs' user. Ensure that the 'httpfs' user has permissions to write to the log directory (/var/log/hadoop/httpfs if you followed these directions).
A Little More Detail
Proxies are fun, aren't they? Well, they'll affect you here as well. The directions here mention these proxy settings in core-site.xml:
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
This means that httpfs.sh must be run as the 'httpfs' user in order to work. If you want to run the service as another user, adjust the proxy settings accordingly; for example, to run as 'root':
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
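With the proxy settings in place and httpfs running, a quick end-to-end check can be run from any node; a minimal sketch, assuming the host and default port from the "Try it out" example above on a non-secure cluster:
# List /user through httpfs; expect a JSON FileStatuses response
curl -s "http://m1.hdp.local:14000/webhdfs/v1/user?user.name=hdfs&op=LISTSTATUS"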
09-25-2015
05:47 PM
1 Kudo
Managing the Capacity Scheduler via text can get messy, and it requires a bit of research by the user to figure out what settings are available. Try using the Capacity Scheduler View in Ambari 2.1+; it will make managing queues much simpler. Note: you'll need to set it up first through "Manage Ambari".
A sample of a complex queue layout.
Through this interface, you can also manage newer features in YARN that map users to queues via ACLs. If you're using Ranger to secure HDP, the YARN plugin will extend this capability even more!!!