Member since: 07-30-2019
Posts: 53
Kudos Received: 136
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
| 5615 | 01-30-2017 05:05 PM
| 2487 | 01-13-2017 03:46 PM
| 1397 | 01-09-2017 05:36 PM
| 698 | 01-09-2017 05:29 PM
| 575 | 10-07-2016 03:34 PM
06-13-2018
10:13 AM
You could run TeraGen/TeraSort for this. Here's a script on my gist.github.com page that can be run against an HDP cluster; it lets you control the size and the number of mappers and reducers from the command line, and even lets you experiment with block sizes.
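For reference, a minimal TeraGen/TeraSort run looks roughly like this; the jar location is typical for HDP installs and the sizes, mapper/reducer counts, and output paths are just examples to adjust:

# Generate ~100 GB (1,000,000,000 rows x 100 bytes), controlling maps and block size
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen \
  -Dmapreduce.job.maps=40 -Ddfs.blocksize=268435456 1000000000 /benchmarks/teragen
# Sort it, controlling the number of reducers
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort \
  -Dmapreduce.job.reduces=40 /benchmarks/teragen /benchmarks/terasort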
04-04-2018
11:42 AM
Does the user have access (at the file-system level) to the warehouse directory you've specified? The docs seem to indicate that 'spark.sql.warehouse.dir' is optional when Hive is already present and you're attaching to a metastore: --- Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. --- Try omitting that setting from your application.
04-04-2018
11:31 AM
Are all the nodes sharing the same user/group mapping? The NameNode is responsible for doing the group lookup for the user, so if the user/group mapping isn't present there, your results will not match.
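A quick way to compare is to ask the NameNode which groups it resolves for the user and check that against the local mapping (the user name is a placeholder):

# Group mapping as seen by the NameNode
hdfs groups <username>
# Group mapping on the local node, for comparison
id <username>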
06-29-2017
05:53 PM
2 Kudos
That will only apply to newly created files. For data that already exists, use the HDFS client to set the replication factor on the directory's contents to the new value.
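A sketch of that client call; the path and replication factor are examples:

# Set replication to 2 for everything under /data/archive (-w waits for the change to complete)
hdfs dfs -setrep -w 2 /data/archive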
01-30-2017
05:05 PM
Rahul, are the logs making it to HDFS? It sounds like you might be combining the "spooling" directory with the "local audit archive directory".

What properties did you use during the Ranger HDFS plugin installation? Are you doing a manual install or using Ambari? If manual, this reference might help: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_command-line-installation/content/installing_ranger_plugins.html#installing_ranger_hdfs_plugin

I wasn't able to locate your "...filespool.archive.dir" property on my cluster. I'm not sure the property is required, and it may be responsible for keeping files "locally" that you've already posted to HDFS. If the files are making it to HDFS, I would try removing this setting.

What do you have set for the property below, and are the contents being flushed from that location on a regular basis?

xasecure.audit.destination.hdfs.batch.filespool.dir

Compression doesn't happen during this process. Once the files are on HDFS, you're free to do with them as you see fit. If compression is part of that, write an MR job to do it. (WARNING: this could affect other systems that might want to use these files as-is.)

Cheers,
David
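As a quick sanity check, compare what's sitting in the local spool directory against what has landed in HDFS; both paths below are examples and should be replaced with your configured values:

# Local spool directory (the value of xasecure.audit.destination.hdfs.batch.filespool.dir)
ls -l /var/log/hadoop/hdfs/audit/hdfs/spool
# HDFS audit destination
hdfs dfs -ls -R /ranger/audit/hdfs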
01-27-2017
01:29 PM
Those are intermediate directories used to store the stream of activity locally before it's written to HDFS. You should have destination directories in HDFS for the final resting place. In my experience, when this issue happens and you don't see those directories in HDFS, it's either a permissions issue or the directories simply need to be created manually. You may need to create the directories in HDFS yourself and ensure they have the proper ACLs to allow them to be written to by the process.
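A sketch of pre-creating the destination manually; the path and ownership are examples and should match your Ranger audit configuration:

hdfs dfs -mkdir -p /ranger/audit/hdfs
hdfs dfs -chown hdfs:hdfs /ranger/audit/hdfs
hdfs dfs -chmod 750 /ranger/audit/hdfs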
01-13-2017
03:46 PM
distcp recognizes the s3[a] protocols using the default libraries already available in Hadoop. For example, moving data from Hadoop to S3:

hadoop distcp <current_cluster_folder> s3[a]://<bucket_info>

If you're looking for a way to manage access (via AWS keys) to S3 buckets in Hadoop, this article describes a secure way to do that: https://community.hortonworks.com/articles/59161/using-hadoop-credential-api-to-store-aws-secrets.html
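As a concrete sketch (the bucket and paths are made up for illustration; the credential provider path assumes you've created one as described in that article):

# Straight copy to S3 using the s3a connector
hadoop distcp hdfs:///data/exports s3a://my-bucket/exports
# Same copy, pulling the AWS keys from a Hadoop credential provider instead of the command line
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/user/admin/aws.jceks \
  hdfs:///data/exports s3a://my-bucket/exports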
01-13-2017
03:23 PM
Jacqualin, yes, the local dir and log dir both support multiple locations, and I advise using multiple locations to scale better. These directories aren't in HDFS and therefore don't support HDFS replication, but that's OK: they're used for file caches and intermediate data. If you lose a drive in the middle of processing, only the "task" is affected, and it may fail; in that case, the task is rescheduled somewhere else, so the job as a whole isn't affected. A failed drive in yarn_local_dir is OK, as the NodeManager will tag it and not use it going forward. One more reason to have more than one drive specified here. BUT, in older versions of YARN, a failed drive can prevent the NodeManager from "starting" or "restarting." It's pretty clear in the NodeManager logs if you have issues with it starting at any time. YARN also indicates drive failures in the ResourceManager UI. Newer versions of YARN are a bit more forgiving on startup.
01-09-2017
05:36 PM
1 Kudo
Having multiple values here allows for better scalability and performance for YARN and its intermediate writes/reads. Much like HDFS uses multiple directories (preferably on different mount points/physical drives), YARN local dirs can use this to spread the IO load. I've also seen customers use SSD drives for YARN local dirs, which can significantly improve job performance. For example: a 12-drive system with 8 SATA drives for HDFS directories and 4 smaller, fast SSD drives for YARN local dirs.
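Purely as an illustration of that layout (the mount points below are hypothetical), the relevant properties would look something like:

# 4 SSD mounts dedicated to YARN local dirs, HDFS data dirs on the 8 SATA mounts
# yarn.nodemanager.local-dirs=/grid/ssd0/yarn/local,/grid/ssd1/yarn/local,/grid/ssd2/yarn/local,/grid/ssd3/yarn/local
# dfs.datanode.data.dir=/grid/0/hdfs/data,/grid/1/hdfs/data,/grid/2/hdfs/data,/grid/3/hdfs/data,/grid/4/hdfs/data,/grid/5/hdfs/data,/grid/6/hdfs/data,/grid/7/hdfs/data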
01-09-2017
05:29 PM
Could you please identify which version of Ambari you are running? In these situations, I usually drop down to the host that is presenting the issue and try to run the command there; this may provide a bit more detail on the actual problem. In this case, you may find that you need to remove the offending package (yum erase <specific_package>) and then have Ambari try to reinstall the packages.
12-15-2016
01:40 PM
9 Kudos
The Problem

Traditional 'distcp' from one directory to another, or from cluster to cluster, is quite useful for moving massive amounts of data, once. But what happens when you need to "update" a target directory or cluster with only the changes made since the last 'distcp' run? That becomes a very tricky scenario. 'distcp' offers an '-update' flag, which is supposed to move only the files that have changed. In this case 'distcp' pulls a list of files and directories from the source and the target, compares them, and then builds a migration plan. The problem: it's an expensive and time-consuming task, and the process is not atomic. First, the cost of gathering a list of files and directories, along with their metadata, is high when you're considering sources with millions of file and directory objects. And this cost is incurred on both the source and target namenodes, resulting in quite a bit of pressure on those systems. It's up to 'distcp' to reconcile the difference between the source and target, which is very expensive. Only when that's finally complete does the process start to move data. And if data changes while the process is running, those changes can impact the transfer and lead to failure and partial migration.

The Solution

The process needs to be atomic, and it needs to be efficient. With Hadoop 2.0, HDFS introduced "snapshots." An HDFS "snapshot" is a point-in-time copy of a directory's metadata. The copy is stored in a hidden location and maintains references to all of the immutable filesystem objects. Creating a snapshot is atomic, and the immutable nature of HDFS means that an image of a directory's metadata doesn't require an additional copy of the underlying data. Another feature of snapshots is the ability to efficiently calculate changes between 'any' two snapshots on the same directory. Using 'hdfs snapshotDiff', you can build a list of "changes" between these two point-in-time references. For example:

[hdfs@m3 ~]$ hdfs snapshotDiff /user/dstreev/stats s1 s2
Difference between snapshot s1 and snapshot s2 under directory /user/dstreev/stats:
M .
+ ./attempt
M ./namenode/fs_state/2016-12.txt
M ./namenode/nn_info/2016-12.txt
M ./namenode/top_user_ops/2016-12.txt
M ./scheduler/queue_paths/2016-12.txt
M ./scheduler/queue_usage/2016-12.txt
M ./scheduler/queues/2016-12.txt
Let's take the 'distcp -update' concept and supercharge it with the efficiency of snapshots. Now you have a solution that will scale far beyond the original 'distcp -update' and, in the process, remove the burden on the namenodes encountered previously.

Pre-Requisites and Requirements

- The source must support 'snapshots': hdfs dfsadmin -allowSnapshot <path>
- The target is "read-only".
- The target, after the initial baseline 'distcp' sync, needs to support snapshots as well.

Process

1. Identify the source and target 'parent' directory. Do not initially create the destination directory; allow the first distcp to do that. For example: if I want to sync source /data/a with /data/a_target, do NOT pre-create the 'a_target' directory.
2. Allow snapshots on the source directory: hdfs dfsadmin -allowSnapshot /data/a
3. Create a snapshot of /data/a: hdfs dfs -createSnapshot /data/a s1
4. Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to the following command. hadoop distcp /data/a/.snapshot/s1 /data/a_target
5. Allow snapshots on the newly created target directory: hdfs dfsadmin -allowSnapshot /data/a_target
   At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
6. Create a matching snapshot in /data/a_target with the same name as the snapshot used to build the baseline: hdfs dfs -createSnapshot /data/a_target s1
7. Add some content to the source directory /data/a. Make changes, adds, deletes, etc. that need to be replicated to /data/a_target.
8. Take a new snapshot of /data/a: hdfs dfs -createSnapshot /data/a s2
9. Just for fun, check on what's changed between the two snapshots: hdfs snapshotDiff /data/a s1 s2
10. OK, now let's migrate the changes to /data/a_target: hadoop distcp -diff s1 s2 -update /data/a /data/a_target
11. When that's completed, finish the cycle by creating a matching snapshot on /data/a_target: hdfs dfs -createSnapshot /data/a_target s2

That's it. You've completed the cycle. Rinse and repeat.

A Few Hints

- Remember, snapshots need to be managed manually. They will stay around forever unless you clean them up with: hdfs dfs -deleteSnapshot <path> <snapshotName>
- As long as a snapshot exists, the data exists. Deleting data (even with -skipTrash) from a directory that has a snapshot doesn't free up space; only when all "references" to that data are gone can the space be reclaimed.
- Initial migrations of data between systems are very expensive in terms of network I/O, and you probably don't want to have to do that again, ever. I recommend keeping a snapshot of the original copy on each system, OR some major checkpoint you can go back to, in the event the process is compromised.
- If 'distcp' can't validate that the snapshot (by name) is the same on the source and the target, and that the data at the target hasn't changed since that snapshot, the process will fail. If the failure is because the target directory has been updated, you'll need to use the baseline snapshots above to restore it without having to migrate all that data again, and then start the process up again.
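For quick reference, here is the full cycle condensed into one sequence, using the same example paths and snapshot names as the steps above:

hdfs dfsadmin -allowSnapshot /data/a
hdfs dfs -createSnapshot /data/a s1
hadoop distcp /data/a/.snapshot/s1 /data/a_target
hdfs dfsadmin -allowSnapshot /data/a_target
hdfs dfs -createSnapshot /data/a_target s1
# ... changes accumulate in /data/a ...
hdfs dfs -createSnapshot /data/a s2
hdfs snapshotDiff /data/a s1 s2
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
hdfs dfs -createSnapshot /data/a_target s2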
10-07-2016
03:34 PM
Enabling Ranger audits will show who made the SQL call and what query was issued to HS2. This is more "metadata"-centric; the actual data transferred is not logged in any permanent fashion. That would be the responsibility of the client. But the combination of the audit (who and what) along with, possibly, an "hdfs snapshot" can lead to a reproducible scenario.
10-07-2016
03:04 PM
4 Kudos
The journal nodes can be quite IO intensive, while the Namenode is generally more memory and CPU intensive. So one could justify co-locating them. BUT, when it comes to checkpointing, they could conflict. More importantly, delays in writing for the journal node will impact the namenode and result in higher RPC Queue times. With a cluster that size, I would always want to run the namenode by itself. It's far too important to compromise it by co-locating it with another highly active service. And regarding the Journal Node, don't store the journal directories on an LVM that's shared with the OS. Again, the Journal Node is quite IO intensive. And I've seen it project slowness back to the Namenode (in RPC queue times) when they are competing with the OS because they are sharing the same physical disks.
03-22-2016
09:51 PM
4 Kudos
Repo Description

Sessions remember directory context; there's 'tab' completion, Kerberos support, initialization scripts, and a few new 'hdfs' features you wish you had. New extensions help gather runtime statistics from the Namenode, Scheduler, Job History Server, and Container Usage. Added support for "lsp", a directory listing "plus" that shows file information PLUS block information and location. Helpful when determining how well your data is distributed across the cluster and for identifying small-file issues.

Repo Info
Github Repo URL: https://github.com/dstreev/hadoop-cli
Github account name: dstreev
Repo name: hadoop-cli
11-17-2015
02:22 PM
21 Kudos
There are five ways to connect to HS2 with JDBC:
1. Direct - Binary Transport mode (Non-Secure|Secure)
2. Direct - HTTP Transport mode (Non-Secure|Secure)
3. ZooKeeper - Binary Transport mode (Non-Secure|Secure)
4. ZooKeeper - HTTP Transport mode (Non-Secure|Secure)
5. via Knox - HTTP Transport mode
Connecting to HS2 via ZooKeeper (3-4) (and Knox, if backed by ZooKeeper) provides a level of failover that you can't get directly. When connecting through ZooKeeper, the client is handed server connection information from a list of available servers. This list is managed on the backend, and the client isn't aware of the servers before the connection. This allows administrators to add additional servers to the list without reconfiguring the clients.

NOTE: HS2 in this configuration provides failover, but it is not automatic once a connection has been established. JDBC connections are stateful; the data and session information kept on HS2 for a connection is LOST when the server goes down, and jobs currently in progress will be affected. You will need to "reconnect" to continue, at which time you can resubmit your job. Once an HS2 instance goes down, ZooKeeper will not forward connection requests to that server, so by reconnecting after an HS2 failure you will connect to a working HS2 instance.

URL Syntax

jdbc:hive2://zookeeper_quorum|hs2_host:port/[db][;principal=<hs2_principal>/<hs2_host>|_HOST@<KDC_REALM>][;transportMode=binary|http][;httpPath=<http_path>][;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=<zk_namespace>][;ssl=true|false][;sslKeyStore=<key_store_path>][;keyStorePassword=<key_store_password>][;sslTrustStore=<trust_store_path>][;trustStorePassword=<trust_store_password>][;twoWay=true|false]
Assumptions:
- HS2 Host(s): m1.hdp.local and m2.hdp.local
- HS2 Binary Port: 10010
- HS2 HTTP Port: 10011
- ZooKeeper Quorum: m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181
- HttpPath: cliservice
- HS2 ZooKeeper Namespace: hiveserver2
- User: barney
- Password: bedrock

NOTE: <db> in the examples below is the database and is optional. The leading slash '/' is required.

WARNING: When using 'beeline' and specifying the connection url (-u) at the command line, be sure to quote the url.

Non-Secure Environments

Direct - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10010/<db>"

Direct - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10011/<db>;transportMode=http;httpPath=cliservice"

ZooKeeper - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

ZooKeeper - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;transportMode=http;httpPath=cliservice"

Alternate Connectivity Thru Knox
jdbc:hive2://<knox_host>:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=<password>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/<CLUSTER>/hive

Secure Environments

Additional Assumptions
- KDC Realm: HDP.LOCAL
- HS2 Principal: hive

The 'principal' used in the examples below can use either the FQDN of the HS2 host or '_HOST'. '_HOST' is globally replaced based on your Kerberos configuration if you haven't altered the default Kerberos regex patterns in ...

NOTE: The client is required to 'kinit' before connecting through JDBC. The -n and -p (user/password) aren't necessary; they are handled by the Kerberos ticket principal.

Direct - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10010/<db>;principal=hive/_HOST@HDP.LOCAL"

Direct - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10011/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"

ZooKeeper - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@HDP.LOCAL"

ZooKeeper - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"
11-17-2015
10:20 AM
1 Kudo
Almost forgot... Another side effect of small files on a cluster shows up while running the Balancer: you'll move LESS data and increase the impact on the Namenode even further. So YES, small files are bad. But only as bad as you're willing to "pay" for in most cases. If you don't mind doubling the size of the cluster to deal with small files, then do it. But I wouldn't. I'd do a bit of planning and refinement and bank the extra nodes for more interesting projects. 🙂
11-17-2015
10:15 AM
22 Kudos
I've seen several systems with 400+ million objects represented in the Namenode without issues. In my opinion, that's not the "right" question though. Certainly, the classic answer to small files has been the pressure they put on the Namenode, but that's only part of the equation. And with better hardware/CPU and increased memory thresholds, that number has certainly climbed over the years since the small-file problem was first documented.

The better question is: how do small files "impact" cluster performance? Everything is a trade-off when dealing with data at scale. The impact of small files, beyond the Namenode pressures, is more specifically related to "job" performance. Under classic MR, the number of small files controls the number of mappers required to perform a job. Of course, there are tricks to "combine" inputs and reduce this, but that leads to a lot of data moving across the backplane and increased cluster I/O chatter. A mapper, in the classic sense, is a costly resource to allocate. If the actual task done by the mapper is rather mundane, most of the time spent accomplishing your job can be "administrative" in nature, with the construction and management of all those resources.

Consider the impact to a cluster when this happens. For example, I had a client once that was trying to get more from their cluster, but there was a job processing 80,000 files. Which led to the creation of 80,000 mappers. Which led to consuming ALL the cluster resources, several times over. Follow that path a bit further and you'll find that the impact on the Namenode is exacerbated by all of the intermediate files generated by the mappers for the shuffle/sort phases. That's the real impact on a cluster.

A little work in the beginning can have a dramatic effect on the downstream performance of your jobs. Take the time to "refine" your data and consolidate your files. Here's another way to approach it, which is even more evident when dealing with ORC files. Processing a 1Mb file has an overhead to it, so processing 128 1Mb files will cost you 128 times more "administrative" overhead versus processing one 128Mb file. In plain text, that 1Mb file may contain 1,000 records; the 128Mb file might contain 128,000 records. And I've typically seen 85-92% compression ratios with ORC files, so you could safely say that a 128Mb ORC file contains over 1 million records. Sidebar: which may have been why the default stripe size in ORC was changed to 64Mb, instead of 128Mb, a few versions back.

The impact is multi-fold. With data locality, you move less data, process larger chunks of data at a time, generate fewer intermediate files, reduce impact to the Namenode, and increase throughput overall, EVERYWHERE. The system moves away from being I/O bound to being CPU bound. Now you have the opportunity to tune container sizes to match "what" you're doing, because the container is actually "doing" a lot of work processing your data and not "managing" the job.

Sometimes small files can't be avoided, but deal with them early to limit the repetitive impact on your cluster. Here's a list of general patterns to reduce the number of small files:

- Nifi - Use a combine processor to consolidate flows and aggregate data before it even gets to your cluster.
- Flume - Use a tiered Flume architecture to combine events from multiple inputs, producing "right"-sized HDFS files for further refinement.
- Hive - Process the small files regularly and often to produce larger files for "repetitive" processing. And in a classic pattern that incrementally "appends" to a dataset, creating a LOT of files over time, don't be afraid to go back and "reprocess" the file set again to streamline the impact on downstream tasks.
- Sqoop - Manage the number of mappers to generate appropriately sized files.

Oh, and if you NEED to keep those small files as "sources"... archive them using Hadoop archives ('har') and save your Namenode from the cost of managing those resource objects.
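A hedged sketch of that 'har' approach; the paths and archive name here are examples only:

# Packs everything under /data/raw/small into one archive, cutting the Namenode object count
hadoop archive -archiveName small-files.har -p /data/raw/small /data/archives
# The content stays readable through the har:// scheme
hdfs dfs -ls har:///data/archives/small-files.har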
10-14-2015
09:34 AM
1 Kudo
Use NiFi to get the data to HDFS and then Oozie datasets to trigger actions based on data availability. Until NiFi, various versions of the method you describe were common practice.
10-14-2015
09:14 AM
2 Kudos
Try creating a temporary database and move the table 'as is' into the new database. CREATE DATABASE if not exists Junk;
USE targetDB;
ALTER TABLE MyCorruptTable RENAME TO Junk.MyMovedCorruptTable;
DROP DATABASE JUNK Cascade;
10-14-2015
08:47 AM
1 Kudo
You need to increase the memory settings for Ambari. I ran into this a while back with certain views. I added/adjusted the following in /var/lib/ambari-server/ambari-env.sh, appending to "AMBARI_JVM_ARGS":

-Xmx4G -XX:MaxPermSize=512m
10-13-2015
03:07 PM
1 Kudo
It should be GRANT ALL on just its Oozie database, because the 'oozie' user needs to be able to create the schema in the target database.
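For reference, a minimal sketch of such a grant in MySQL; the database name, host, and password here are just examples:

mysql -u root -p -e "GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'oozie-host.example.com' IDENTIFIED BY 'oozie_password'; FLUSH PRIVILEGES;"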
10-13-2015
02:53 PM
1 Kudo
Using that key, and signed in to the Ambari Server as 'root', can you SSH to the target hosts from a command line? If you can't, double-check the permissions of the "public" key on the target hosts: ~/.ssh should be 700 and the authorized_keys file should be 600.
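To fix and verify that on a target host, something along these lines (the private key path is an example):

# On each target host, as the registration user
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Manual connectivity test from the Ambari Server
ssh -i /root/.ssh/id_rsa root@<target_host> hostname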
10-13-2015
02:42 PM
9 Kudos
It's coming up more often now: the need to "add" hosts to an existing cluster without using the Ambari UI. I once thought Blueprints were only useful to "initially" provision a cluster, but they're quite helpful for extending your cluster as well.

Provision the Initial Cluster using Auto-Discovery

In the following example, we'll use a new feature in Ambari 2.1, called "Auto-Discovery", to provision the first node in our cluster. Do NOT manually register the host with Ambari Server yet; we're going to demonstrate "Auto-Discovery" to initialize new hosts.

Check that NO hosts have been registered with the cluster:

curl -i -H "X-Requested-By: ambari" -u admin:xxxx -X GET http://<ambari-server:port>/api/v1/hosts

The Blueprint has to be registered with Ambari:

{
"Blueprints": {
"stack_name": "HDP",
"stack_version": "2.3"
},
"configurations": [],
"host_groups": [
{
"cardinality": "1",
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "ZOOKEEPER_CLIENT"
}
],
"configurations": [],
"name": "zookeeper_host_group"
},
{
"cardinality": "1",
"components": [
{
"name": "KAFKA_BROKER"
}
],
"configurations": [],
"name": "kafka_host_group"
}
]
}

Save the above to a file and register the Blueprint with your cluster:

curl -i -H "X-Requested-By: ambari" -u admin:admin -X POST -d @myblueprint.json http://ambari-server:8080/api/v1/blueprints/basic1

Create your cluster
Now let's create our cluster and use the Auto-Discovery feature to provision the host(s). This is the 'template' file we use to provision the cluster:

{
"blueprint" : "basic1",
"host_groups" :[
{
"name" : "zookeeper_host_group",
"host_count" : "1",
"host_predicate" : "Hosts/cpu_count>0"
}
]
}

Save the template and submit it to Ambari by POSTing it to the clusters endpoint (api/v1/clusters/<your_cluster_name>). At this point the cluster has been created in Ambari, but you should NOT have any hosts. Ambari Server is sitting around waiting for a host matching the "predicate" above to be registered with the cluster. The "host_count" is also an important part of the criteria: the process will not start until the number of hosts in "host_count" is available. For our demonstration here, we've set it to 1; in the real world, you would need 3 to create a valid ZooKeeper quorum.

Manually Register a Host (just one, for now)
Install an Ambari Agent on your host and configure it to talk to Ambari. The host should match the predicate above ("cpu_count>0"), which would be just about everything, but you get the point. Start the Ambari Agent and return to the Ambari Server UI. In short order, you should see operations kicking off to provision the node and add the service. Now you have a cluster with 1 node and a running ZooKeeper service.

Let's Expand our Cluster

With a second host, let's manually register the agent with Ambari. Check via the API that the host has been registered with the Ambari Server:

curl -i -H "X-Requested-By: ambari" -u admin:xxxx -X GET http://<ambari-server:port>/api/v1/hosts

Notice that the previously registered Blueprint "basic1" has a host group for Kafka Brokers. Let's add this newly registered host to our cluster as a Kafka Broker. Create a file to contain the message body and save it:

{
"blueprint" : "basic1",
"host_group" : "kafka_host_group"
}

Add and provision the host with the services specified in the Blueprint:

curl -i -H "X-Requested-By: ambari" -u admin:xxxx -X POST -d @kafka_host.json http://<ambari-server:port>/api/v1/clusters/<yourcluster>/hosts/<new_host_id>

The <new_host_id> is the name the host has been registered to Ambari by, which should be the host FQDN; double-check the name returned by the hosts query above. This will add the host to your new cluster and provision it with the services configured in the Blueprint.

Summary

In the past, Blueprints were used to initialize clusters. Now you can use them to "extend" your cluster as well. Avoid trying to reverse engineer the process of installing and configuring services through the REST API; use "Blueprints". You'll save yourself a lot of effort!!

References

Ambari Blueprints
10-02-2015
10:15 PM
3 Kudos
I've just installed and Kerberized my cluster:

- Ambari 2.1.1
- CentOS 7
- IPA 4 for LDAP and Kerberos (IPA clients configured across the cluster hosts)
- Oracle JDK 1.7.0_79 (with JCE)
- HDP 2.3.0

The cluster comes up just fine and all the services seem to be happy talking to each other, so I'm pretty convinced that all the keytabs are configured correctly. From any node on the cluster, after getting a valid ticket (kinit) and trying to run a basic hdfs command, I get the following (Kerberos debug enabled). This ONLY happens from IPA clients; access from other hosts works fine (read on).

-sh-4.2$ klist
Ticket cache: KEYRING:persistent:100035:krb_ccache_T7mkWNw
Default principal: dstreev@HDP.LOCAL
Valid starting Expires Service principal
10/02/2015 09:17:07 10/03/2015 09:17:04 krbtgt/HDP.LOCAL@HDP.LOCAL
-sh-4.2$ hdfs dfs -ls .
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>>KinitOptions cache name is /tmp/krb5cc_100035
15/10/02 18:07:48 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
15/10/02 18:07:48 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
15/10/02 18:07:48 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over m2.hdp.local/10.0.0.161:8020 after 1 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "m3.hdp.local/10.0.0.162"; destination host is: "m2.hdp.local":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
at org.apache.hadoop.ipc.Client.call(Client.java:1431)
at org.apache.hadoop.ipc.Client.call(Client.java:1358)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1655)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:685)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:648)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:735)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:373)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1493)
at org.apache.hadoop.ipc.Client.call(Client.java:1397)
... 28 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:558)
at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:373)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:723)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:722)
... 31 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
... 40 more
15/10/02 18:07:48 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
(It retries 15-20 times before quitting, which happens really fast.)

If I try to access the cluster from a host that is NOT part of the IPA hosts (my Mac, for example), I do NOT get this error and I can interact with the cluster:

➜ conf klist
Credentials cache: API:D44F3F89-A095-40A5-AA7C-BD06698AA606
Principal: dstreev@HDP.LOCAL
Issued Expires Principal
Oct 2 17:52:13 2015 Oct 3 17:52:00 2015 krbtgt/HDP.LOCAL@HDP.LOCAL
Oct 2 18:06:53 2015 Oct 3 17:52:00 2015 host/m3.hdp.local@HDP.LOCAL
➜ conf hdfs dfs -ls /
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
15/10/02 18:10:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>KinitOptions cache name is /tmp/krb5cc_501
>> Acquire default native Credentials
Using builtin default etypes for default_tkt_enctypes
default etypes for default_tkt_enctypes: 18 17 16 23.
>>> Obtained TGT from LSA: Credentials:
client=dstreev@HDP.LOCAL
server=krbtgt/HDP.LOCAL@HDP.LOCAL
authTime=20151002215213Z
startTime=20151002215213Z
endTime=20151003215200Z
renewTill=20151009215200Z
flags=FORWARDABLE;RENEWABLE;INITIAL;PRE-AUTHENT
EType (skey)=18
(tkt key)=18
15/10/02 18:10:59 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Found ticket for dstreev@HDP.LOCAL to go to krbtgt/HDP.LOCAL@HDP.LOCAL expiring on Sat Oct 03 17:52:00 EDT 2015
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for dstreev@HDP.LOCAL to go to krbtgt/HDP.LOCAL@HDP.LOCAL expiring on Sat Oct 03 17:52:00 EDT 2015
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: same realm
Using builtin default etypes for default_tgs_enctypes
default etypes for default_tgs_enctypes: 18 17 16 23.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
>>> KdcAccessibility: reset
>>> KrbKdcReq send: kdc=m3.hdp.local UDP:88, timeout=30000, number of retries =3, #bytes=654
>>> KDCCommunication: kdc=m3.hdp.local UDP:88, timeout=30000,Attempt =1, #bytes=654
>>> KrbKdcReq send: #bytes read=637
>>> KdcAccessibility: remove m3.hdp.local
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
>>> KrbApReq: APOptions are 00100000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
Krb5Context setting mySeqNumber to: 227177742
Created InitSecContextToken:
0000: 01 00 6E 82 02 3C 30 82 02 38 A0 03 02 01 05 A1 ..n..<0..8......
0010: 03 02 01 0E A2 07 03 05 00 20 00 00 00 A3 82 01 ......... ......
...
0230: 99 AC EE FB DF 86 B5 2A 19 CB A1 0B 8A 8E F7 9B .......*........
0240: 81 08 ..
Entered Krb5Context.initSecContext with state=STATE_IN_PROCESS
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
Krb5Context setting peerSeqNumber to: 40282898
Krb5Context.unwrap: token=[05 04 01 ff 00 0c 00 00 00 00 00 00 02 66 ab 12 01 01 00 00 8e 14 7a df 34 d7 c5 3d 5d d1 ce b5 ]
Krb5Context.unwrap: data=[01 01 00 00 ]
Krb5Context.wrap: data=[01 01 00 00 ]
Krb5Context.wrap: token=[05 04 00 ff 00 0c 00 00 00 00 00 00 0d 8a 75 0e 01 01 00 00 9c a5 73 25 59 0f b5 64 24 f0 a8 78 ]
Found 8 items
drwxrwxrwx - yarn hadoop 0 2015-09-28 15:55 /app-logs
drwxr-xr-x - hdfs hdfs 0 2015-09-28 15:57 /apps
drwxr-xr-x - hdfs hdfs 0 2015-09-28 15:53 /hdp
drwxr-xr-x - mapred hdfs 0 2015-09-28 15:53 /mapred
drwxrwxrwx - mapred hadoop 0 2015-09-28 15:54 /mr-history
drwxr-xr-x - hdfs hdfs 0 2015-09-28 19:20 /ranger
drwxrwxrwx - hdfs hdfs 0 2015-09-29 13:09 /tmp
drwxr-xr-x - hdfs hdfs 0 2015-10-02 17:51 /user
➜ conf
Since I can get to the cluster and interact with it from a host that hasn't been configured by the IPA client, I'm pretty sure something in my IPA client environment is off. Any idea where to look in IPA to fix this for the hosts that are part of the IPA environment?
10-02-2015
09:35 PM
The KMS is also backed by a DB. Make sure the KMS DB user/host combination has been granted permissions in MySQL (if that's what you're using).
10-02-2015
09:29 PM
Check the permissions (on the Oozie server) of the files in the directory below. When updates are made to Hive, these files are updated. This is fairly new with Oozie: these files are automatically part of the path, so you don't need to add "hive"-related site files to your Oozie workflows when doing Hive actions.

/etc/oozie/conf/action-conf/hive
10-02-2015
09:20 PM
2 Kudos
Example Maven section:

<repositories>
<repository>
<releases>
<enabled>true</enabled>
<updatePolicy>always</updatePolicy>
<checksumPolicy>warn</checksumPolicy>
</releases>
<snapshots>
<enabled>false</enabled>
<updatePolicy>never</updatePolicy>
<checksumPolicy>fail</checksumPolicy>
</snapshots>
<id>HDPReleases</id>
<name>HDP Releases</name>
<url>http://repo.hortonworks.com/content/repositories/releases/</url>
<layout>default</layout>
</repository>
...
10-02-2015
09:15 PM
1 Kudo
I think you've set the value in the wrong place. Try adding the value to your job.properties file and try again. In this case, you embedded it in an action, when I believe it's meant to be a job level variable.
10-02-2015
09:08 PM
8 Kudos
Goal
Create and set up IPA keytabs for HDP. Tested on CentOS 7 with IPA 4, HDP 2.3.0 and Ambari 2.1.1.

Assumptions

- Ambari does NOT currently (2.1.x) support the automatic generation of keytabs against IPA.
- The IPA server is already installed and IPA clients have been installed/configured on all HDP cluster nodes.
- If you are using Accumulo, you need to create the user 'tracer' in IPA; a keytab for this user will be requested for the Accumulo Tracer.
- We'll use Ambari's 'resources' directory to simplify the distribution of scripts and keys.

Enable Kerberos using the wizard

In Ambari, start the security wizard by clicking Admin -> Kerberos and click Enable Kerberos. Then select the "Manage Kerberos principals and key tabs manually" option and enter your realm.

(Optional; read on before doing this...) Remove the cluster name from the smoke/hdfs principals, dropping the -${cluster_name} references so they look like below:

- smoke user principal: ${cluster-env/smokeuser}@${realm}
- HDFS user principal: ${hadoop-env/hdfs_user}@${realm}
- HBase user principal: ${hbase-env/hbase_user}@${realm}
- Spark user principal: ${spark-env/spark_user}@${realm}
- Accumulo user principal: ${accumulo-env/accumulo_user}@${realm}
- Accumulo Tracer User: tracer@${realm}

If you don't remove the "cluster-name" from the above, Ambari will generate/use principal names that are specific to your cluster. This can be very important if you are supporting multiple clusters with the same IPA implementation.

On the next page, download the csv file but DO NOT click Next yet!

If you are deploying Storm, the storm user may be missing from the storm USER row. If you see something like:

storm@HORTONWORKS.COM,USER,,/etc

replace the ,, with ,storm, so it reads:

storm@HORTONWORKS.COM,USER,storm,/etc

Copy the csv to the Ambari Server and place it in the /var/lib/ambari-server/resources directory, making sure to remove the header and any empty lines at the end.

vi kerberos.csv

From any IPA client node, create principals and service accounts using the csv file.

## authenticate
kinit admin
AMBARI_HOST=<ambari_host>
# Get the kerberos.csv file from Ambari
wget http://${AMBARI_HOST}:8080/resources/kerberos.csv -O /tmp/kerberos.csv
# Create IPA service entries.
awk -F"," '/SERVICE/ {print "ipa service-add --force "$3}' /tmp/kerberos.csv | sort -u > ipa-add-spn.sh
sh ipa-add-spn.sh
# Create IPA User accounts
awk -F"," '/USER/ {print "ipa user-add "$5" --first="$5" --last=Hadoop --shell=/sbin/nologin"}' /tmp/kerberos.csv | sort | uniq > ipa-add-upn.sh
sh ipa-add-upn.sh
On the Ambari node, authenticate and create the keytabs for the headless user accounts and initialize the service keytabs.

## authenticate
sudo echo '<kerberos_password>' | kinit --password-file=STDIN admin
## or (IPA 4)
sudo echo '<kerberos_password>' | kinit -X password-file=STDIN admin
The following should be run as root (adjust for your Ambari host/port).

AMBARI_HOST_PORT=<ambari_host>
wget http://${AMBARI_HOST_PORT}/resources/kerberos.csv -O /tmp/kerberos.csv
ipa_server=$(grep server /etc/ipa/default.conf | awk -F= '{print $2}')
if [ "${ipa_server}X" == "X" ]; then
ipa_server=$(grep host /etc/ipa/default.conf | awk -F= '{print $2}')
fi
if [ -d /etc/security/keytabs ]; then
mv -f /etc/security/keytabs /etc/security/keytabs.`date +%Y%m%d%H%M%S`
fi
mkdir -p /etc/security/keytabs
chown root:hadoop /etc/security/keytabs/
if [ ! -d /var/lib/ambari-server/resources/etc/security/keytabs ]; then
mkdir -p /var/lib/ambari-server/resources/etc/security/keytabs
fi
grep USER /tmp/kerberos.csv | awk -F"," '{print "ipa-getkeytab -s '${ipa_server}' -p "$3" -k "$6";chown "$7":"$9,$6";chmod "$11,$6}' | sort -u > gen_keytabs.sh
# Copy the 'user' keytabs to the Ambari Resources directory for distribution.
echo "cp -f /etc/security/keytabs/*.* /var/lib/ambari-server/resources/etc/security/keytabs/" >> gen_keytabs.sh
# ReGenerate Keytabs for all the required Service Account, EXCEPT for the HTTP service account on the IPA Server host.
grep SERVICE /tmp/kerberos.csv | awk -F"," '{print "ipa-getkeytab -s '${ipa_server}' -p "$3" -k "$6";chown "$7":"$9,$6";chmod "$11,$6}' | sort -u | grep -v HTTP\/${ipa_server} >> gen_keytabs.sh
# Allow the 'admins' group to retrieve the keytabs.
grep SERVICE /tmp/kerberos.csv | awk -F"," '{print "ipa service-allow-retrieve-keytab "$3" --group=admins"}' | sort -u >> gen_keytabs.sh
bash ./gen_keytabs.sh
# Now remove the keytabs, they'll be replaced by the distribution phase.
mv -f /etc/security/keytabs /etc/security/genedkeytabs.`date +%Y%m%d%H%M%S`
mkdir /etc/security/keytabs
chown root:hadoop /etc/security/keytabs
Build a distribution script used to create host-specific keytabs (adjust for your Ambari host/port).

vi retrieve_keytabs.sh
# Set the location of Ambari
AMBARI_HOST_PORT=<ambari_host>
# Retrieve the kerberos.csv file from the wizard
wget http://${AMBARI_HOST_PORT}/resources/kerberos.csv -O /tmp/kerberos.csv
ipa_server=$(grep server /etc/ipa/default.conf | awk -F= '{print $2}')
if [ "${ipa_server}X" == "X" ]; then
ipa_server=$(grep host /etc/ipa/default.conf | awk -F= '{print $2}')
fi
if [ ! -d /etc/security/keytabs ]; then
mkdir -p /etc/security/keytabs
fi
chown root:hadoop /etc/security/keytabs/
# Retrieve WITHOUT recreating the existing keytabs for each account.
grep USER /tmp/kerberos.csv | awk -F"," '/'$(hostname -f)'/ {print "wget http://'$( echo ${AMBARI_HOST_PORT})'/resources"$6" -O "$6";chown "$7":"$9,$6";chmod "$11,$6}' | sort -u > get_host_keytabs.sh
grep SERVICE /tmp/kerberos.csv | awk -F"," '/'$(hostname -f)'/ {print "ipa-getkeytab -s '$(echo $ipa_server)' -r -p "$3" -k "$6";chown "$7":"$9,$6";chmod "$11,$6}' | sort -u >> get_host_keytabs.sh
bash ./get_host_keytabs.sh
Copy the file to the Ambari Server's resources directory for distribution.

scp retrieve_keytabs.sh root@<ambari_host>:/var/lib/ambari-server/resources

On each HDP node (via 'pdsh'), authenticate and create the keytabs.

# Should be logging in as 'root'
pdsh -g <host_group> -l root
> ## authenticate as the KDC Admin
> echo '<kerberos_password>' | kinit --password-file=STDIN admin
> # or (for IPA 4)
> echo '<kerberos_password>' | kinit -X password-file=STDIN admin
> wget http://<ambari_host>:8080/resources/retrieve_keytabs.sh -O /tmp/retrieve_keytabs.sh
> bash /tmp/retrieve_keytabs.sh
> ## Verify kinit works before proceeding (should not give errors)
> # Service Account Check (replace REALM with yours)
> sudo -u hdfs kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@HORTONWORKS.COM
> # Headless Check (check on multiple hosts)
> sudo -u ambari-qa kinit -kt /etc/security/keytabs/smokeuser.headless.keytab ambari-qa@HORTONWORKS.COM
> sudo -u hdfs kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@HORTONWORKS.COM
WARNING: The IPA UI may be broken after the above procedure. When the process to build keytabs for services is run on the same host that IPA lives on, it will invalidate the keytab used by Apache HTTPD to authenticate. I've added a step that should eliminate the "re"-creation of that keytab, but just in case, replace /etc/httpd/conf/ipa.keytab with /etc/security/keytabs/spnego.service.keytab:

cd /etc/httpd/conf
mv ipa.keytab ipa.keytab.orig
cp /etc/security/keytabs/spnego.service.keytab ipa.keytab
chown apache:apache ipa.keytab
service httpd restart
Remove the headless.keytabs.tgz file from /var/lib/ambari-server/resources on the Ambari Server. Press Next on the security wizard and proceed to stop services, then proceed with the next steps of the wizard by clicking Next. Once completed, click Complete, and the cluster is now kerberized.

Using your Kerberized cluster

Try to run commands without authenticating to Kerberos:

$ hadoop fs -ls /
15/07/15 14:32:05 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
$ curl -u someuser -skL "http://$(hostname -f):50070/webhdfs/v1/user/?op=LISTSTATUS"
<title>Error 401 Authentication required</title>
Get a token:

## for the current user
sudo su - gooduser
kinit
## for any other user
kinit someuser
Use the cluster

Hadoop commands:

$ hadoop fs -ls /
Found 8 items
[...]
WebHDFS:

## note the addition of `--negotiate -u : `
curl -skL --negotiate -u : "http://$(hostname -f):50070/webhdfs/v1/user/?op=LISTSTATUS"
Hive (using Beeline or another Hive JDBC client)
Hive in Binary mode (the default):

beeline -u "jdbc:hive2://localhost:10000/default;principal=hive/$(hostname -f)@HORTONWORKS.COM"

Hive in HTTP mode:

## note the update to use HTTP and the need to provide the kerberos principal.
beeline -u "jdbc:hive2://localhost:10001/default;transportMode=http;httpPath=cliservice;principal=HTTP/$(hostname -f)@HORTONWORKS.COM"
Thank you to @abajwa@hortonworks.com (Ali Bajwa) for his original workshop, which this is intended to extend: https://github.com/abajwa-hw/security-workshops
10-02-2015
08:51 PM
5 Kudos
The Phoenix documentation here leaves out a few pieces needed to make a successful connection to HBase through the Phoenix driver. It assumes that the connection is from 'localhost', which may work great, but is unlikely in the real world.

Required Jars
- phoenix-client.jar
- hbase-client.jar (not mentioned in the Phoenix docs)

URL Details

The full JDBC connection URL syntax for Phoenix is:

Basic Connections

jdbc:phoenix[:zk_quorum][:zk_port][:zk_hbase_path]

The "zk_quorum" is a comma-separated list of the ZooKeeper servers. The "zk_port" is the ZooKeeper port. The "zk_hbase_path" is the path used by HBase to store information about the instance. On a non-kerberized cluster, the default zk_hbase_path for HDP is '/hbase-unsecure'.

Sample JDBC URL

jdbc:phoenix:m1.hdp.local,m2.hdp.local,d1.hdp.local:2181:/hbase-unsecure
For Kerberos Cluster Connections
jdbc:phoenix[:zk_quorum][:zk_port][:zk_hbase_path][:headless_keytab_file:principal]
Sample JDBC URL
jdbc:phoenix:m1.hdp.local,m2.hdp.local,d1.hdp.local:2181:/hbase-secure:/Users/dstreev/keytabs/myuser.headless.keytab:dstreev@HDP.LOCAL

They say that the items in the URL beyond 'phoenix' are optional if the cluster's 'hbase-site.xml' file is on the classpath. I found that when I provided a copy of the 'hbase-site.xml' from the cluster and left off the optional elements, I got errors referencing 'Failed to create local dir'. When I connected successfully from DBVisualizer to HBase with the Phoenix JDBC driver, it took about 10-20 seconds to connect.
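As a quick sanity check outside of a JDBC client, the sqlline utility bundled with the Phoenix client can be pointed at the same quorum; the install path below is a typical HDP location and may differ on your cluster:

# Connect with the same quorum/port/hbase path used in the JDBC URL above
/usr/hdp/current/phoenix-client/bin/sqlline.py m1.hdp.local,m2.hdp.local,d1.hdp.local:2181:/hbase-unsecure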