Member since: 07-30-2019
Posts: 53
Kudos Received: 136
Solutions: 16
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6561 | 01-30-2017 05:05 PM |
| | 3054 | 01-13-2017 03:46 PM |
| | 1592 | 01-09-2017 05:36 PM |
| | 827 | 01-09-2017 05:29 PM |
| | 714 | 10-07-2016 03:34 PM |
10-07-2016
03:04 PM
4 Kudos
The journal nodes can be quite IO intensive, while the Namenode is generally more memory and CPU intensive, so one could justify co-locating them. BUT when it comes to checkpointing, they can conflict. More importantly, delays in writing for the Journal Node will impact the Namenode and result in higher RPC queue times.

With a cluster that size, I would always want to run the Namenode by itself. It's far too important to compromise it by co-locating it with another highly active service.

And regarding the Journal Node: don't store the journal directories on an LVM volume that's shared with the OS. Again, the Journal Node is quite IO intensive, and I've seen it project slowness back to the Namenode (in RPC queue times) when it's competing with the OS for the same physical disks.
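As a quick check (a sketch only; the journal path shown is an example, pull the real one from your own config), confirm the JournalNode edits directory lives on its own physical device rather than the OS volume:

```bash
# Ask the live config where the JournalNode writes its edits
hdfs getconf -confKey dfs.journalnode.edits.dir

# Then verify that directory is backed by a dedicated mount,
# not the root/OS LVM volume (path below is an example)
df -h /hadoop/hdfs/journal
lsblk   # confirm the underlying device is separate from the OS disks
```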
03-22-2016
09:51 PM
4 Kudos
Repo Description
Sessions remember directory context, and the CLI adds 'tab' completion, Kerberos support, initialization scripts, and a few new 'hdfs' features you wish you had. New extensions help gather runtime statistics from the Namenode, Scheduler, Job History Server, and Container Usage. Added support for "lsp", a directory listing "plus" that shows file information PLUS block information and location. Helpful when determining how well your data is distributed across the cluster and for identifying small file issues.
Repo Info
Github Repo URL: https://github.com/dstreev/hadoop-cli
Github account name: dstreev
Repo name: hadoop-cli
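A rough usage sketch (the 'hadoopcli' launcher name and the sample path are assumptions here; see the repo's README for the exact invocation and options):

```bash
# Start the interactive shell
hadoopcli

# Inside the shell, 'lsp' lists files PLUS block counts and locations,
# which makes data-distribution skew and small-file problems easy to spot
lsp /user/dstreev/data
```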
11-17-2015
02:22 PM
21 Kudos
There are five ways to connect to HS2 with JDBC:
1. Direct - Binary Transport mode (Non-Secure|Secure)
2. Direct - HTTP Transport mode (Non-Secure|Secure)
3. ZooKeeper - Binary Transport mode (Non-Secure|Secure)
4. ZooKeeper - HTTP Transport mode (Non-Secure|Secure)
5. Via Knox - HTTP Transport mode
Connecting to HS2 via ZooKeeper (3-4) (and Knox, if backed by ZooKeeper) provides a level of failover that you can't get directly. When connecting through ZooKeeper, the client is handed server connection information from a list of available servers. This list is managed on the backend, and the client isn't aware of it before the connection. This allows administrators to add additional servers to the list without reconfiguring the clients.

NOTE: HS2 in this configuration provides failover, but it is not automatic once a connection has been established. JDBC connections are stateful; the data and session information kept on HS2 for a connection is LOST when the server goes down, and jobs currently in progress will be affected. You will need to "reconnect" to continue, at which point you can resubmit your job. Once an HS2 instance goes down, ZooKeeper will not forward connection requests to that server, so by reconnecting after an HS2 failure you will land on a working HS2 instance.

URL Syntax
jdbc:hive2://zookeeper_quorum|hs2_host:port/[db][;principal=<hs2_principal>/<hs2_host>|_HOST@<KDC_REALM>][;transportMode=binary|http][;httpPath=<http_path>][;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=<zk_namespace>][;ssl=true|false][;sslKeyStore=<key_store_path>][;keyStorePassword=<key_store_password>][;sslTrustStore=<trust_store_path>][;trustStorePassword=<trust_store_password>][;twoWay=true|false]
Assumptions:
- HS2 Host(s): m1.hdp.local and m2.hdp.local
- HS2 Binary Port: 10010
- HS2 HTTP Port: 10011
- ZooKeeper Quorum: m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181
- HttpPath: cliservice
- HS2 ZooKeeper Namespace: hiveserver2
- User: barney
- Password: bedrock

NOTE: <db> in the examples below is the target database and is optional. The leading slash '/' is required.

WARNING: When using 'beeline' and specifying the connection URL (-u) on the command line, be sure to quote the URL.

Non-Secure Environments

Direct - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10010/<db>"

Direct - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10011/<db>;transportMode=http;httpPath=cliservice"

ZooKeeper - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=hiveserver2"

ZooKeeper - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=hiveserver2;transportMode=http;httpPath=cliservice"

Alternate Connectivity Through Knox (a beeline example for this URL follows the secure examples below)
jdbc:hive2://<knox_host>:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=<password>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/<CLUSTER>/hive

Secure Environments

Additional Assumptions:
- KDC Realm: HDP.LOCAL
- HS2 Principal: hive

The 'principal' used in the examples below can contain either the FQDN of the HS2 host or '_HOST'. '_HOST' is replaced globally based on your Kerberos configuration, provided you haven't altered the default Kerberos regex patterns in ...

NOTE: The client is required to 'kinit' before connecting through JDBC. The -n and -p (user/password) options aren't necessary; authentication is handled by the Kerberos ticket principal.

Direct - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10010/<db>;principal=hive/_HOST@HDP.LOCAL"

Direct - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10011/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"

ZooKeeper - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@HDP.LOCAL"

ZooKeeper - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=hiveserver2;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"
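Going back to the Knox option above, the same URL can be handed to beeline directly. A sketch, with knox.hdp.local standing in for your Knox host (the truststore password and <CLUSTER> topology name are placeholders, as in the URL above):

```bash
beeline -n barney -p bedrock -u "jdbc:hive2://knox.hdp.local:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=<password>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/<CLUSTER>/hive"
```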
11-17-2015
10:20 AM
1 Kudo
Almost forgot... Another side effect of small files on a cluster shows up while running the Balancer: you'll move LESS data and increase the impact on the Namenode even further. So YES, small files are bad. But only as bad as you're willing to "pay" for, in most cases. If you don't mind doubling the size of the cluster to accommodate those small files, then do it. But I wouldn't. I'd do a bit of planning and refinement and bank the extra nodes for more interesting projects. 🙂
11-17-2015
10:15 AM
22 Kudos
I've seen several systems with 400+ million objects represented in the Namenode without issues. In my opinion, that's not the "right" question though. Certainly, the classic answer to small files has been the pressure they put on the Namenode, but that's only part of the equation. And with better hardware, more CPU, and increased memory thresholds, that number has certainly climbed over the years since the small file problem was first documented.

The better question is: how do small files "impact" cluster performance?

Everything is a trade-off when dealing with data at scale. The impact of small files, beyond the Namenode pressure, is more specifically related to "job" performance. Under classic MR, the number of small files controls the number of mappers required to perform a job. Of course, there are tricks to "combine" inputs and reduce this, but that leads to a lot of data movement across the backplane and increased cluster I/O chatter. A mapper, in the classic sense, is a costly resource to allocate. If the actual task done by the mapper is rather mundane, most of the time spent accomplishing your job can be "administrative" in nature, with the construction and management of all those resources.

Consider the impact to a cluster when this happens. For example, I had a client once that was trying to get more from their cluster, but there was a job processing 80,000 files. Which led to the creation of 80,000 mappers. Which led to consuming ALL the cluster resources, several times over. Follow that path a bit further and you'll find that the impact on the Namenode is exacerbated by all of the intermediate files generated by the mappers for the shuffle/sort phases. That's the real impact on a cluster. A little work in the beginning can have a dramatic effect on the downstream performance of your jobs. Take the time to "refine" your data and consolidate your files.

Here's another way to approach it, which is even more evident when dealing with ORC files. Processing a 1 MB file has an overhead to it. So processing 128 1 MB files will cost you 128 times more "administrative" overhead, versus processing one 128 MB file. In plain text, that 1 MB file may contain 1,000 records; the 128 MB file might contain 128,000 records. And I've typically seen an 85-92% compression ratio with ORC files, so you could safely say that a 128 MB ORC file contains over 1 million records. Sidebar: which may have been why the default stripe size in ORC was changed to 64 MB, instead of 128 MB, a few versions back.

The impact is multi-fold. With data locality, you move less data, process larger chunks of data at a time, generate fewer intermediate files, reduce impact to the Namenode, and increase throughput overall, EVERYWHERE. The system moves away from being I/O bound to being CPU bound. Now you have the opportunity to tune container sizes to match "what" you're doing, because the container is actually "doing" a lot of work processing your data and not "managing" the job.

Sometimes small files can't be avoided, but deal with them early to limit the repetitive impact on your cluster. Here's a list of general patterns to reduce the number of small files:

- Nifi - Use a combine processor to consolidate flows and aggregate data before it even gets to your cluster.
- Flume - Use a tiered Flume architecture to combine events from multiple inputs, producing "right" sized HDFS files for further refinement.
- Hive - Process the small files regularly and often to produce larger files for "repetitive" processing. And in a classic pattern that incrementally "appends" to a dataset, creating a LOT of files over time, don't be afraid to go back and "reprocess" the file set to streamline the impact on downstream tasks.
- Sqoop - Manage the number of mappers to generate appropriately sized files.
- Oh, and if you NEED to keep those small files as "sources"... Archive them using Hadoop archives ('har') and save your Namenode from the cost of managing all those objects (see the sketch below).
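A minimal sketch of that last pattern (paths and the archive name are placeholders): the stock 'hadoop archive' tool rolls a directory of small files into a single HAR, so the Namenode tracks far fewer objects while the data stays readable in place.

```bash
# Bundle everything under /data/raw/small-files into one archive;
# -p is the parent path, the last argument is the output directory.
hadoop archive -archiveName raw-2015.har -p /data/raw small-files /data/archive

# The archived content is still accessible through the har:// scheme.
hdfs dfs -ls har:///data/archive/raw-2015.har
```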
10-14-2015
09:34 AM
1 Kudo
Use NiFi to get the data to HDFS and then Oozie datasets to trigger actions based on data availability. Until NiFi, various versions of the method you describe were common practice.
10-14-2015
09:14 AM
2 Kudos
Try creating a temporary database and moving the table 'as is' into the new database:
CREATE DATABASE IF NOT EXISTS Junk;
USE targetDB;
ALTER TABLE MyCorruptTable RENAME TO Junk.MyMovedCorruptTable;
DROP DATABASE Junk CASCADE;
10-14-2015
08:47 AM
1 Kudo
You need to increase the memory settings for Ambari. I ran into this a while back with certain views. I added/adjusted the following in /var/lib/ambari-server/ambari-env.sh, under "AMBARI_JVM_ARGS":
-Xmx4G -XX:MaxPermSize=512m
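A sketch of what that edit might look like (your existing AMBARI_JVM_ARGS line will already carry other flags, so adjust the values in place rather than blindly appending):

```bash
# In /var/lib/ambari-server/ambari-env.sh, adjust the AMBARI_JVM_ARGS line,
# e.g. (existing flags on that line will differ on your install):
export AMBARI_JVM_ARGS="$AMBARI_JVM_ARGS -Xmx4G -XX:MaxPermSize=512m"

# Then restart Ambari so the new settings take effect:
ambari-server restart
```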
10-13-2015
03:07 PM
1 Kudo
It should be GRANT ALL on just its Oozie database, because the 'oozie' user needs to be able to create the schema in the target database.
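A sketch of the grants, assuming a MySQL backing store with a database and user both named 'oozie' (names and password are placeholders; IDENTIFIED BY in GRANT is MySQL 5.x syntax):

```bash
# Grant the oozie user full rights on its own database only
mysql -u root -p -e "
CREATE DATABASE IF NOT EXISTS oozie;
GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'ooziePassword';
FLUSH PRIVILEGES;"
```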
10-13-2015
02:53 PM
1 Kudo
Using that key and signed into the Ambari server host as 'root', can you SSH to the target hosts from the command line? If you can't, double-check the permissions of the "public" key on the target hosts: ~/.ssh should be 700 and the authorized_keys file should be 600.
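A quick way to test and fix that (key path and hostname are placeholders):

```bash
# From the Ambari server, as root: does key-based login work?
ssh -i /root/.ssh/id_rsa root@target-host.hdp.local hostname

# On each target host, tighten the permissions sshd expects
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```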