Member since: 07-30-2019
Posts: 53
Kudos Received: 135
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2480 | 01-30-2017 05:05 PM
 | 1318 | 01-13-2017 03:46 PM
 | 677 | 01-09-2017 05:36 PM
 | 348 | 01-09-2017 05:29 PM
 | 250 | 10-07-2016 03:34 PM
06-13-2018
10:13 AM
You could run TeraGen/TeraSort for this. There's a script on my gist.github.com page that can be run against an HDP cluster; it lets you control the data size, mapper and reducer counts from the command line, and even experiment with block sizes.
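For reference, a minimal sketch of the kind of run such a script wraps (the examples jar path is typical for an HDP install; sizes and counts are illustrative):

```
# TeraGen: write 10,000,000,000 rows of 100 bytes each (~1 TB),
# controlling the mapper count and block size
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen \
  -Dmapreduce.job.maps=100 -Ddfs.blocksize=268435456 \
  10000000000 /benchmarks/teragen

# TeraSort: sort the generated data, controlling the reducer count
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort \
  -Dmapreduce.job.reduces=100 \
  /benchmarks/teragen /benchmarks/terasort
```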
04-04-2018
11:42 AM
Does the user have access (at the file system level) to the warehouse directory you've specified? The docs seem to indicate that 'spark.sql.warehouse.dir' is optional when Hive is already present and you're attaching to a metastore: --- Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. --- Try omitting that setting from your application.
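In the meantime, a quick way to check file-system-level access (the path shown is an example; use the one you configured):

```
# Does the submitting user own, or have write access to, the warehouse path?
hdfs dfs -ls -d /apps/hive/warehouse
hdfs dfs -getfacl /apps/hive/warehouse
```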
04-04-2018
11:31 AM
Are all the nodes sharing the same user/group mapping? The NameNode is responsible for doing the group lookup for the user, so if the user/group mapping isn't present there, your results will not match.
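You can compare the two mappings directly (the user name is a placeholder):

```
# Groups as resolved by the NameNode (this is what authorization actually uses)
hdfs groups alice
# Groups as resolved locally on the client/edge node
id alice
```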
06-29-2017
05:53 PM
2 Kudos
That will only set it for newly created directories. Using the HDFS client, set the replication factor for the directory to the new value.
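A minimal sketch (the factor and path are examples):

```
# Re-apply the new replication factor recursively to existing files;
# -w waits for the re-replication to complete
hdfs dfs -setrep -R -w 2 /data/mydir
```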
03-06-2017
02:08 PM
1 Kudo
The packaging isn't any different, although the embedded directory structure may not match the public repo. Every company I've worked with that manages local repos has its own directory structure anyhow. You're more than welcome to alter that directory structure to fit your environment.
01-30-2017
05:05 PM
Rahul, are the logs making it to HDFS? It sounds like you might be conflating the "spooling" directory with the "local audit archive directory". What properties did you use during the Ranger HDFS plugin installation? Are you doing a manual install or using Ambari? If manual, this reference might help: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_command-line-installation/content/installing_ranger_plugins.html#installing_ranger_hdfs_plugin I wasn't able to locate your "...filespool.archive.dir" property on my cluster, so I'm not sure the property is required; it may also be what's keeping local copies of the files you've already posted to HDFS. If the files are making it to HDFS, I would try removing this setting. What do you have set for the property below, and are the contents being flushed from that location on a regular basis? xasecure.audit.destination.hdfs.batch.filespool.dir Compression doesn't happen during this process. Once the files are on HDFS, you're free to do with them as you see fit; if compression is a part of that, write an MR job to do so. (WARNING: this could affect other systems that might want to use these files as-is.) Cheers, David
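For reference, these are the spool-related audit settings I would review side by side; the property names follow the plugin's audit configuration as I recall it, and the values are examples only:

```
xasecure.audit.destination.hdfs=true
xasecure.audit.destination.hdfs.dir=hdfs://<nn_host>:8020/ranger/audit
xasecure.audit.destination.hdfs.batch.filespool.dir=/var/log/hadoop/hdfs/audit/hdfs/spool
```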
01-27-2017
01:29 PM
Those are intermediate directories used to store the stream of activity locally before it's written to HDFS. You should have destination directories in HDFS for the final resting place. In my experience, when this issue happens and you don't see those directories in HDFS, it's either a permissions issue or the directories simply need to be created manually. Create the directories in HDFS by hand and ensure they have the proper ACLs to allow the process to write to them.
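A minimal sketch of creating the destination by hand (the path and owner are examples; match them to your audit configuration):

```
hdfs dfs -mkdir -p /ranger/audit/hdfs
hdfs dfs -chown hdfs:hdfs /ranger/audit/hdfs
hdfs dfs -chmod 750 /ranger/audit/hdfs
```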
01-13-2017
03:46 PM
distcp recognizes the s3[a] protocols from the default libraries already available in Hadoop. For example, moving data from Hadoop to S3: hadoop distcp <current_cluster_folder> s3[a]://<bucket_info> If you're looking for ways to manage access (via AWS keys) to S3 buckets in Hadoop, this article describes a great, secure way to do that: https://community.hortonworks.com/articles/59161/using-hadoop-credential-api-to-store-aws-secrets.html
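A concrete sketch (the bucket and paths are examples; the AWS credentials come from your configuration or credential provider, per the article above):

```
hadoop distcp /data/warehouse/export s3a://my-example-bucket/warehouse/export
```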
01-13-2017
03:23 PM
Jacqualin, yes, the local dir and log dir both support multiple locations, and I advise using multiple locations to scale better. These directories aren't in HDFS and therefore don't get HDFS replication, but that's OK; they're used for file caches and intermediate data. If you lose a drive in the middle of processing, only the "task" using it is affected, and it may fail. In that case, the task is rescheduled somewhere else, so the job as a whole survives. A failed drive in yarn_local_dir is OK, as the NodeManager will tag it and not use it going forward; one more reason to have more than one drive specified here. BUT in older versions of YARN, a failed drive can prevent the NodeManager from "starting" or "restarting." It's pretty clear in the NodeManager logs if you have issues with it starting at any time. YARN also indicates drive failures in the ResourceManager UI. Newer versions of YARN are a bit more forgiving on startup.
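A sketch of what multiple locations look like in yarn-site.xml (the mount points are examples):

```
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/0/yarn/local,/grid/1/yarn/local,/grid/2/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/grid/0/yarn/log,/grid/1/yarn/log,/grid/2/yarn/log</value>
</property>
```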
01-09-2017
05:36 PM
1 Kudo
Having multiple values here allows for better scalability and performance for YARN's intermediate writes/reads. Much as HDFS uses multiple directories (preferably on different mount points/physical drives), the YARN local dirs can spread the I/O load. I've also seen a trend where customers use SSD drives for the YARN local dirs, which can significantly improve job performance. E.g., on a 12-drive system: 8 SATA drives for HDFS directories and 4 smaller, fast SSD drives for the YARN local dirs.
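Sketching that 12-drive split in the config files (the mount points are examples):

```
<!-- hdfs-site.xml: eight SATA drives for HDFS block storage -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/grid/0/hdfs/data,/grid/1/hdfs/data,/grid/2/hdfs/data,/grid/3/hdfs/data,/grid/4/hdfs/data,/grid/5/hdfs/data,/grid/6/hdfs/data,/grid/7/hdfs/data</value>
</property>
<!-- yarn-site.xml: four SSDs for the YARN local dirs -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/ssd/0/yarn/local,/ssd/1/yarn/local,/ssd/2/yarn/local,/ssd/3/yarn/local</value>
</property>
```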
01-09-2017
05:29 PM
Could you please identify which version of Ambari you are running? In these situations, I usually drop down to the host that is presenting the issue and try to run the command on the host; this may provide a bit more detail on the actual issue. In this case, you may find that you need to remove the offending package with yum erase <specific_package>, then have Ambari try to reinstall the packages.
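For example (the package names are placeholders):

```
# On the failing host, see what's already (partially) installed,
# remove the conflict, then retry the install from Ambari
yum list installed | grep -i <package_prefix>
yum erase <specific_package>
```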
12-15-2016
01:40 PM
9 Kudos
The Problem

A traditional 'distcp' from one directory to another, or from cluster to cluster, is quite useful for moving massive amounts of data, once. But what happens when you need to "update" a target directory or cluster with only the changes made since the last 'distcp' ran? That becomes a very tricky scenario. 'distcp' offers an '-update' flag, which is supposed to move only the files that have changed. In this case, 'distcp' pulls a list of files and directories from the source and the target, compares them, and then builds a migration plan. The problem: it's an expensive and time-consuming task, and the process is not atomic. First, the cost of gathering a list of files and directories, along with their metadata, is high when you're considering sources with millions of file and directory objects. This cost is incurred on both the source and target NameNodes, putting quite a bit of pressure on those systems. It's then up to 'distcp' to reconcile the differences between the source and target, which is very expensive. Only when that's finally complete does the process start to move data. And if data changes while the process is running, those changes can impact the transfer and lead to failure and partial migration.

The Solution

The process needs to be atomic, and it needs to be efficient. With Hadoop 2.0, HDFS introduced "snapshots." An HDFS snapshot is a point-in-time copy of a directory's metadata. The copy is stored in a hidden location and maintains references to all of the immutable filesystem objects. Creating a snapshot is atomic, and because HDFS files are immutable, an image of a directory's metadata doesn't require an additional copy of the underlying data. Another feature of snapshots is the ability to efficiently calculate the changes between 'any' two snapshots of the same directory. Using 'hdfs snapshotDiff', you can build a list of "changes" between these two point-in-time references. For example:

```
[hdfs@m3 ~]$ hdfs snapshotDiff /user/dstreev/stats s1 s2
Difference between snapshot s1 and snapshot s2 under directory /user/dstreev/stats:
M       .
+       ./attempt
M       ./namenode/fs_state/2016-12.txt
M       ./namenode/nn_info/2016-12.txt
M       ./namenode/top_user_ops/2016-12.txt
M       ./scheduler/queue_paths/2016-12.txt
M       ./scheduler/queue_usage/2016-12.txt
M       ./scheduler/queues/2016-12.txt
```
Let's take the 'distcp' update concept and supercharge it with the efficiency of snapshots. Now you have a solution that scales far beyond the original 'distcp -update' and, in the process, removes the burden and load previously placed on the NameNodes.

Pre-Requisites and Requirements

- The source must support snapshots: hdfs dfsadmin -allowSnapshot <path>
- The target is "read-only".
- The target, after the initial baseline 'distcp' sync, needs to support snapshots.

Process

1. Identify the source and target 'parent' directories. Do not initially create the destination directory; allow the first distcp to do that. For example, if I want to sync source /data/a with /data/a_target, do *NOT* pre-create the 'a_target' directory.
2. Allow snapshots on the source directory: hdfs dfsadmin -allowSnapshot /data/a
3. Create a snapshot of /data/a: hdfs dfs -createSnapshot /data/a s1
4. Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to this command. hadoop distcp /data/a/.snapshot/s1 /data/a_target
5. Allow snapshots on the newly created target directory: hdfs dfsadmin -allowSnapshot /data/a_target At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
6. Create a matching snapshot in /data/a_target with the same name as the snapshot used to build the baseline: hdfs dfs -createSnapshot /data/a_target s1
7. Add some content to the source directory /data/a. Make the changes, adds, deletes, etc. that need to be replicated to /data/a_target.
8. Take a new snapshot of /data/a: hdfs dfs -createSnapshot /data/a s2
9. Just for fun, check what's changed between the two snapshots: hdfs snapshotDiff /data/a s1 s2
10. Now let's migrate the changes to /data/a_target: hadoop distcp -diff s1 s2 -update /data/a /data/a_target
11. When that's completed, finish the cycle by creating a matching snapshot on /data/a_target: hdfs dfs -createSnapshot /data/a_target s2

That's it. You've completed the cycle. Rinse and repeat; see the wrapper sketch below.

A Few Hints

- Remember, snapshots need to be managed manually. They will stay around forever unless you clean them up with: hdfs dfs -deleteSnapshot <path> <snapshotName>
- As long as a snapshot exists, the data exists. Deleting data (even with -skipTrash) from a directory that has a snapshot doesn't free up space; only when all "references" to that data are gone can the space be reclaimed.
- Initial migrations of data between systems are very expensive in terms of network I/O, and you probably don't want to have to do that again, ever. I recommend keeping a snapshot of the original copy on each system, OR some major checkpoint you can go back to, in case the process is compromised.
- If 'distcp' can't validate that the snapshots (by name) on the source and the target are the same, and that the data at the target hasn't changed since the snapshot, the process will fail. If the failure is because the target directory has been updated, you'll need to use the baseline snapshots above to restore it without migrating all that data again, and then start the process up again.
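To make the cycle repeatable, here's a minimal wrapper sketch, assuming the baseline (steps 1 through 6 above) is already in place and the name of the last snapshot that exists on BOTH sides is passed in; the paths and naming scheme are examples:

```
#!/usr/bin/env bash
set -e
SRC=/data/a
TGT=/data/a_target
PREV=$1                         # e.g. s1: last snapshot present on both source and target
NEXT=s$(date +%Y%m%d%H%M)       # name for this cycle's snapshot

hdfs dfs -createSnapshot "$SRC" "$NEXT"
hadoop distcp -diff "$PREV" "$NEXT" -update "$SRC" "$TGT"
hdfs dfs -createSnapshot "$TGT" "$NEXT"

# Once you no longer need $PREV as a restore point, reclaim the space:
# hdfs dfs -deleteSnapshot "$SRC" "$PREV"
# hdfs dfs -deleteSnapshot "$TGT" "$PREV"
```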
10-10-2016
12:38 PM
The dfs.replication setting is applied to a folder's contents at the time of their creation. In your case, the folder already carries the old setting, regardless of what you set in Ambari. You need to "reset" it for the directory, i.e.: 'hdfs dfs -setrep -R <rep> <dir>'
10-08-2016
01:35 PM
1 Kudo
The only other thing that comes to mind is the existence of the "XXX" user on the NameNode and/or their membership in the group "hadoop". If they aren't in the group "hadoop", you may find a setting in the hadoop-policy.xml file called security.client.protocol.acl that is set to "hadoop". This is a way to prevent users not in this group from accessing HDFS. Note that the user account must exist on the NameNode as well. When you submit a request from an edge node, where you obviously have the user account, the id (string version) is sent to the NameNode. The NameNode is responsible for "authorization" and does a group lookup of the user on the NameNode host. If the user doesn't exist there, OR their groups aren't the same as they were on the edge node where you launched the Hive command, you'll have issues like this.
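Two quick checks on the NameNode host ("XXX" being the anonymized user from above):

```
# Does the account exist on the NameNode, and is it in the 'hadoop' group?
id XXX
# What group does the service-level ACL gate access on?
grep -A1 security.client.protocol.acl /etc/hadoop/conf/hadoop-policy.xml
```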
10-07-2016
03:43 PM
Is the cluster Kerberized? And if so, are you running with a principal for the "XXX" user or the "YYY" user?
10-07-2016
03:34 PM
Enabling Ranger audits will show who made the SQL call and what query was issued to HS2. This is more "metadata" centric; the actual data transferred is not logged in any permanent fashion. That would be the responsibility of the client. But the combination of the audit (who and what), along with possibly an HDFS snapshot, can lead to a reproducible scenario.
10-07-2016
03:29 PM
2 Kudos
NameNode federation is the assignment of directories to a pair of HA NameNodes, like a mount point on Linux. The process of directing clients to the right NameNode is the responsibility of the HDFS client, and if you're using the standard HDFS libraries, having a "properly" configured hdfs-site.xml file in the path of the Java client will handle that transparently. This config file (hdfs-site.xml) contains all the NN HA and federation settings needed.
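A minimal client-side hdfs-site.xml sketch for one HA nameservice (host names are examples; a federated client simply lists additional nameservices and their address mappings):

```
<property><name>dfs.nameservices</name><value>ns1</value></property>
<property><name>dfs.ha.namenodes.ns1</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.ns1.nn1</name><value>m1.hdp.local:8020</value></property>
<property><name>dfs.namenode.rpc-address.ns1.nn2</name><value>m2.hdp.local:8020</value></property>
<property>
  <name>dfs.client.failover.proxy.provider.ns1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```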
10-07-2016
03:23 PM
2 Kudos
You can force the "replication" of under-replicated blocks by issuing the setrep command on the file/directory. I use this technique to accelerate the repair of under-replicated blocks before an upgrade attempt, to get to an optimal state. Otherwise, you're at the mercy of the NameNode to schedule the process.
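For example (the factor and path are illustrative; -w blocks until the files reach the target factor):

```
hdfs dfs -setrep -R -w 3 /apps/hive/warehouse
```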
10-07-2016
03:17 PM
1 Kudo
You'll have to re-apply the replication factor to the directories you're seeing this warning on. The dfs.replication setting is applied to directories/files at the time of creation, and if the cluster was initially built while the value was set to 3 (the default), then all the cluster's files and folders created at that time already have it applied. You'll need to reset it for those directories; the new replication factor will be picked up automatically for new files.
10-07-2016
03:04 PM
4 Kudos
The JournalNodes can be quite I/O intensive, while the NameNode is generally more memory and CPU intensive, so one could justify co-locating them. BUT when it comes to checkpointing, they can conflict. More importantly, delays in writing for the JournalNode will impact the NameNode and result in higher RPC queue times. With a cluster that size, I would always want to run the NameNode by itself; it's far too important to compromise it by co-locating it with another highly active service. And regarding the JournalNode: don't store the journal directories on an LVM volume that's shared with the OS. Again, the JournalNode is quite I/O intensive, and I've seen it project slowness back to the NameNode (in RPC queue times) when the two compete because they share the same physical disks.
03-22-2016
09:51 PM
4 Kudos
Repo Description
Sessions remember directory context, with 'tab' completion, Kerberos support, initialization scripts, and a few new 'hdfs' features you wish you had. New extensions help gather runtime statistics from the NameNode, Scheduler, Job History Server, and container usage. Added support for "lsp", a directory listing "plus" that identifies file information PLUS block information and location; helpful when determining how well your data is distributed across the cluster and for identifying small-file issues.
Repo Info
- Github Repo URL: https://github.com/dstreev/hadoop-cli
- Github account name: dstreev
- Repo name: hadoop-cli
11-17-2015
02:22 PM
21 Kudos
There are five ways to connect to HS2 with JDBC:

1. Direct - Binary transport mode (Non-Secure|Secure)
2. Direct - HTTP transport mode (Non-Secure|Secure)
3. ZooKeeper - Binary transport mode (Non-Secure|Secure)
4. ZooKeeper - HTTP transport mode (Non-Secure|Secure)
5. Via Knox - HTTP transport mode

Connecting to HS2 via ZooKeeper (3-4), and Knox if backed by ZooKeeper, provides a level of failover that you can't get directly. When connecting through ZooKeeper, the client is handed server connection information from a list of available servers. This list is managed on the backend, and the client isn't aware of it before the connection, which allows administrators to add servers to the list without reconfiguring the clients.

NOTE: HS2 in this configuration provides failover only; it is not automatic once a connection has been established. JDBC connections are stateful: the data and session information kept on HS2 for a connection is LOST when the server goes down, and jobs currently in progress will be affected. You will need to reconnect to continue, at which point you will be able to resubmit your job. Once an HS2 instance goes down, ZooKeeper will not forward connection requests to that server, so by reconnecting after an HS2 failure you will connect to a working HS2 instance.

URL Syntax

jdbc:hive2://zookeeper_quorum|hs2_host:port/[db][;principal=<hs2_principal>/<hs2_host>|_HOST@<KDC_REALM>][;transportMode=binary|http][;httpPath=<http_path>][;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=<zk_namespace>][;ssl=true|false][;sslKeyStore=<key_store_path>][;keyStorePassword=<key_store_password>][;sslTrustStore=<trust_store_path>][;trustStorePassword=<trust_store_password>][;twoWay=true|false]

Assumptions:
- HS2 Host(s): m1.hdp.local and m2.hdp.local
- HS2 Binary Port: 10010
- HS2 HTTP Port: 10011
- ZooKeeper Quorum: m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181
- HttpPath: cliservice
- HS2 ZooKeeper Namespace: hiveserver2
- User: barney
- Password: bedrock

NOTE: <db> in the examples below is the database and is optional. The leading slash '/' is required.

WARNING: When using 'beeline' and specifying the connection url (-u) at the command line, be sure to quote the url.

Non-Secure Environments

Direct - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10010/<db>"

Direct - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10011/<db>;transportMode=http;httpPath=cliservice"

ZooKeeper - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>"

ZooKeeper - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;transportMode=http;httpPath=cliservice"

Alternate Connectivity Through Knox
jdbc:hive2://<knox_host>:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=<password>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/<CLUSTER>/hive

Secure Environments

Additional Assumptions:
- KDC Realm: HDP.LOCAL
- HS2 Principal: hive

The 'principal' used in the examples below can use either the FQDN of the HS2 host or '_HOST'. '_HOST' is globally replaced based on your Kerberos configuration if you haven't altered the default Kerberos regex patterns in ...

NOTE: The client is required to 'kinit' before connecting through JDBC. The -n and -p (user/password) options aren't necessary; they are handled by the Kerberos ticket principal.

Direct - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10010/<db>;principal=hive/_HOST@HDP.LOCAL"

Direct - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10011/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"

ZooKeeper - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;principal=hive/_HOST@HDP.LOCAL"

ZooKeeper - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"
11-17-2015
10:20 AM
Almost forgot... another side effect of small files on a cluster shows up while running the Balancer: you'll move LESS data and increase the impact on the NameNode even further. So YES, small files are bad. But only as bad as you're willing to "pay" for in most cases. If you don't mind doubling the size of the cluster to address some files, then do it. But I wouldn't. I'd do a bit of planning and refinement and bank the extra nodes for more interesting projects. 🙂
11-17-2015
10:15 AM
22 Kudos
I've seen several systems with 400+ million objects represented in the NameNode without issues. In my opinion, that's not the "right" question, though. Certainly, the classic answer to small files has been the pressure they put on the NameNode, but that's only a part of the equation. And with hardware/CPU advances and increased memory thresholds, that number has certainly climbed over the years since the small-file problem was first documented. The better question is: how do small files "impact" cluster performance?

Everything is a trade-off when dealing with data at scale. The impact of small files, beyond the NameNode pressure, is more specifically related to "job" performance. Under classic MR, the number of small files controls the number of mappers required to perform a job. Of course, there are tricks to "combine" inputs and reduce this, but that leads to a lot of data moving across the backplane and increased cluster I/O chatter. A mapper, in the classic sense, is a costly resource to allocate. If the actual task done by the mapper is rather mundane, most of the time spent accomplishing your job can be "administrative" in nature, in the construction and management of all those resources. Consider the impact on a cluster when this happens. For example, I once had a client that was trying to get more from their cluster, but there was a job processing 80,000 files, which led to the creation of 80,000 mappers, which led to consuming ALL the cluster resources, several times over. Follow that path a bit further and you'll find that the impact on the NameNode is exacerbated by all of the intermediate files generated by the mappers for the shuffle/sort phases. That's the real impact on a cluster. A little work in the beginning can have a dramatic effect on the downstream performance of your jobs. Take the time to "refine" your data and consolidate your files.

Here's another way to approach it, which is even more evident when dealing with ORC files. Processing a 1 MB file has an overhead to it, so processing 128 1 MB files will cost you 128 times more "administrative" overhead than processing one 128 MB file. In plain text, that 1 MB file may contain 1,000 records, while the 128 MB file might contain 128,000 records. And I've typically seen an 85-92% compression ratio with ORC files, so you could safely say that a 128 MB ORC file contains over 1 million records. (Sidebar: this may have been why the default stripe size in ORC was changed to 64 MB, instead of 128 MB, a few versions back.)

The impact is multi-fold. With data locality, you move less data, process larger chunks of data at a time, generate fewer intermediate files, reduce the impact on the NameNode, and increase throughput overall, EVERYWHERE. The system moves away from being I/O bound to being CPU bound. Now you have the opportunity to tune container sizes to match "what" you're doing, because the container is actually "doing" a lot of work processing your data, not "managing" the job.

Sometimes small files can't be avoided, but deal with them early to limit the repetitive impact on your cluster. Here's a list of general patterns for reducing the number of small files:

- NiFi - Use a combining (merge) processor to consolidate flows and aggregate data before it even gets to your cluster.
- Flume - Use a tiered Flume architecture to combine events from multiple inputs, producing "right"-sized HDFS files for further refinement.
- Hive - Process the small files regularly and often to produce larger files for "repetitive" processing. And in a classic pattern that incrementally "appends" to a dataset, creating a LOT of files over time, don't be afraid to go back and "reprocess" the file set again to streamline the impact on downstream tasks.
- Sqoop - Manage the number of mappers to generate appropriately sized files.

Oh, and if you NEED to keep those small files as "sources", archive them using the Hadoop archive resource ('har') and save your NameNode from the cost of managing those resource objects; see the sketch below.
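For the archive option, a minimal sketch (the names and paths are examples):

```
# Pack the contents of /data/small/logs into a single HAR under /data/archive
hadoop archive -archiveName logs.har -p /data/small/logs /data/archive
# Read it back through the har:// scheme
hdfs dfs -ls har:///data/archive/logs.har
```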
10-14-2015
09:34 AM
1 Kudo
Use NiFi to get the data to HDFS, and then Oozie datasets to trigger actions based on data availability. Until NiFi, various versions of the method you describe were common practice.
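For the Oozie side, a minimal coordinator dataset sketch (the URI template, frequency, and dates are examples):

```
<dataset name="daily-input" frequency="${coord:days(1)}"
         initial-instance="2015-10-01T00:00Z" timezone="UTC">
  <uri-template>hdfs://ns1/data/incoming/${YEAR}/${MONTH}/${DAY}</uri-template>
  <!-- the coordinator action fires only once this flag file exists -->
  <done-flag>_SUCCESS</done-flag>
</dataset>
```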
10-14-2015
09:14 AM
2 Kudos
Try creating a temporary database and moving the table 'as is' into it:

CREATE DATABASE IF NOT EXISTS Junk;
USE targetDB;
-- move the corrupt table out of the way (a metadata-only rename)
ALTER TABLE MyCorruptTable RENAME TO Junk.MyMovedCorruptTable;
-- then drop the throwaway database along with the table
DROP DATABASE Junk CASCADE;
10-14-2015
08:47 AM
1 Kudo
You need to increase the memory settings for Ambari. I ran into this a while back with certain views. I added/adjusted the following in /var/lib/ambari-server/ambari-env.sh, in "AMBARI_JVM_ARGS":

-Xmx4G -XX:MaxPermSize=512m
10-13-2015
03:07 PM
1 Kudo
It should be GRANT ALL on just its Oozie database, because the 'oozie' user needs to be able to create the schema in the target database.
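For MySQL, a minimal sketch (the host wildcard and password are examples):

```
CREATE DATABASE oozie;
GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'oozie_password';
FLUSH PRIVILEGES;
```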
10-13-2015
02:53 PM
1 Kudo
Using that key, and signed in to the Ambari server as 'root', can you SSH to the target hosts from a command line? If you can't, double-check the permissions of the "public" key on the target hosts: ~/.ssh should be 700 and the authorized_keys file should be 600.
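For example (the key path and host name are placeholders):

```
# On each target host
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# From the Ambari server, verify the key works non-interactively
ssh -i /root/.ssh/id_rsa root@target01.hdp.local 'hostname'
```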