Member since: 07-30-2019
Posts: 53
Kudos Received: 135
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2480 | 01-30-2017 05:05 PM
 | 1318 | 01-13-2017 03:46 PM
 | 677 | 01-09-2017 05:36 PM
 | 348 | 01-09-2017 05:29 PM
 | 250 | 10-07-2016 03:34 PM
06-13-2018
10:13 AM
You could run TeraGen/TeraSort for this. There's a script on my gist.github.com page that can be run against an HDP cluster; it lets you control the data size, mapper and reducer counts from the command line, and even experiment with block sizes.
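For reference, a minimal sketch of the kind of run such a script wraps (the examples jar path is typical for an HDP install; sizes and counts are illustrative):

```
# TeraGen: write 10,000,000,000 rows of 100 bytes each (~1 TB),
# controlling the mapper count and block size
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen \
  -Dmapreduce.job.maps=100 -Ddfs.blocksize=268435456 \
  10000000000 /benchmarks/teragen

# TeraSort: sort the generated data, controlling the reducer count
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort \
  -Dmapreduce.job.reduces=100 \
  /benchmarks/teragen /benchmarks/terasort
```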
04-04-2018
11:42 AM
Does the user have access (at the file system level) to the warehouse directory you've specified? The docs seem to indicate that 'spark.sql.warehouse.dir' is optional when Hive is already present and you're attaching to a metastore: --- Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. --- Try omitting that setting from your application.
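In the meantime, a quick way to check file-system-level access (the path shown is an example; use the one you configured):

```
# Does the submitting user own, or have write access to, the warehouse path?
hdfs dfs -ls -d /apps/hive/warehouse
hdfs dfs -getfacl /apps/hive/warehouse
```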
04-04-2018
11:31 AM
Are all the nodes sharing the same user/group mapping? The NameNode is responsible for doing the group lookup for the user, so if the user/group mapping isn't present there, your results will not match.
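You can compare the two mappings directly (the user name is a placeholder):

```
# Groups as resolved by the NameNode (this is what authorization actually uses)
hdfs groups alice
# Groups as resolved locally on the client/edge node
id alice
```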
06-29-2017
05:53 PM
2 Kudos
That will only set it for newly created directories. Using the HDFS client, set the replication factor for the directory to the new value.
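A minimal sketch (the factor and path are examples):

```
# Re-apply the new replication factor recursively to existing files;
# -w waits for the re-replication to complete
hdfs dfs -setrep -R -w 2 /data/mydir
```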
03-06-2017
02:08 PM
1 Kudo
The packaging isn't any different, although the embedded directory structure may not match the public repo. Every company I've worked with that manages local repos has its own directory structure anyhow. You're more than welcome to alter that directory structure to fit your environment.
01-30-2017
05:05 PM
Rahul, are the logs making it to HDFS? It sounds like you might be conflating the "spooling" directory with the "local audit archive directory". What properties did you use during the Ranger HDFS plugin installation? Are you doing a manual install or using Ambari? If manual, this reference might help: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_command-line-installation/content/installing_ranger_plugins.html#installing_ranger_hdfs_plugin I wasn't able to locate your "...filespool.archive.dir" property on my cluster, so I'm not sure the property is required; it may also be what's keeping local copies of the files you've already posted to HDFS. If the files are making it to HDFS, I would try removing this setting. What do you have set for the property below, and are the contents being flushed from that location on a regular basis? xasecure.audit.destination.hdfs.batch.filespool.dir Compression doesn't happen during this process. Once the files are on HDFS, you're free to do with them as you see fit; if compression is a part of that, write an MR job to do so. (WARNING: this could affect other systems that might want to use these files as-is.) Cheers, David
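For reference, these are the spool-related audit settings I would review side by side; the property names follow the plugin's audit configuration as I recall it, and the values are examples only:

```
xasecure.audit.destination.hdfs=true
xasecure.audit.destination.hdfs.dir=hdfs://<nn_host>:8020/ranger/audit
xasecure.audit.destination.hdfs.batch.filespool.dir=/var/log/hadoop/hdfs/audit/hdfs/spool
```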
01-27-2017
01:29 PM
Those are intermediate directories used to store the stream of activity locally before it's written to HDFS. You should have destination directories in HDFS for the final resting place. In my experience, when this issue happens and you don't see those directories in HDFS, it's either a permissions issue or the directories simply need to be created manually. Create the directories in HDFS by hand and ensure they have the proper ACLs to allow the process to write to them.
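A minimal sketch of creating the destination by hand (the path and owner are examples; match them to your audit configuration):

```
hdfs dfs -mkdir -p /ranger/audit/hdfs
hdfs dfs -chown hdfs:hdfs /ranger/audit/hdfs
hdfs dfs -chmod 750 /ranger/audit/hdfs
```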
01-13-2017
03:46 PM
distcp recognizes the s3[a] protocols from the default libraries already available in Hadoop. For example, moving data from Hadoop to S3: hadoop distcp <current_cluster_folder> s3[a]://<bucket_info> If you're looking for ways to manage access (via AWS keys) to S3 buckets in Hadoop, this article describes a great, secure way to do that: https://community.hortonworks.com/articles/59161/using-hadoop-credential-api-to-store-aws-secrets.html
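A concrete sketch (the bucket and paths are examples; the AWS credentials come from your configuration or credential provider, per the article above):

```
hadoop distcp /data/warehouse/export s3a://my-example-bucket/warehouse/export
```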
01-13-2017
03:23 PM
Jacqualin, yes, the local dir and log dir both support multiple locations, and I advise using multiple locations to scale better. These directories aren't in HDFS and therefore don't get HDFS replication, but that's OK; they're used for file caches and intermediate data. If you lose a drive in the middle of processing, only the "task" using it is affected, and it may fail. In that case, the task is rescheduled somewhere else, so the job as a whole survives. A failed drive in yarn_local_dir is OK, as the NodeManager will tag it and not use it going forward; one more reason to have more than one drive specified here. BUT in older versions of YARN, a failed drive can prevent the NodeManager from "starting" or "restarting." It's pretty clear in the NodeManager logs if you have issues with it starting at any time. YARN also indicates drive failures in the ResourceManager UI. Newer versions of YARN are a bit more forgiving on startup.
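A sketch of what multiple locations look like in yarn-site.xml (the mount points are examples):

```
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/0/yarn/local,/grid/1/yarn/local,/grid/2/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/grid/0/yarn/log,/grid/1/yarn/log,/grid/2/yarn/log</value>
</property>
```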
01-09-2017
05:36 PM
1 Kudo
Having multiple values here allows for better scalability and performance for YARN's intermediate writes/reads. Much as HDFS uses multiple directories (preferably on different mount points/physical drives), the YARN local dirs can spread the I/O load. I've also seen a trend where customers use SSD drives for the YARN local dirs, which can significantly improve job performance. E.g., on a 12-drive system: 8 SATA drives for HDFS directories and 4 smaller, fast SSD drives for the YARN local dirs.
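Sketching that 12-drive split in the config files (the mount points are examples):

```
<!-- hdfs-site.xml: eight SATA drives for HDFS block storage -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/grid/0/hdfs/data,/grid/1/hdfs/data,/grid/2/hdfs/data,/grid/3/hdfs/data,/grid/4/hdfs/data,/grid/5/hdfs/data,/grid/6/hdfs/data,/grid/7/hdfs/data</value>
</property>
<!-- yarn-site.xml: four SSDs for the YARN local dirs -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/ssd/0/yarn/local,/ssd/1/yarn/local,/ssd/2/yarn/local,/ssd/3/yarn/local</value>
</property>
```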
01-09-2017
05:29 PM
Could you please identify which version of Ambari you are running? In these situations, I usually drop down to the host that is presenting the issue and try to run the command on the host; this may provide a bit more detail on the actual issue. In this case, you may find that you need to remove the offending package with yum erase <specific_package>, then have Ambari try to reinstall the packages.
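For example (the package names are placeholders):

```
# On the failing host, see what's already (partially) installed,
# remove the conflict, then retry the install from Ambari
yum list installed | grep -i <package_prefix>
yum erase <specific_package>
```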
12-15-2016
01:40 PM
9 Kudos
The Problem

A traditional 'distcp' from one directory to another, or from cluster to cluster, is quite useful for moving massive amounts of data, once. But what happens when you need to "update" a target directory or cluster with only the changes made since the last 'distcp' ran? That becomes a very tricky scenario. 'distcp' offers an '-update' flag, which is supposed to move only the files that have changed. In this case, 'distcp' pulls a list of files and directories from the source and the target, compares them, and then builds a migration plan. The problem: it's an expensive and time-consuming task, and the process is not atomic. First, the cost of gathering a list of files and directories, along with their metadata, is high when you're considering sources with millions of file and directory objects. This cost is incurred on both the source and target NameNodes, putting quite a bit of pressure on those systems. It's then up to 'distcp' to reconcile the differences between the source and target, which is very expensive. Only when that's finally complete does the process start to move data. And if data changes while the process is running, those changes can impact the transfer and lead to failure and partial migration.

The Solution

The process needs to be atomic, and it needs to be efficient. With Hadoop 2.0, HDFS introduced "snapshots." An HDFS snapshot is a point-in-time copy of a directory's metadata. The copy is stored in a hidden location and maintains references to all of the immutable filesystem objects. Creating a snapshot is atomic, and because HDFS files are immutable, an image of a directory's metadata doesn't require an additional copy of the underlying data. Another feature of snapshots is the ability to efficiently calculate the changes between 'any' two snapshots of the same directory. Using 'hdfs snapshotDiff', you can build a list of "changes" between these two point-in-time references. For example:

```
[hdfs@m3 ~]$ hdfs snapshotDiff /user/dstreev/stats s1 s2
Difference between snapshot s1 and snapshot s2 under directory /user/dstreev/stats:
M       .
+       ./attempt
M       ./namenode/fs_state/2016-12.txt
M       ./namenode/nn_info/2016-12.txt
M       ./namenode/top_user_ops/2016-12.txt
M       ./scheduler/queue_paths/2016-12.txt
M       ./scheduler/queue_usage/2016-12.txt
M       ./scheduler/queues/2016-12.txt
```
Let's take the 'distcp' update concept and supercharge it with the efficiency of snapshots. Now you have a solution that scales far beyond the original 'distcp -update' and, in the process, removes the burden and load previously placed on the NameNodes.

Pre-Requisites and Requirements

- The source must support snapshots: hdfs dfsadmin -allowSnapshot <path>
- The target is "read-only".
- The target, after the initial baseline 'distcp' sync, needs to support snapshots.

Process

1. Identify the source and target 'parent' directories. Do not initially create the destination directory; allow the first distcp to do that. For example, if I want to sync source /data/a with /data/a_target, do *NOT* pre-create the 'a_target' directory.
2. Allow snapshots on the source directory: hdfs dfsadmin -allowSnapshot /data/a
3. Create a snapshot of /data/a: hdfs dfs -createSnapshot /data/a s1
4. Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to this command. hadoop distcp /data/a/.snapshot/s1 /data/a_target
5. Allow snapshots on the newly created target directory: hdfs dfsadmin -allowSnapshot /data/a_target At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
6. Create a matching snapshot in /data/a_target with the same name as the snapshot used to build the baseline: hdfs dfs -createSnapshot /data/a_target s1
7. Add some content to the source directory /data/a. Make the changes, adds, deletes, etc. that need to be replicated to /data/a_target.
8. Take a new snapshot of /data/a: hdfs dfs -createSnapshot /data/a s2
9. Just for fun, check what's changed between the two snapshots: hdfs snapshotDiff /data/a s1 s2
10. Now let's migrate the changes to /data/a_target: hadoop distcp -diff s1 s2 -update /data/a /data/a_target
11. When that's completed, finish the cycle by creating a matching snapshot on /data/a_target: hdfs dfs -createSnapshot /data/a_target s2

That's it. You've completed the cycle. Rinse and repeat; see the wrapper sketch below.

A Few Hints

- Remember, snapshots need to be managed manually. They will stay around forever unless you clean them up with: hdfs dfs -deleteSnapshot <path> <snapshotName>
- As long as a snapshot exists, the data exists. Deleting data (even with -skipTrash) from a directory that has a snapshot doesn't free up space; only when all "references" to that data are gone can the space be reclaimed.
- Initial migrations of data between systems are very expensive in terms of network I/O, and you probably don't want to have to do that again, ever. I recommend keeping a snapshot of the original copy on each system, OR some major checkpoint you can go back to, in case the process is compromised.
- If 'distcp' can't validate that the snapshots (by name) on the source and the target are the same, and that the data at the target hasn't changed since the snapshot, the process will fail. If the failure is because the target directory has been updated, you'll need to use the baseline snapshots above to restore it without migrating all that data again, and then start the process up again.
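To make the cycle repeatable, here's a minimal wrapper sketch, assuming the baseline (steps 1 through 6 above) is already in place and the name of the last snapshot that exists on BOTH sides is passed in; the paths and naming scheme are examples:

```
#!/usr/bin/env bash
set -e
SRC=/data/a
TGT=/data/a_target
PREV=$1                         # e.g. s1: last snapshot present on both source and target
NEXT=s$(date +%Y%m%d%H%M)       # name for this cycle's snapshot

hdfs dfs -createSnapshot "$SRC" "$NEXT"
hadoop distcp -diff "$PREV" "$NEXT" -update "$SRC" "$TGT"
hdfs dfs -createSnapshot "$TGT" "$NEXT"

# Once you no longer need $PREV as a restore point, reclaim the space:
# hdfs dfs -deleteSnapshot "$SRC" "$PREV"
# hdfs dfs -deleteSnapshot "$TGT" "$PREV"
```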
10-10-2016
12:38 PM
The dfs.replication setting is applied to a folder's contents at the time of their creation. In your case, the folder already carries the old setting, regardless of what you set in Ambari. You need to "reset" it for the directory, i.e.: 'hdfs dfs -setrep -R <rep> <dir>'
10-08-2016
01:35 PM
1 Kudo
The only other thing that comes to mind is the existence of the "XXX" user on the NameNode and/or their membership in the group "hadoop". If they aren't in the group "hadoop", you may find a setting in the hadoop-policy.xml file called security.client.protocol.acl that is set to "hadoop". This is a way to prevent users not in this group from accessing HDFS. Note that the user account must exist on the NameNode as well. When you submit a request from an edge node, where you obviously have the user account, the id (string version) is sent to the NameNode. The NameNode is responsible for "authorization" and does a group lookup of the user on the NameNode host. If the user doesn't exist there, OR their groups aren't the same as they were on the edge node where you launched the Hive command, you'll have issues like this.
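Two quick checks on the NameNode host ("XXX" being the anonymized user from above):

```
# Does the account exist on the NameNode, and is it in the 'hadoop' group?
id XXX
# What group does the service-level ACL gate access on?
grep -A1 security.client.protocol.acl /etc/hadoop/conf/hadoop-policy.xml
```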
10-07-2016
03:43 PM
Is the cluster Kerberized? And if so, are you running with a principal for the "XXX" user or the "YYY" user?
10-07-2016
03:34 PM
Enabling Ranger audits will show who made the SQL call and what query was issued to HS2. This is more "metadata" centric; the actual data transferred is not logged in any permanent fashion. That would be the responsibility of the client. But the combination of the audit (who and what), along with possibly an HDFS snapshot, can lead to a reproducible scenario.
10-07-2016
03:29 PM
2 Kudos
NameNode federation is the assignment of directories to a pair of HA NameNodes, like a mount point on Linux. The process of directing clients to the right NameNode is the responsibility of the HDFS client, and if you're using the standard HDFS libraries, having a "properly" configured hdfs-site.xml file in the path of the Java client will handle that transparently. This config file (hdfs-site.xml) contains all the NN HA and federation settings needed.
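A minimal client-side hdfs-site.xml sketch for one HA nameservice (host names are examples; a federated client simply lists additional nameservices and their address mappings):

```
<property><name>dfs.nameservices</name><value>ns1</value></property>
<property><name>dfs.ha.namenodes.ns1</name><value>nn1,nn2</value></property>
<property><name>dfs.namenode.rpc-address.ns1.nn1</name><value>m1.hdp.local:8020</value></property>
<property><name>dfs.namenode.rpc-address.ns1.nn2</name><value>m2.hdp.local:8020</value></property>
<property>
  <name>dfs.client.failover.proxy.provider.ns1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```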
10-07-2016
03:23 PM
2 Kudos
You can force the "replication" of under-replicated blocks by issuing the setrep command on the file/directory. I use this technique to accelerate the repair of under-replicated blocks before an upgrade attempt, to get to an optimal state. Otherwise, you're at the mercy of the NameNode to schedule the process.
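For example (the factor and path are illustrative; -w blocks until the files reach the target factor):

```
hdfs dfs -setrep -R -w 3 /apps/hive/warehouse
```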
10-07-2016
03:17 PM
1 Kudo
You'll have to re-apply the replication factor to the directories you're seeing this warning on. The dfs.replication setting is applied to directories/files at the time of creation, and if the cluster was initially built while the value was set to 3 (the default), then all the cluster's files and folders created at that time already have it applied. You'll need to reset it for those directories; the new replication factor will be picked up automatically for new files.
10-07-2016
03:04 PM
4 Kudos
The JournalNodes can be quite I/O intensive, while the NameNode is generally more memory and CPU intensive, so one could justify co-locating them. BUT when it comes to checkpointing, they can conflict. More importantly, delays in writing for the JournalNode will impact the NameNode and result in higher RPC queue times. With a cluster that size, I would always want to run the NameNode by itself; it's far too important to compromise it by co-locating it with another highly active service. And regarding the JournalNode: don't store the journal directories on an LVM volume that's shared with the OS. Again, the JournalNode is quite I/O intensive, and I've seen it project slowness back to the NameNode (in RPC queue times) when the two compete because they share the same physical disks.
03-22-2016
09:51 PM
4 Kudos
Repo Description
Sessions remember directory context, with 'tab' completion, Kerberos support, initialization scripts, and a few new 'hdfs' features you wish you had. New extensions help gather runtime statistics from the NameNode, Scheduler, Job History Server, and container usage. Added support for "lsp", a directory listing "plus" that identifies file information PLUS block information and location; helpful when determining how well your data is distributed across the cluster and for identifying small-file issues.
Repo Info
- Github Repo URL: https://github.com/dstreev/hadoop-cli
- Github account name: dstreev
- Repo name: hadoop-cli
11-17-2015
02:22 PM
21 Kudos
There are five ways to connect to HS2 with JDBC:

1. Direct - Binary transport mode (Non-Secure|Secure)
2. Direct - HTTP transport mode (Non-Secure|Secure)
3. ZooKeeper - Binary transport mode (Non-Secure|Secure)
4. ZooKeeper - HTTP transport mode (Non-Secure|Secure)
5. Via Knox - HTTP transport mode

Connecting to HS2 via ZooKeeper (3-4), and Knox if backed by ZooKeeper, provides a level of failover that you can't get directly. When connecting through ZooKeeper, the client is handed server connection information from a list of available servers. This list is managed on the backend, and the client isn't aware of it before the connection, which allows administrators to add servers to the list without reconfiguring the clients.

NOTE: HS2 in this configuration provides failover only; it is not automatic once a connection has been established. JDBC connections are stateful: the data and session information kept on HS2 for a connection is LOST when the server goes down, and jobs currently in progress will be affected. You will need to reconnect to continue, at which point you will be able to resubmit your job. Once an HS2 instance goes down, ZooKeeper will not forward connection requests to that server, so by reconnecting after an HS2 failure you will connect to a working HS2 instance.

URL Syntax

jdbc:hive2://zookeeper_quorum|hs2_host:port/[db][;principal=<hs2_principal>/<hs2_host>|_HOST@<KDC_REALM>][;transportMode=binary|http][;httpPath=<http_path>][;serviceDiscoveryMode=zookeeper;zooKeeperNamespace=<zk_namespace>][;ssl=true|false][;sslKeyStore=<key_store_path>][;keyStorePassword=<key_store_password>][;sslTrustStore=<trust_store_path>][;trustStorePassword=<trust_store_password>][;twoWay=true|false]

Assumptions:
- HS2 Host(s): m1.hdp.local and m2.hdp.local
- HS2 Binary Port: 10010
- HS2 HTTP Port: 10011
- ZooKeeper Quorum: m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181
- HttpPath: cliservice
- HS2 ZooKeeper Namespace: hiveserver2
- User: barney
- Password: bedrock

NOTE: <db> in the examples below is the database and is optional. The leading slash '/' is required.

WARNING: When using 'beeline' and specifying the connection url (-u) at the command line, be sure to quote the url.

Non-Secure Environments

Direct - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10010/<db>"

Direct - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:10011/<db>;transportMode=http;httpPath=cliservice"

ZooKeeper - Binary Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>"

ZooKeeper - HTTP Transport Mode
beeline -n barney -p bedrock -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;transportMode=http;httpPath=cliservice"

Alternate Connectivity Through Knox
jdbc:hive2://<knox_host>:8443/;ssl=true;sslTrustStore=/var/lib/knox/data/security/keystores/gateway.jks;trustStorePassword=<password>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=gateway/<CLUSTER>/hive

Secure Environments

Additional Assumptions:
- KDC Realm: HDP.LOCAL
- HS2 Principal: hive

The 'principal' used in the examples below can use either the FQDN of the HS2 host or '_HOST'. '_HOST' is globally replaced based on your Kerberos configuration if you haven't altered the default Kerberos regex patterns in ...

NOTE: The client is required to 'kinit' before connecting through JDBC. The -n and -p (user/password) options aren't necessary; they are handled by the Kerberos ticket principal.

Direct - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10010/<db>;principal=hive/_HOST@HDP.LOCAL"

Direct - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:10011/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"

ZooKeeper - Binary Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;principal=hive/_HOST@HDP.LOCAL"

ZooKeeper - HTTP Transport Mode
beeline -u "jdbc:hive2://m1.hdp.local:2181,m2.hdp.local:2181,m3.hdp.local:2181/<db>;principal=hive/_HOST@HDP.LOCAL;transportMode=http;httpPath=cliservice"
11-17-2015
10:20 AM
Almost forgot... another side effect of small files on a cluster shows up while running the Balancer: you'll move LESS data and increase the impact on the NameNode even further. So YES, small files are bad. But only as bad as you're willing to "pay" for in most cases. If you don't mind doubling the size of the cluster to address some files, then do it. But I wouldn't. I'd do a bit of planning and refinement and bank the extra nodes for more interesting projects. 🙂
11-17-2015
10:15 AM
22 Kudos
I've seen several systems with 400+ million objects represented in the NameNode without issues. In my opinion, that's not the "right" question, though. Certainly, the classic answer to small files has been the pressure they put on the NameNode, but that's only a part of the equation. And with hardware/CPU advances and increased memory thresholds, that number has certainly climbed over the years since the small-file problem was first documented. The better question is: how do small files "impact" cluster performance?

Everything is a trade-off when dealing with data at scale. The impact of small files, beyond the NameNode pressure, is more specifically related to "job" performance. Under classic MR, the number of small files controls the number of mappers required to perform a job. Of course, there are tricks to "combine" inputs and reduce this, but that leads to a lot of data moving across the backplane and increased cluster I/O chatter. A mapper, in the classic sense, is a costly resource to allocate. If the actual task done by the mapper is rather mundane, most of the time spent accomplishing your job can be "administrative" in nature, in the construction and management of all those resources. Consider the impact on a cluster when this happens. For example, I once had a client that was trying to get more from their cluster, but there was a job processing 80,000 files, which led to the creation of 80,000 mappers, which led to consuming ALL the cluster resources, several times over. Follow that path a bit further and you'll find that the impact on the NameNode is exacerbated by all of the intermediate files generated by the mappers for the shuffle/sort phases. That's the real impact on a cluster. A little work in the beginning can have a dramatic effect on the downstream performance of your jobs. Take the time to "refine" your data and consolidate your files.

Here's another way to approach it, which is even more evident when dealing with ORC files. Processing a 1 MB file has an overhead to it, so processing 128 1 MB files will cost you 128 times more "administrative" overhead than processing one 128 MB file. In plain text, that 1 MB file may contain 1,000 records, while the 128 MB file might contain 128,000 records. And I've typically seen an 85-92% compression ratio with ORC files, so you could safely say that a 128 MB ORC file contains over 1 million records. (Sidebar: this may have been why the default stripe size in ORC was changed to 64 MB, instead of 128 MB, a few versions back.)

The impact is multi-fold. With data locality, you move less data, process larger chunks of data at a time, generate fewer intermediate files, reduce the impact on the NameNode, and increase throughput overall, EVERYWHERE. The system moves away from being I/O bound to being CPU bound. Now you have the opportunity to tune container sizes to match "what" you're doing, because the container is actually "doing" a lot of work processing your data, not "managing" the job.

Sometimes small files can't be avoided, but deal with them early to limit the repetitive impact on your cluster. Here's a list of general patterns for reducing the number of small files:

- NiFi - Use a combining (merge) processor to consolidate flows and aggregate data before it even gets to your cluster.
- Flume - Use a tiered Flume architecture to combine events from multiple inputs, producing "right"-sized HDFS files for further refinement.
- Hive - Process the small files regularly and often to produce larger files for "repetitive" processing. And in a classic pattern that incrementally "appends" to a dataset, creating a LOT of files over time, don't be afraid to go back and "reprocess" the file set again to streamline the impact on downstream tasks.
- Sqoop - Manage the number of mappers to generate appropriately sized files.

Oh, and if you NEED to keep those small files as "sources", archive them using the Hadoop archive resource ('har') and save your NameNode from the cost of managing those resource objects; see the sketch below.
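For the archive option, a minimal sketch (the names and paths are examples):

```
# Pack the contents of /data/small/logs into a single HAR under /data/archive
hadoop archive -archiveName logs.har -p /data/small/logs /data/archive
# Read it back through the har:// scheme
hdfs dfs -ls har:///data/archive/logs.har
```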
10-14-2015
09:34 AM
1 Kudo
Use NiFi to get the data to HDFS, and then Oozie datasets to trigger actions based on data availability. Until NiFi, various versions of the method you describe were common practice.
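For the Oozie side, a minimal coordinator dataset sketch (the URI template, frequency, and dates are examples):

```
<dataset name="daily-input" frequency="${coord:days(1)}"
         initial-instance="2015-10-01T00:00Z" timezone="UTC">
  <uri-template>hdfs://ns1/data/incoming/${YEAR}/${MONTH}/${DAY}</uri-template>
  <!-- the coordinator action fires only once this flag file exists -->
  <done-flag>_SUCCESS</done-flag>
</dataset>
```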
10-14-2015
09:14 AM
2 Kudos
Try creating a temporary database and moving the table 'as is' into it:

CREATE DATABASE IF NOT EXISTS Junk;
USE targetDB;
-- move the corrupt table out of the way (a metadata-only rename)
ALTER TABLE MyCorruptTable RENAME TO Junk.MyMovedCorruptTable;
-- then drop the throwaway database along with the table
DROP DATABASE Junk CASCADE;
10-14-2015
08:47 AM
1 Kudo
You need to increase the memory settings for Ambari. I ran into this a while back with certain views. I added/adjusted the following in /var/lib/ambari-server/ambari-env.sh, in "AMBARI_JVM_ARGS":

-Xmx4G -XX:MaxPermSize=512m
10-13-2015
03:07 PM
1 Kudo
It should be GRANT ALL on just its Oozie database, because the 'oozie' user needs to be able to create the schema in the target database.
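For MySQL, a minimal sketch (the host wildcard and password are examples):

```
CREATE DATABASE oozie;
GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'oozie_password';
FLUSH PRIVILEGES;
```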
10-13-2015
02:53 PM
1 Kudo
Using that key, and signed in to the Ambari server as 'root', can you SSH to the target hosts from a command line? If you can't, double-check the permissions of the "public" key on the target hosts: ~/.ssh should be 700 and the authorized_keys file should be 600.
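For example (the key path and host name are placeholders):

```
# On each target host
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# From the Ambari server, verify the key works non-interactively
ssh -i /root/.ssh/id_rsa root@target01.hdp.local 'hostname'
```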