Member since 07-30-2019

53 Posts · 136 Kudos Received · 16 Solutions

**My Accepted Solutions**
| Title | Views | Posted |
|---|---|---|
|  | 11257 | 01-30-2017 05:05 PM |
|  | 6582 | 01-13-2017 03:46 PM |
|  | 3143 | 01-09-2017 05:36 PM |
|  | 2078 | 01-09-2017 05:29 PM |
|  | 1501 | 10-07-2016 03:34 PM |

---

**04-04-2018 11:42 AM**

Does the user have file-system-level access to the warehouse directory you've specified? The docs indicate that `spark.sql.warehouse.dir` is optional when Hive is already present and you're attaching to a metastore:

> Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the `hive-site.xml`, the context automatically creates `metastore_db` in the current directory and creates a directory configured by `spark.sql.warehouse.dir`, which defaults to the directory `spark-warehouse` in the current directory that the Spark application is started.

Try omitting that setting from your application.
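A minimal sketch of that suggestion (the class, jar, and master values are placeholders, not from the original thread):

```shell
# Launch without overriding spark.sql.warehouse.dir. With hive-site.xml on
# the classpath, Spark attaches to the existing metastore and warehouse.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  myapp.jar
# With no hive-site.xml present, Spark instead creates ./metastore_db and
# ./spark-warehouse in the directory the application was started from.
```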

---

**06-29-2017 05:53 PM** · 2 Kudos

That will only set it for newly created files. Using the HDFS client, set the replication factor on the existing directory to the new value.
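The existing data can be adjusted from the command line; a sketch (path and replication factor are placeholders):

```shell
# Recursively set replication to 2 on everything under the directory;
# -w waits until each file actually reaches the new factor (can be slow).
hdfs dfs -setrep -R -w 2 /path/to/directory
```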

---

**01-30-2017 05:05 PM**

Rahul,

Are the logs making it to HDFS? It sounds like you might be combining the "spooling" directory with the "local audit archive directory". What properties did you use during the Ranger HDFS plugin installation? Are you doing a manual install or using Ambari? If manual, this reference might help: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_command-line-installation/content/installing_ranger_plugins.html#installing_ranger_hdfs_plugin

I wasn't able to locate your "...filespool.archive.dir" property on my cluster. I'm not sure the property is required, and it may be responsible for keeping files locally that you've already posted to HDFS. If the files are making it to HDFS, I would try removing this setting.

What do you have set for the property below? And are its contents being flushed from that location on a regular basis?

`xasecure.audit.destination.hdfs.batch.filespool.dir`

Compression doesn't happen during this process. Once the files are on HDFS, you're free to do with them as you see fit. If compression is part of that, write an MR job to do it. (Warning: that could affect other systems that want to use these files as-is.)

Cheers,
David
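As a rough way to confirm where the audits are actually landing, something like the following can help (both paths here are assumptions; substitute the values from your plugin properties):

```shell
# Is the HDFS audit destination receiving files?
hdfs dfs -ls -R /ranger/audit/hdfs | tail
# Is the local spool directory being drained, or growing without bound?
ls -l /var/log/hadoop/hdfs/audit/spool
```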

---

**01-27-2017 01:29 PM**

Those are intermediate directories used to store the stream of audit activity locally before it's written to HDFS. You should have destination directories in HDFS as the final resting place. In my experience, when this issue happens and you don't see those directories in HDFS, it's either a permissions issue or the directories simply haven't been created. You may need to create the directories in HDFS manually and ensure they have the proper ACLs to allow them to be written to by the process.
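A minimal sketch of pre-creating the destination (the path, owner, and mode are assumptions; match them to your plugin's destination property and service accounts):

```shell
hdfs dfs -mkdir -p /ranger/audit/hdfs        # final resting place for audits
hdfs dfs -chown hdfs:hdfs /ranger/audit/hdfs # owned by the writing service
hdfs dfs -chmod 750 /ranger/audit/hdfs       # writable by the audit writer only
```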

---

**01-13-2017 03:46 PM**

distcp recognizes the s3[a] protocols via the default libraries already available in Hadoop. For example, to move data from Hadoop to S3:

`hadoop distcp <current_cluster_folder> s3[a]://<bucket_info>`

If you're looking for ways to manage access (via AWS keys) to S3 buckets in Hadoop, this article describes a secure way to do that: https://community.hortonworks.com/articles/59161/using-hadoop-credential-api-to-store-aws-secrets.html
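Combining the two, a sketch of a distcp run that pulls AWS keys from a credential provider instead of the command line (the bucket name and jceks path are placeholders):

```shell
# The credential store is assumed to already hold fs.s3a.access.key and
# fs.s3a.secret.key entries, per the article above.
hadoop distcp \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/admin/aws.jceks \
  /data/source s3a://my-bucket/data/source
```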

---

**01-13-2017 03:23 PM**

Jacqualin,

Yes, both the local dir and log dir support multiple locations, and I advise using multiple locations to scale better. These directories aren't in HDFS and therefore don't get HDFS replication, but that's OK: they're used for file caches and intermediate data. If you lose a drive in the middle of processing, only the "task" is affected, which may fail; the task is rescheduled somewhere else, so the job as a whole isn't affected. A failed drive in yarn_local_dir is OK, as the NodeManager will tag it and not use it going forward. That's one more reason to have more than one drive specified here. BUT, in older versions of YARN, a failed drive can prevent the NodeManager from starting or restarting. It's pretty clear in the NodeManager logs if you have issues with it starting at any time. YARN also indicates drive failures in the ResourceManager UI. Newer versions of YARN are a bit more forgiving on startup.

---

**01-09-2017 05:36 PM** · 1 Kudo

Having multiple values here allows for better scalability and performance for YARN's intermediate writes/reads. Much like HDFS has multiple directories (preferably on different mount points/physical drives), YARN local dirs can use this to spread the I/O load. I've also seen customers use SSD drives for YARN local dirs, which can significantly improve job performance. For example, in a 12-drive system: 8 SATA drives for HDFS directories and 4 smaller, fast SSD drives for YARN local dirs.
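A quick sanity check across the configured locations might look like this (the mount points are purely illustrative):

```shell
# Confirm each configured YARN local dir exists and is writable by the
# yarn user; a dir failing here is what the NodeManager would flag.
for d in /grid/ssd0/yarn/local /grid/ssd1/yarn/local; do
  sudo -u yarn test -w "$d" && echo "OK  $d" || echo "BAD $d"
done
```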

---

**01-09-2017 05:29 PM**

Could you please identify which version of Ambari you are running? In these situations, I usually drop down to the host that is presenting the issue and try to run the command there; this may provide a bit more detail on the actual problem. In this case, you may find that you need to remove the offending package (`yum erase <specific_package>`) and then have Ambari try to reinstall the packages.
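On the failing host, the manual steps might look like this (the package names are placeholders; use the ones named in the Ambari error):

```shell
yum install <failing_package>   # reproduce the error with full output
yum erase <offending_package>   # clear the conflicting install
# then retry the install/start operation from the Ambari UI
```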

---

**12-15-2016 01:40 PM** · 9 Kudos

**The Problem**

Traditional `distcp` from one directory to another, or from cluster to cluster, is quite useful for moving massive amounts of data, once. But what happens when you need to "update" a target directory or cluster with only the changes made since the last `distcp` ran? That becomes a very tricky scenario. `distcp` offers an `-update` flag, which is supposed to move only the files that have changed. In this case `distcp` pulls a list of files and directories from the source and target, compares them, and then builds a migration plan.

The problem: it's an expensive and time-consuming task, and the process is not atomic. First, gathering a list of files and directories, along with their metadata, is expensive when you're considering sources with millions of file and directory objects. This cost is incurred on both the source and target namenodes, putting quite a bit of pressure on those systems. It's then up to `distcp` to reconcile the difference between source and target, which is also very expensive. Only when that finally completes does the process start to move data. And if data changes while the process is running, those changes can impact the transfer and lead to failure and partial migration.

**The Solution**

The process needs to be atomic, and it needs to be efficient. With Hadoop 2.0, HDFS introduced "snapshots". An HDFS snapshot is a point-in-time copy of a directory's metadata. The copy is stored in a hidden location and maintains references to all of the immutable filesystem objects. Creating a snapshot is atomic, and because HDFS data is immutable, an image of a directory's metadata doesn't require an additional copy of the underlying data.

Another feature of snapshots is the ability to efficiently calculate changes between *any* two snapshots of the same directory. Using `hdfs snapshotDiff`, you can build a list of changes between two point-in-time references.

For example:

```
[hdfs@m3 ~]$ hdfs snapshotDiff /user/dstreev/stats s1 s2
Difference between snapshot s1 and snapshot s2 under directory /user/dstreev/stats:
M       .
+       ./attempt
M       ./namenode/fs_state/2016-12.txt
M       ./namenode/nn_info/2016-12.txt
M       ./namenode/top_user_ops/2016-12.txt
M       ./scheduler/queue_paths/2016-12.txt
M       ./scheduler/queue_usage/2016-12.txt
M       ./scheduler/queues/2016-12.txt
```

Let's take the `distcp -update` concept and supercharge it with the efficiency of snapshots. Now you have a solution that scales far beyond the original `distcp -update`, and in the process removes the burden and load previously placed on the namenodes.

**Prerequisites and Requirements**

- The source must support snapshots: `hdfs dfsadmin -allowSnapshot <path>`
- The target is "read-only"
- The target, after the initial baseline `distcp` sync, needs to support snapshots

**Process**

1. Identify the source and target 'parent' directories. Do not create the destination directory up front; allow the first `distcp` to do that. For example, if I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the `a_target` directory.
2. Allow snapshots on the source directory: `hdfs dfsadmin -allowSnapshot /data/a`
3. Create a snapshot of `/data/a`: `hdfs dfs -createSnapshot /data/a s1`
4. Distcp the baseline copy (from the atomic snapshot). Note: `/data/a_target` does NOT exist prior to this command: `hadoop distcp /data/a/.snapshot/s1 /data/a_target`
5. Allow snapshots on the newly created target directory: `hdfs dfsadmin -allowSnapshot /data/a_target`. At this point `/data/a_target` should be considered "read-only". Do NOT make any changes to the content here.
6. Create a matching snapshot in `/data/a_target` with the same name as the snapshot used to build the baseline: `hdfs dfs -createSnapshot /data/a_target s1`
7. Add some content to the source directory `/data/a`: make the changes, adds, deletes, etc. that need to be replicated to `/data/a_target`.
8. Take a new snapshot of `/data/a`: `hdfs dfs -createSnapshot /data/a s2`
9. Just for fun, check what's changed between the two snapshots: `hdfs snapshotDiff /data/a s1 s2`
10. Now migrate the changes to `/data/a_target`: `hadoop distcp -diff s1 s2 -update /data/a /data/a_target`
11. When that's completed, finish the cycle by creating a matching snapshot on the target: `hdfs dfs -createSnapshot /data/a_target s2`

That's it. You've completed the cycle. Rinse and repeat.

**A Few Hints**

- Snapshots need to be managed manually. They stay around forever unless you clean them up with `hdfs dfs -deleteSnapshot`. As long as a snapshot exists, the data exists: deleting data (even with skipTrash) from a directory that has a snapshot doesn't free up space. Only when all references to that data are gone can space be reclaimed.
- Initial migrations of data between systems are very expensive in network I/O, and you probably don't ever want to do that again. I recommend keeping a snapshot of the original copy on each system, or some major checkpoint you can go back to, in case the process is compromised.
- If `distcp` can't validate that the snapshot (by name) is the same on the source and the target, and that the data at the target hasn't changed since that snapshot, the process will fail. If the failure is because the target directory was updated, use the baseline snapshots above to restore it without migrating all that data again, and then start the process up again.
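Once the baseline is in place, the repeating portion of the cycle above can be sketched as one script (paths and snapshot names follow the example; advance PREV/CUR on each pass):

```shell
SRC=/data/a
TGT=/data/a_target
PREV=s1   # snapshot name both sides currently share
CUR=s2    # new snapshot for this pass

hdfs dfs -createSnapshot "$SRC" "$CUR"      # freeze the source, atomically
hdfs snapshotDiff "$SRC" "$PREV" "$CUR"     # optional: inspect the delta
hadoop distcp -diff "$PREV" "$CUR" -update "$SRC" "$TGT"
hdfs dfs -createSnapshot "$TGT" "$CUR"      # mark the target as in sync
```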

---

**10-07-2016 03:34 PM**

Enabling Ranger audits will show who made the SQL call and what query was issued to HS2. This is more metadata-centric; the actual data transferred is not logged in any permanent fashion. That would be the responsibility of the client. But the combination of the audit (who and what), along with possibly an HDFS snapshot, can lead to a reproducible scenario.