Member since 09-23-2015

151 Posts
110 Kudos Received
50 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1810 | 08-12-2016 04:05 AM |
| | 3803 | 08-07-2016 03:58 AM |
| | 1614 | 07-27-2016 06:24 PM |
| | 2014 | 07-20-2016 03:14 PM |
| | 1548 | 07-18-2016 12:54 PM |

03-21-2016 12:21 AM

An easier solution would be to read every row and check whether it is a header row by testing one of the columns for a value that you know appears only in header rows.
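
As a rough illustration of that check, here is a minimal plain-Java sketch. The comma delimiter, the `HEADER_MARKER` value, and the class and method names are all assumptions made for the example; they do not come from the original question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HeaderRowFilter {
    // Assumption: header rows carry this literal in their first column.
    static final String HEADER_MARKER = "id";

    // A row is treated as a header when its first column equals the marker.
    static boolean isHeaderRow(String[] columns) {
        return columns.length > 0 && HEADER_MARKER.equalsIgnoreCase(columns[0].trim());
    }

    // Reads every row and keeps only the ones that are not header rows.
    static List<String[]> dataRows(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] columns = line.split(",", -1);
            if (isHeaderRow(columns)) {
                continue; // headers may repeat, e.g. once per concatenated file
            }
            rows.add(columns);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,name", "1,alice", "id,name", "2,bob");
        for (String[] row : dataRows(lines)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Because the check runs on every row rather than only the first one, it also handles inputs where a header is repeated in the middle of the data.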
						
					
03-21-2016 12:03 AM

I'm not sure why you are getting that exception, but I do know that you do not need to get the path of any input files on the exam. What are you trying to do? Showing more of your code might help me to provide more insight.
						
					
03-18-2016 04:05 AM
1 Kudo

There is some good information in this thread, but I worry that the discussion about the MultiStorage class in the piggybank may give the impression that it is needed on the HDPCD exam. The MultiStorage class is not a part of the exam objectives. For the exam, you need to know how to use the PARALLEL clause, which, if used at the right point in a Pig script, determines the number of output files. So to summarize: the HDPCD exam does not require the use of MultiStorage, but may require the use of PARALLEL.
						
					
03-16-2016 02:42 PM
3 Kudos

Sorting in MR applies to two areas:

1. Sort output by keys: this is done "naturally" in the sense that the keys are sorted as they come into the reducer. The compareTo method in the key class determines this natural sorting.
2. Secondary sort: both the keys and values are sorted. That involves writing a grouping comparator class and then registering that class with the MR Job using the setGroupingComparatorClass method.

The exam objective you listed above is referring to both. The first one is fairly straightforward - you implement the compareTo method in your key class. The secondary sort involves a bit more work. There is a nice blog here that has an example of how to implement a secondary sort: https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
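
To make the two ideas concrete, here is a minimal sketch in plain Java that only simulates what the shuffle does; it is not Hadoop code. The `CompositeKey` class, its `symbol` and `price` fields, and the `simulateShuffle` helper are hypothetical names invented for the illustration. In a real job the key would implement `WritableComparable`, the grouping comparator would typically extend `WritableComparator`, and it would be registered with `job.setGroupingComparatorClass(...)`; a partitioner on the natural key is usually needed as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {

    // Hypothetical composite key: the natural key (symbol) plus the value to sort by (price).
    static final class CompositeKey implements Comparable<CompositeKey> {
        final String symbol;
        final int price;

        CompositeKey(String symbol, int price) {
            this.symbol = symbol;
            this.price = price;
        }

        // "Natural" sort: by symbol first, then by price within the same symbol.
        @Override
        public int compareTo(CompositeKey other) {
            int bySymbol = symbol.compareTo(other.symbol);
            return bySymbol != 0 ? bySymbol : Integer.compare(price, other.price);
        }
    }

    // Grouping comparator: keys with the same symbol belong to the same reduce call.
    static final Comparator<CompositeKey> GROUPING =
            (a, b) -> a.symbol.compareTo(b.symbol);

    // Sorts with compareTo, then starts a new group whenever GROUPING sees a different key.
    static Map<String, List<Integer>> simulateShuffle(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        List<Integer> current = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUPING.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key.symbol, current);
            }
            current.add(key.price);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
                new CompositeKey("b", 3), new CompositeKey("a", 9),
                new CompositeKey("b", 1), new CompositeKey("a", 2));
        System.out.println(simulateShuffle(keys)); // prints {a=[2, 9], b=[1, 3]}
    }
}
```

The point of the split is that compareTo orders the full composite key, while the grouping comparator looks only at the natural key, so each group receives its values already sorted.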
						
					
03-16-2016 12:07 PM
1 Kudo

@Sanjay Ungarala - thank you for pointing this out. We decided a while back to remove this objective from the HDPCD:Java exam, but the website was not changed accordingly. I just updated the website and removed that objective.
LocalResource is used heavily in YARN, but it is beyond the scope of our Java exam.
						
					
03-15-2016 11:45 AM
1 Kudo

You are given access to docs.hortonworks.com on the real exam, and that is the only website you can access. The practice exam is meant to give you an idea of what the real exam tasks are like and to let you become familiar with the environment of the real exam. Working through the practice exam alone is not enough to pass the real exam, though. As I said, to be fully prepared, you should be able to perform all of the tasks listed in the exam objectives on our website.
						
					
03-12-2016 02:52 PM
1 Kudo

@Mahesh Deshmukh I'm not sure what your need is, but null values should be filtered first. The general rule of thumb in Pig is to "filter early and often" to minimize the amount of data that gets shuffled and sorted, so filter before the FOREACH. For example:

    a = load 'something';            -- any Pig relation
    b = filter a by $1 is not null;  -- filter out tuples where the $1 field is null
    c = foreach b generate ...       -- no need to worry about $1 being null

The term "empty" typically refers to bags, and in particular you can use the IsEmpty function to check whether a bag is empty. You normally do this after a GROUP command:

    a = load 'something';            -- any Pig relation
    b = group a by $3;
    c = filter b by not IsEmpty(a);  -- after GROUP, the bag of grouped tuples is named a

What are you trying to accomplish?
						
					
03-11-2016 07:56 PM
3 Kudos

Sure - the following simple script uses 3 reducers on the last operation, so there will be 3 output files:

    a = load 'something';
    b = order a by $1 parallel 3;
    store b into 'somewhere';

PARALLEL is not an option on STORE, but it is an option on a lot of other Pig operations.
						
					
03-11-2016 12:13 PM
3 Kudos

In case someone is searching for this with regard to the Hortonworks Certified Developer exam, the question was also asked here: https://community.hortonworks.com/questions/22439/where-do-i-get-references-for-piggybank.html

Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the purposes of the certification exam, using PARALLEL is all you need to accomplish this task, and it is much simpler than trying to register the piggybank and use a special output class.
						
					
03-11-2016 12:09 PM
2 Kudos

Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the purposes of the practice exam and the real exam, using PARALLEL is all you need to accomplish this task.
						
					