Member since 06-20-2016

488 Posts | 433 Kudos Received | 118 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3600 | 08-25-2017 03:09 PM |
| | 2501 | 08-22-2017 06:52 PM |
| | 4188 | 08-09-2017 01:10 PM |
| | 8969 | 08-04-2017 02:34 PM |
| | 8946 | 08-01-2017 11:35 AM |

05-18-2017 02:56 PM

You are living dangerously when you get to 80% disk usage. Batch jobs write intermediate data to local non-HDFS disk (MapReduce writes a lot of data to local disk, Tez less so), and that temp data can approach or exceed 20% of available disk, depending of course on the jobs you are running. Also, if you are on physical servers (vs. cloud), you need the lead time to provision, rack, and stack new data nodes to scale out, and you will likely continue to ingest new data during that lead time.

It is a good practice to set the threshold at 70% and have a plan in place for when you reach it. (If you are ingesting large volumes on a scheduled basis, you may want to go lower.) Another good practice is to compress data that you rarely process using non-splittable codecs (you can decompress on the rare occasions you need it), and possibly other data that is still processed, using splittable codecs. Automating compression is desirable. Compression is a bit of an involved topic; this is a useful first reference: http://www.dummies.com/programming/big-data/hadoop/compressing-data-in-hadoop/

In the cluster you are referencing, I would compress or delete data and add more data nodes ASAP.
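
If you want to automate the compression piece, here is a minimal Groovy sketch that gzips a single HDFS file through the Hadoop compression API (assuming the Hadoop client jars and cluster configuration are on the classpath; the file path is only an example):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.IOUtils
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.util.ReflectionUtils

    def conf  = new Configuration()      // picks up core-site.xml / hdfs-site.xml from the classpath
    def fs    = FileSystem.get(conf)
    def codec = ReflectionUtils.newInstance(GzipCodec, conf)

    def src = new Path('/data/archive/part-00000')                      // example path only
    def dst = new Path(src.toString() + codec.getDefaultExtension())    // appends ".gz"

    // stream the uncompressed file through the codec into a new compressed file
    def input  = fs.open(src)
    def output = codec.createOutputStream(fs.create(dst))
    IOUtils.copyBytes(input, output, conf, true)    // 'true' closes both streams
    // fs.delete(src, false)                        // optionally remove the original afterwards

A scheduled job that walks your cold directories and runs something like this is usually enough to keep rarely used data compressed.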
						
					
05-16-2017 12:01 PM

To operate on one line at a time, put a SplitText processor before ExecuteScript (this will feed your script single lines) and a MergeContent processor after ExecuteScript (to append the emitted lines back into one flow file). In ExecuteScript, the code should be something like:

    // 'cells' is the tokenized input line; 'path' is a flow file attribute
    def output = ""
    cells.each { it ->
        output = output + it + "\t"   // do something with each cell
    }
    output = output + path + "\n"

If you need to know which cell you are on, you can use a counter: declare def i = 0 before the loop and increment it inside. A fuller sketch of how this fragment fits into an ExecuteScript body follows below.
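
A minimal sketch of the full script, assuming the standard ExecuteScript Groovy bindings (session, log, REL_SUCCESS, REL_FAILURE) and tab-delimited input; the path attribute and delimiter are only examples:

    import org.apache.commons.io.IOUtils
    import java.nio.charset.StandardCharsets
    import org.apache.nifi.processor.io.StreamCallback

    def flowFile = session.get()
    if (!flowFile) return
    def path = flowFile.getAttribute('path')

    flowFile = session.write(flowFile, { inputStream, outputStream ->
        // SplitText upstream means each flow file holds a single line
        def line  = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def cells = line.tokenize('\t')
        def output = ""
        cells.each { output = output + it + "\t" }   // do something with each cell
        output = output + path + "\n"
        outputStream.write(output.getBytes(StandardCharsets.UTF_8))
    } as StreamCallback)

    session.transfer(flowFile, REL_SUCCESS)

Downstream, MergeContent then bins the single-line flow files back together into one file.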
						
					
05-16-2017 12:54 AM
1 Kudo

The file path in the GetFile configuration refers to the local file system of the machine where the NiFi instance is installed. Your screenshot showing /tmp/nifi/input is on HDFS, not the local OS. Create /tmp/nifi/input on the Linux OS where NiFi is installed and place your data there; your GetFile processor, as configured, will find it. (Note: if you wanted to retrieve a file from HDFS, you would use the GetHDFS processor: https://nifi.apache.org/docs.html. Part of its configuration points to a local copy of the Hadoop config files pulled from the cluster, which the processor uses to connect to HDFS. See the link for details.)
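
Roughly, the two configurations look like this (property names as shown in the processor settings; the directories and config file locations are only examples):

    GetFile  (reads from the local OS on the NiFi host)
        Input Directory: /tmp/nifi/input

    GetHDFS  (reads from HDFS)
        Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
        Directory: /tmp/nifi/input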
						
					
05-15-2017 01:45 PM

From the Hive docs (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli):

Logging

Hive uses log4j for logging. These logs are not emitted to the standard output by default but are instead captured to a log file specified by Hive's log4j properties file. By default Hive will use hive-log4j.default in the conf/ directory of the Hive installation, which writes out logs to /tmp/<userid>/hive.log and uses the WARN level.

It is often desirable to emit the logs to the standard output and/or change the logging level for debugging purposes. These can be done from the command line as follows:

    $HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console

hive.root.logger specifies the logging level as well as the log destination. Specifying console as the target sends the logs to the standard error (instead of the log file). See Hive Logging in Getting Started for more information.
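
The same switch also changes the level; for example, to get DEBUG output on the console for a single session:

    $HIVE_HOME/bin/hive --hiveconf hive.root.logger=DEBUG,console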
						
					
05-15-2017 01:06 PM
1 Kudo

Use this to pull the flow file attributes into Groovy:

    def path = flowFile.getAttribute('path')
    def filename = flowFile.getAttribute('filename')

After that it is pure Groovy string manipulation to add columns, remove values, etc. Note that when you tokenize a line you get a List where each field is indexed (e.g. a[0], a[1], etc.); to add a field you would use a.add(path). After adding new fields or manipulating old ones you have to reconstruct the string as a delimited record. You then write it to the OutputStream, catch errors, and route the session to failure or success.

This code is similar to what you would do. (It emits each record as a flow file; if you wanted to emit the full record set, you would concatenate the records into one string with a newline after each record except the last.)

    import org.apache.commons.io.IOUtils
    import java.nio.charset.*
    import org.apache.nifi.processor.io.StreamCallback

    def flowFile = session.get()
    if (!flowFile) return
    def path = flowFile.getAttribute('path')
    def fail = false
    flowFile = session.write(flowFile, { inputStream, outputStream ->
        try {
            def recordIn = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            def cells = recordIn.split(',')
            // rebuild the record, inserting the 'path' attribute as a new fourth column
            def recordOut = cells[0] + ',' +
                cells[1] + ',' +          // you could validate this or any field
                cells[2] + ',' +
                path + ',' +
                cells[3] + ',' +
                cells[4] + ',' +
                cells[5] + ',' +
                cells[6] + ',' +
                cells[7]
            outputStream.write(recordOut.getBytes(StandardCharsets.UTF_8))
        }
        catch (e) {
            log.error("Error during processing of validate.groovy", e)
            fail = true          // route the flow file to failure after the callback completes
        }
    } as StreamCallback)
    if (fail) {
        session.transfer(flowFile, REL_FAILURE)
    } else {
        session.transfer(flowFile, REL_SUCCESS)
    }
						
					
05-14-2017 01:38 PM

You have defined the first field as an int. Your data rows are ints, so you see them, but the header row is a chararray, and Pig handles the resulting casting error (string to int) by simply returning an empty value. If you use Piggybank, you can skip the header: http://stackoverflow.com/questions/29335656/hadoop-pig-removing-csv-header
						
					
05-12-2017 02:38 PM

Hi @Bin Ye. Keep posting (questions, answers, articles) and sharing your experience ... everyone in the community benefits 🙂
						
					
05-12-2017 01:22 PM
1 Kudo

@Bin Ye If you found the answer useful, please accept or upvote ... that is how the community works 🙂
						
					