Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3115 | 08-25-2017 03:09 PM |
| | 1965 | 08-22-2017 06:52 PM |
| | 3406 | 08-09-2017 01:10 PM |
| | 8076 | 08-04-2017 02:34 PM |
| | 8120 | 08-01-2017 11:35 AM |
05-18-2017
02:56 PM
You are living dangerously when you get to 80% disk usage. Batch jobs write intermediate data to local non-HDFS disk (MapReduce writes a lot of data to local disk, Tez less so), and that temp data can approach or exceed 20% of available disk, depending of course on the jobs you are running. Also, if you are on physical servers (vs. cloud), you need lead time to provision, rack, and stack new data nodes to scale out, and you will likely continue to ingest new data during that lead time.

It is a good practice to set the threshold at 70% and have a plan in place for when you reach it. (If you are ingesting large volumes on a scheduled basis, you may want to go lower.)

Another good practice is to compress data that you rarely process using non-splittable codecs (you can decompress it on the rare occasions you need it), and possibly also compress data that is still processed regularly using splittable codecs. Automating compression is desirable. Compression is a bit of an involved topic; this is a useful first reference: http://www.dummies.com/programming/big-data/hadoop/compressing-data-in-hadoop/

For the cluster you are referencing, I would compress or delete data and add more data nodes ASAP.
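For the "rarely processed" case, here is a minimal Groovy sketch of gzip-compressing a single HDFS file through the Hadoop API (assuming the Hadoop client jars and cluster config files are on the classpath; the paths are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.util.ReflectionUtils

def conf  = new Configuration()                              // picks up core-site.xml / hdfs-site.xml from the classpath
def fs    = FileSystem.get(conf)
def codec = ReflectionUtils.newInstance(GzipCodec, conf)

def src = new Path('/data/cold/part-00000')                  // placeholder: a rarely used file
def dst = new Path(src.toString() + codec.defaultExtension)  // appends ".gz"

def input  = fs.open(src)
def output = codec.createOutputStream(fs.create(dst))
IOUtils.copyBytes(input, output, conf, true)                 // copies and closes both streams
// fs.delete(src, false)                                     // optionally remove the uncompressed original

Wrapping something like this in a scheduled job is one way to automate the compression of cold data.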
05-16-2017
12:01 PM
To operate on one line at a time, put a SplitText processor before the ExecuteScript processor (this will feed your script single lines) and a MergeContent processor after the ExecuteScript (to append the emitted lines back into one flow file). In ExecuteScript, the code should be something like:

def output = ""
cells.each { it ->
    output = output + it + "\t" // do something with each cell
}
output = output + path + "\n"

If you need to know which cell you are on, you can use a counter like def i = 0 and increment it in the loop.
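A minimal sketch of that counter variant, using Groovy's eachWithIndex (the special handling of the third cell is just an illustrative placeholder; cells and path are assumed to be defined as above):

def output = ""
cells.eachWithIndex { cell, i ->        // i is the zero-based position of the cell in the line
    if (i == 2) {
        output += cell.trim() + "\t"    // placeholder: treat the third cell differently
    } else {
        output += cell + "\t"
    }
}
output += path + "\n"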
05-16-2017
12:54 AM
1 Kudo
The file path in the GetFile configuration refers to the local file system of the machine where the NiFi instance is installed. Your screenshot shows /tmp/nifi/input on HDFS, not on the local OS. Please create /tmp/nifi/input on the Linux OS where NiFi is installed and place your data there; your GetFile processor, as configured, will find it. (Note: if you wanted to retrieve a file from HDFS, you would use the GetHDFS processor: https://nifi.apache.org/docs.html. Part of its configuration points to local copies of the Hadoop config files pulled from the cluster, which the processor uses to connect to HDFS. See the link for details.)
05-15-2017
01:45 PM
From the Hive doc (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli), under Logging:

Hive uses log4j for logging. These logs are not emitted to the standard output by default but are instead captured to a log file specified by Hive's log4j properties file. By default Hive will use hive-log4j.default in the conf/ directory of the Hive installation, which writes out logs to /tmp/<userid>/hive.log and uses the WARN level. It is often desirable to emit the logs to the standard output and/or change the logging level for debugging purposes. These can be done from the command line as follows:

$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console

hive.root.logger specifies the logging level as well as the log destination. Specifying console as the target sends the logs to the standard error (instead of the log file). See Hive Logging in Getting Started for more information.
05-15-2017
01:06 PM
1 Kudo
Use this to pull the flow file attributes into Groovy:

def path = flowFile.getAttribute('path')
def filename = flowFile.getAttribute('filename')

After that it is pure Groovy string manipulation to add columns, remove values, etc. Note that when you tokenize a record you get a List where each field is indexed (e.g. a[0], a[1], etc.). To add a field you would use a.add(path). After adding new fields or manipulating old ones, you have to reconstruct the string as a delimited record. You then write it to the OutputStream, catch errors, and route the session to failure or success. This code is similar to what you would do. (It emits each record as a flow file; if you wanted to emit the full recordset, you would concatenate the records into one string with a newline at the end of each record except the last.)

import org.apache.commons.io.IOUtils
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.*

def flowFile = session.get()
if (!flowFile) return

def path = flowFile.getAttribute('path')
def fail = false

flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        // read the incoming record and split it into fields
        def recordIn = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def cells = recordIn.split(',')

        // rebuild the record, inserting the path attribute as a new field
        def recordOut = cells[0] + ',' +
                        cells[1] + ',' + // you could validate this or any field
                        cells[2] + ',' +
                        path     + ',' +
                        cells[3] + ',' +
                        cells[4] + ',' +
                        cells[5] + ',' +
                        cells[6] + ',' +
                        cells[7]

        outputStream.write(recordOut.getBytes(StandardCharsets.UTF_8))
    }
    catch (e) {
        log.error("Error during processing of validate.groovy", e)
        fail = true
    }
} as StreamCallback)

if (fail) {
    session.transfer(flowFile, REL_FAILURE)
} else {
    session.transfer(flowFile, REL_SUCCESS)
}
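A rough sketch of the full-recordset variant mentioned above, reusing the same session/flowFile/path/fail setup (it assumes the incoming flow file holds one comma-delimited record per line; recordsOut is just an illustrative name):

flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def recordsOut = []
        text.split('\n').each { line ->
            def cells = line.split(',')
            // same per-record rebuild as above, inserting path as a new fourth field
            recordsOut << ([cells[0], cells[1], cells[2], path] + cells[3..-1]).join(',')
        }
        // newline at the end of each record except the last
        outputStream.write(recordsOut.join('\n').getBytes(StandardCharsets.UTF_8))
    }
    catch (e) {
        log.error("Error during processing of validate.groovy", e)
        fail = true
    }
} as StreamCallback)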
05-14-2017
01:38 PM
You have defined the first field as an int. Your data rows contain ints, so you see them. But your header row is a chararray, and Pig handles the failed cast (string to int) by simply returning an empty value. If you use Piggybank, you can skip the header: http://stackoverflow.com/questions/29335656/hadoop-pig-removing-csv-header
05-12-2017
02:38 PM
Hi @Bin Ye, keep posting (questions, answers, articles) and sharing your experience ... everyone in the community benefits 🙂
05-12-2017
01:22 PM
1 Kudo
@Bin Ye If you found the answer useful, please accept or upvote ... that is how the community works 🙂