Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3115 | 08-25-2017 03:09 PM |
| | 1965 | 08-22-2017 06:52 PM |
| | 3406 | 08-09-2017 01:10 PM |
| | 8076 | 08-04-2017 02:34 PM |
| | 8120 | 08-01-2017 11:35 AM |
05-18-2017
02:56 PM
You are living dangerously when you get to 80% disk usage. Batch jobs write intermediate data to local non-HDFS disk (MapReduce writes a lot of data to local disk, Tez less so), and that temp data can approach or exceed 20% of available disk, depending of course on the jobs you are running. Also, if you are on physical servers (vs. cloud), you need lead time to provision, rack, and stack new data nodes to scale out, and you will likely continue to ingest new data during that lead time.

It is a good practice to set the threshold at 70% and have a plan in place for when you reach it. (If you are ingesting large volumes on a scheduled basis, you may want to go lower.)

Another good practice is to compress data that you rarely process using non-splittable codecs (you can decompress it on the rare occasions you need it), and possibly also compress data that is still processed regularly using splittable codecs. Automating compression is desirable. Compression is a bit of an involved topic; this is a useful first reference: http://www.dummies.com/programming/big-data/hadoop/compressing-data-in-hadoop/

For the cluster you are referencing, I would compress or delete data and add more data nodes ASAP.
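For the "rarely processed" case, here is a minimal Groovy sketch of gzip-compressing a single HDFS file through the Hadoop API (assuming the Hadoop client jars and cluster config files are on the classpath; the paths are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.util.ReflectionUtils

def conf  = new Configuration()                              // picks up core-site.xml / hdfs-site.xml from the classpath
def fs    = FileSystem.get(conf)
def codec = ReflectionUtils.newInstance(GzipCodec, conf)

def src = new Path('/data/cold/part-00000')                  // placeholder: a rarely used file
def dst = new Path(src.toString() + codec.defaultExtension)  // appends ".gz"

def input  = fs.open(src)
def output = codec.createOutputStream(fs.create(dst))
IOUtils.copyBytes(input, output, conf, true)                 // copies and closes both streams
// fs.delete(src, false)                                     // optionally remove the uncompressed original

Wrapping something like this in a scheduled job is one way to automate the compression of cold data.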
05-16-2017
12:01 PM
To operate on one line at a time, put a SplitText processor before the ExecuteScript processor (this will feed your script single lines) and a MergeContent processor after the ExecuteScript (to append the emitted lines back into one flow file). In ExecuteScript, the code should be something like:

def output = ""
cells.each { it ->
    output = output + it + "\t" // do something with each cell
}
output = output + path + "\n"

If you need to know which cell you are on, you can use a counter like def i = 0 and increment it in the loop.
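A minimal sketch of that counter variant, using Groovy's eachWithIndex (the special handling of the third cell is just an illustrative placeholder; cells and path are assumed to be defined as above):

def output = ""
cells.eachWithIndex { cell, i ->        // i is the zero-based position of the cell in the line
    if (i == 2) {
        output += cell.trim() + "\t"    // placeholder: treat the third cell differently
    } else {
        output += cell + "\t"
    }
}
output += path + "\n"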
05-16-2017
12:54 AM
1 Kudo
The file path in the GetFile configuration refers to the local file system of the machine where the NiFi instance is installed. Your screenshot shows /tmp/nifi/input on HDFS, not on the local OS. Please create /tmp/nifi/input on the Linux OS where NiFi is installed and place your data there; your GetFile processor, as configured, will find it. (Note: if you wanted to retrieve a file from HDFS, you would use the GetHDFS processor: https://nifi.apache.org/docs.html. Part of its configuration points to local copies of the Hadoop config files pulled from the cluster, which the processor uses to connect to HDFS. See the link for details.)
05-15-2017
01:45 PM
From the Hive doc (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli), under Logging:

Hive uses log4j for logging. These logs are not emitted to the standard output by default but are instead captured to a log file specified by Hive's log4j properties file. By default Hive will use hive-log4j.default in the conf/ directory of the Hive installation, which writes out logs to /tmp/<userid>/hive.log and uses the WARN level. It is often desirable to emit the logs to the standard output and/or change the logging level for debugging purposes. These can be done from the command line as follows:

$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console

hive.root.logger specifies the logging level as well as the log destination. Specifying console as the target sends the logs to the standard error (instead of the log file). See Hive Logging in Getting Started for more information.
05-15-2017
01:06 PM
1 Kudo
Use this to pull the flow file attributes into Groovy:

def path = flowFile.getAttribute('path')
def filename = flowFile.getAttribute('filename')

After that it is pure Groovy string manipulation to add columns, remove values, etc. Note that when you tokenize a record you get a List where each field is indexed (e.g. a[0], a[1], etc.). To add a field you would use a.add(path). After adding new fields or manipulating old ones, you have to reconstruct the string as a delimited record. You then write it to the OutputStream, catch errors, and route the session to failure or success. This code is similar to what you would do. (It emits each record as a flow file; if you wanted to emit the full recordset, you would concatenate the records into one string with a newline at the end of each record except the last.)

import org.apache.commons.io.IOUtils
import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.*

def flowFile = session.get()
if (!flowFile) return

def path = flowFile.getAttribute('path')
def fail = false

flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        // read the incoming record and split it into fields
        def recordIn = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def cells = recordIn.split(',')

        // rebuild the record, inserting the path attribute as a new field
        def recordOut = cells[0] + ',' +
                        cells[1] + ',' + // you could validate this or any field
                        cells[2] + ',' +
                        path     + ',' +
                        cells[3] + ',' +
                        cells[4] + ',' +
                        cells[5] + ',' +
                        cells[6] + ',' +
                        cells[7]

        outputStream.write(recordOut.getBytes(StandardCharsets.UTF_8))
    }
    catch (e) {
        log.error("Error during processing of validate.groovy", e)
        fail = true
    }
} as StreamCallback)

if (fail) {
    session.transfer(flowFile, REL_FAILURE)
} else {
    session.transfer(flowFile, REL_SUCCESS)
}
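A rough sketch of the full-recordset variant mentioned above, reusing the same session/flowFile/path/fail setup (it assumes the incoming flow file holds one comma-delimited record per line; recordsOut is just an illustrative name):

flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        def text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        def recordsOut = []
        text.split('\n').each { line ->
            def cells = line.split(',')
            // same per-record rebuild as above, inserting path as a new fourth field
            recordsOut << ([cells[0], cells[1], cells[2], path] + cells[3..-1]).join(',')
        }
        // newline at the end of each record except the last
        outputStream.write(recordsOut.join('\n').getBytes(StandardCharsets.UTF_8))
    }
    catch (e) {
        log.error("Error during processing of validate.groovy", e)
        fail = true
    }
} as StreamCallback)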
05-14-2017
01:38 PM
You have defined the first field as an int. Your data rows contain ints, so you see them. But your header row is a chararray, and Pig handles the failed cast (string to int) by simply returning an empty value. If you use Piggybank, you can skip the header: http://stackoverflow.com/questions/29335656/hadoop-pig-removing-csv-header
05-12-2017
02:38 PM
Hi @Bin Ye, keep posting (questions, answers, articles) and sharing your experience ... everyone in the community benefits 🙂
05-12-2017
01:22 PM
1 Kudo
@Bin Ye If you found the answer useful, please accept or upvote ... that is how the community works 🙂