Member since: 06-20-2016 | Posts: 488 | Kudos Received: 433 | Solutions: 118
11-05-2016
01:57 AM
6 Kudos
Overview

In this article I show how to very quickly build custom logging against source log files. Adding new custom logging to the flow takes only seconds: each new custom log is copy-pasted from an existing one, and 3 configuration properties are changed. The result is a single flow with many diverse custom logs against your system.

Use Cases

Use cases include:
- Application development: permanent or throwaway logging focused on debugging software issues or NiFi flows.
- Production: logging for focused metrics gathering, auditing, faster troubleshooting, and alerting.
- All environments: prefiltering Splunk ingests to save on cost, and filtering out unneeded log data.

Use your imagination … how can this flow pattern make your life better / more effective?
Flow Design

Single customized log

A single customized log is achieved by tailing a log file, splitting the lines of text, extracting lines that match a regex, specifying a custom log filename for the regex-extracted lines, and appending the lines to that custom log file. The diagram below shows more details, which will be elaborated even more later in the article.

For example, you could extract all lines that are WARN and write them to their own log file, or all lines with a specific Java class, or all lines with a specific processor name or id … or a particular operation, or user, or something more complex.
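Conceptually, a single custom log behaves like the following shell sketch (the source log path, the WARN pattern, and the output path are hypothetical; NiFi adds the flow-level benefits such as provenance, back pressure, and restart handling):

# follow the source log, keep only the lines matching the regex, append them to a custom log file
tail -F /var/log/nifi/nifi-app.log | grep --line-buffered 'WARN' >> /var/log/custom/nifi-warn.log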
Multiple customized logs in same flow

The above shows only one custom logging flow. You can build many custom logs against the same source log file (e.g. one for WARN, one for ERROR, one for a pid, and another for a username … whatever). You can also build many custom log files against multiple source logs. The diagram below shows multiple customized logs in the same flow.

The above flow will create 4 separate custom log files … all written out by the same ExecuteScript.

Add a new custom log: Good ol' copy-paste

Note in the above that:
- each custom log flow is identical except for 3 configuration property changes (if you are adding custom logging against the same source log, only 2 configuration properties are different)
- all custom logging reuses ExecuteScript to append to a custom log file: ExecuteScript knows which custom file to append to because each FlowFile it receives has the matched text and its associated custom filename

Because all custom log flows are identical except for 3 configs, it is super-fast to create a new flow. Just:
- copy an existing custom log flow (TailFile through UpdateAttribute) by SHIFT-clicking each processor and connection, or lassoing them with a SHIFT-mouse drag
- paste on the palette and connect to ExecuteScript
- change the 3 configs (TailFile > File to Tail; ExtractText > regex; UpdateAttribute > customfilename)

Technical note: when you paste to the palette, each processor and connection retains the same name but is assigned a new uuid, thereby guaranteeing a new instance of each.

Tidying up

Because you will likely add many custom log flows to the same flow, and each source log file may split into multiple custom log flows (e.g. source log file 2 in the diagram above), your palette may get a bit busy. Dividing each custom log flow into the same logical process groups helps manage the palette better. It also makes copy-pasting new flows and deleting other custom log flows easier. Here is what it will look like.

Now you have your choice of managing entire process groups (copy-paste, delete) or the processors inside of them (copy-paste, delete), thus giving you more power to quickly add or remove custom logging in your flow:
- To add a custom log to an existing TailFile-SplitText: open the process group, copy-paste ExtractText-UpdateAttribute, then make a config change to each processor.
- To add a custom log against a new source log: copy-paste any two process groups in the subflow, make 1 config change in the first (TailFile), and make the necessary changes in the second process group (add/delete custom log flows, 2 config changes for each custom log).
Implementation specifics

The config changes for each new custom log file

The configs that do not change for each new custom log

The groovy script

import org.apache.commons.io.IOUtils
import java.nio.charset.*

// get the next FlowFile; if there is none, end this run of the script
def flowFile = session.get()
if(!flowFile) return

// 'customfilename' is the ExecuteScript dynamic property whose Expression Language
// is evaluated against the incoming FlowFile to resolve the target custom log filename
filename = customfilename.evaluateAttributeExpressions(flowFile).value
f = new File("/Users/gkeys/DEV/staging/${filename}")

// read the FlowFile content and append it to the custom log file
// (nothing is written to outputStream, so the outgoing FlowFile content will be empty)
flowFile = session.write(flowFile, {inputStream, outputStream ->
    try {
        f.append(IOUtils.toString(inputStream, StandardCharsets.UTF_8)+'\n')
    }
    catch(e) {
        log.error("Error during processing custom logging: ${filename}", e)
    }
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)

Note that the jar for org.apache.commons.io.IOUtils is placed in the Module Directory as set in ExecuteScript.
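As a minimal sketch of that setup (the jar version and directory below are hypothetical; use whatever location suits your install, then point ExecuteScript's Module Directory property at it):

# put the commons-io jar somewhere NiFi can read
mkdir -p /opt/nifi/script-libs
cp commons-io-2.4.jar /opt/nifi/script-libs/
# then set ExecuteScript > Module Directory to /opt/nifi/script-libs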
Summary

That's it. You can build custom logging from the basic flow shown in the first diagram, and then quickly add new ones in just a few seconds. Because they are so easy and fast to build, you could easily build them as throwaways used only during the development of a piece of code or a project. On the other hand, this is NiFi, a first-class enterprise technology for data in motion. Surely custom logging has a place in your production environments.

Extensions

The following could easily be modified or extended:
- For the given Groovy code, it is easy to build rolling logic for your custom log files (it would be applied to all custom logs).
- Instead of appending to a local file, you could stream to a Hive table (see link below).
- You could build NiFi alerts against the outputted custom log files (probably best as another flow that tails the custom log and responds to its content).

References

http://hortonworks.com/apache/nifi/
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
https://community.hortonworks.com/articles/60868/enterprise-nifi-implementing-reusable-components-a.html
https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html
11-04-2016
12:18 PM
@swathi thukkaraju I have made a minor correction to my code example above. Each input param in sqoop_job.sh must be ${1}, ${2}, etc. instead of $1, $2, etc. So, use sqoop job --create incjob12 --import ${1} --table ${2} --incremental lastmodified -check-column ${3} --target-dir ${4} -m 1 --merge-key id and be sure to pass in "$MYSQL" in quotes because the value has spaces in it. (I just tested this and it works.)
11-03-2016
01:01 PM
@Sundar Lakshmanan Glad it helped. Happy Hadooping!
11-02-2016
01:17 PM
1 Kudo
You can do this by running cron-type scheduling in Oozie and setting the day-of-week field to 2-6. These posts show exactly how to do that: http://hortonworks.com/blog/new-in-hdp-2-more-powerful-scheduling-options-in-oozie/ https://community.hortonworks.com/questions/1295/how-to-schedule-an-oozie-job-to-run-at-8pm-edt.html
11-01-2016
06:53 PM
[answer #2] In this case I would put your sqoop command in a shell script and then pass parameters to that. It is similar to the question above: everyone uses the same shell script but passes different parameters to it, which are picked up by the sqoop command text in the script.

Example
Script name: sqoop_job.sh
To run the script: ./sqoop_job.sh "$MYSQL" st1 ts sqi
Script body: sqoop job --create incjob12 --import ${1} --table ${2} --incremental lastmodified -check-column ${3} --target-dir ${4} -m 1 --merge-key id (see the sketch below)

For more on shell scripting and passing parameters: http://linuxcommand.org/writing_shell_scripts.php
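A self-contained sketch of this approach, using only the values already shown in this thread (job name incjob12, table st1, check column ts, target dir sqi, and the $MYSQL connection options from the other answer in this thread):

#!/bin/bash
# sqoop_job.sh -- shared wrapper script; each user passes their own parameters
# usage: ./sqoop_job.sh "<connection options>" <table> <check-column> <target-dir>
# ${1} is left unquoted below so the connection options split back into separate arguments
sqoop job --create incjob12 --import ${1} --table ${2} --incremental lastmodified -check-column ${3} --target-dir ${4} -m 1 --merge-key id

Invocation would then look like:

export MYSQL="--connect jdbc:mysql://localhost/test --driver com.mysql.jdbc.Driver --username it1 --password hadoop"
./sqoop_job.sh "$MYSQL" st1 ts sqi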
11-01-2016
12:28 PM
I am not sure exactly what your requirements are, but here goes...

1. You can set these parameters as OS environment variables for each of your db connections. For example, MYSQL_CONN=xx, MYSQL_DRIVER=xx, MYSQL_UU=xx, MYSQL_PWD=xx, MYSQL_TARGDIR=xx, ORA_CONN=xx, ORA_DRIVER=xx, ORA_UU=xx, etc. Set these by using the export command, e.g. export MYSQL_CONN=xx. Then you simply call the db params you want on the command line, e.g. sqoop job --create incjob12 --import --connect $MYSQL_CONN --driver $MYSQL_DRIVER --username $MYSQL_UU --password $MYSQL_PWD ...

2. You can do the same thing but with all params related to a db set as a single OS environment variable, e.g. set MYSQL="--connect jdbc:mysql://localhost/test --driver com.mysql.jdbc.Driver --username it1 --password hadoop" and then run your sqoop job as sqoop job --create incjob12 --import "$MYSQL" --table st1 --incremental lastmodified -check-column ts --target-dir sqin -m 1 --merge-key id. Note the quotes in both setting and invoking the OS environment variable MYSQL. This is because there are spaces in the value.

If I am not understanding your requirements, let me know.
10-31-2016
09:53 PM
5 Kudos
This article will show you how to set variables per environment, including sensitive values, so you can promote a single change-managed template of your full flow from one environment to another without changing the contents of the template. It also shows how to build reusable components. (The SDLC model is at the bottom of the article.) https://community.hortonworks.com/articles/60868/enterprise-nifi-implementing-reusable-components-a.html The process is:
- In flow configurations, sensitive values should be configured as Expression Language references to OS environment variables that you set in each environment, e.g. ${MYSQL_PSSWRD} (a small sketch of this follows at the end of this post). Other environment-specific config values should similarly use Expression Language references; if these are not sensitive, they should be in a custom properties file.
- Developers finalize their flows and submit the template of the flow to version control, e.g. Git (and also submit custom property files).
- Template and custom property files are promoted to each environment just as source code typically is.
- Automation: deploying templates to environments can be done via the NiFi REST API integrated with other automation tools.
- Governance bodies decide which configurations can be changed in real time (e.g. ControlRate properties). These changes do not need to go through version control and can be made by authorized admins on the fly.

The article elaborates on Expression Language, OS environment variables, and custom property files. If you have specific questions stemming from the article, please continue as comments in this thread.

Note: To deal with all flows as one deployable unit, you can make them all part of a single process group and change-manage that as a single template. Otherwise you can manage each flow as a separate template. Or you can do something in between: form process groups as logical groups (e.g. team, project, product, line of business) and use these as deployable units.
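As a small sketch of the first step (the password value and NiFi path are placeholders; the variable just has to be present in the environment that launches NiFi so an Expression Language reference like ${MYSQL_PSSWRD} can resolve it):

# set the environment-specific secret, then (re)start NiFi from that environment
export MYSQL_PSSWRD='the-password-for-this-environment'
/opt/nifi/bin/nifi.sh restart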
10-31-2016
01:44 PM
2 Kudos
See the following on creating ORC tables (it is mostly a matter of using "stored as ORC"): http://hadooptutorial.info/hive-table-creation-commands/#Example_3_8211_External_Table_with_ORC_FileFomat_Snappy_Compressed https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL To use Sqoop to import into a Hive ORC table, see: https://community.hortonworks.com/questions/28060/can-sqoop-be-used-to-directly-import-data-into-an.html
10-29-2016
07:25 PM
1 Kudo
Pig runs MapReduce under the covers, and this list of files is the output of a MapReduce job. You should also notice a 0-byte (no contents) file named _SUCCESS at the top of the list. That is just a flag saying the job was a success.

Bottom line: when you point your job or table to the parent directory holding these files, it simply sees the union of all the files together. So you can think of the parent directory logically as the "file" holding the data. Thus, there is never a need to concatenate the files on Hadoop -- just point to the parent directory and treat it as the file. So if you make a Hive table, just point to the parent directory. If you load the data into a Pig script, just point to the parent directory. Etc.

If you want to pull the data to an edge node, use the command hdfs dfs -getmerge <hdfsParentDir> <localPathAndName> and it will combine all of the m-001, m-002 ... files into a single file (see the example below). If you want to pull it to your local machine, use Ambari File Views, open the parent directory, click "+ Select All" and then click "Concatenate". That will concatenate all into one file and download it from your browser.

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps.
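For example (the HDFS parent directory and local filename here are hypothetical):

# combine all part files under the job's output directory into one local file on the edge node
hdfs dfs -getmerge /user/me/pig-output /tmp/pig-output-merged.txt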
10-28-2016
06:43 PM
4 Kudos
You can. Please see this excellent answer for importing ORC https://community.hortonworks.com/questions/28060/can-sqoop-be-used-to-directly-import-data-into-an.html and this one for exporting ORC https://hadoopist.wordpress.com/2015/09/09/how-to-export-a-hive-orc-table-to-a-oracle-database/