Member since: 06-20-2016 | Posts: 488 | Kudos Received: 433 | Solutions: 118
11-05-2016
01:57 AM
6 Kudos
Overview

In this article I show how to very quickly build custom logging against source log files. Adding new custom logging to the flow takes only seconds: each new custom log is copy-pasted from an existing one, and 3 configuration properties are changed. The result is a single flow with many diverse custom logs against your system.

Use Cases

Use cases include:
- Application development: permanent or throwaway logging focused on debugging software issues or NiFi flows.
- Production: logging for focused metrics gathering, auditing, faster troubleshooting, and alerting.
- All environments: prefiltering Splunk ingests to save on cost, and filtering out unneeded log data.

Use your imagination … how can this flow pattern make your life better / more effective?
Flow Design

Single customized log

A single customized log is achieved by tailing a log file, splitting the lines of text, extracting lines that match a regex, specifying a custom log filename for the regex-extracted lines, and appending the lines to that custom log file. The diagram below shows more details, which will be elaborated even more later in the article.

For example, you could extract all lines that are WARN and write them to their own log file, or all lines with a specific Java class, or all lines with a specific processor name or id … or a particular operation, or user, or something more complex.
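Conceptually, a single custom log behaves like the following shell sketch (the source log path, the WARN pattern, and the output path are hypothetical; NiFi adds the flow-level benefits such as provenance, back pressure, and restart handling):

# follow the source log, keep only the lines matching the regex, append them to a custom log file
tail -F /var/log/nifi/nifi-app.log | grep --line-buffered 'WARN' >> /var/log/custom/nifi-warn.log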
Multiple customized logs in same flow

The above shows only one custom logging flow. You can build many custom logs against the same source log file (e.g. one for WARN, one for ERROR, one for a pid, and another for a username … whatever). You can also build many custom log files against multiple source logs. The diagram below shows multiple customized logs in the same flow.

The above flow will create 4 separate custom log files … all written out by the same ExecuteScript.

Add a new custom log: Good ol' copy-paste

Note in the above that:
- each custom log flow is identical except for 3 configuration property changes (if you are adding custom logging against the same source log, only 2 configuration properties are different)
- all custom logging reuses ExecuteScript to append to a custom log file: ExecuteScript knows which custom file to append to because each FlowFile it receives has the matched text and its associated custom filename

Because all custom log flows are identical except for 3 configs, it is super-fast to create a new flow. Just:
- copy an existing custom log flow (TailFile through UpdateAttribute) by SHIFT-clicking each processor and connection, or lassoing them with a SHIFT-mouse drag
- paste on the palette and connect to ExecuteScript
- change the 3 configs (TailFile > File to Tail; ExtractText > regex; UpdateAttribute > customfilename)

Technical note: when you paste to the palette, each processor and connection retains the same name but is assigned a new uuid, thereby guaranteeing a new instance of each.

Tidying up

Because you will likely add many custom log flows to the same flow, and each source log file may split into multiple custom log flows (e.g. source log file 2 in the diagram above), your palette may get a bit busy. Dividing each custom log flow into the same logical process groups helps manage the palette better. It also makes copy-pasting new flows and deleting other custom log flows easier. Here is what it will look like.

Now you have your choice of managing entire process groups (copy-paste, delete) or the processors inside of them (copy-paste, delete), thus giving you more power to quickly add or remove custom logging in your flow:
- To add a custom log to an existing TailFile-SplitText: open the process group, copy-paste ExtractText-UpdateAttribute, then make a config change to each processor.
- To add a custom log against a new source log: copy-paste any two process groups in the subflow, make 1 config change in the first (TailFile), and make the necessary changes in the second process group (add/delete custom log flows, 2 config changes for each custom log).
Implementation specifics

The config changes for each new custom log file

The configs that do not change for each new custom log

The groovy script

import org.apache.commons.io.IOUtils
import java.nio.charset.*

// get the next FlowFile; if there is none, end this run of the script
def flowFile = session.get()
if(!flowFile) return

// 'customfilename' is the ExecuteScript dynamic property whose Expression Language
// is evaluated against the incoming FlowFile to resolve the target custom log filename
filename = customfilename.evaluateAttributeExpressions(flowFile).value
f = new File("/Users/gkeys/DEV/staging/${filename}")

// read the FlowFile content and append it to the custom log file
// (nothing is written to outputStream, so the outgoing FlowFile content will be empty)
flowFile = session.write(flowFile, {inputStream, outputStream ->
    try {
        f.append(IOUtils.toString(inputStream, StandardCharsets.UTF_8)+'\n')
    }
    catch(e) {
        log.error("Error during processing custom logging: ${filename}", e)
    }
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)

Note that the jar for org.apache.commons.io.IOUtils is placed in the Module Directory as set in ExecuteScript.
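As a minimal sketch of that setup (the jar version and directory below are hypothetical; use whatever location suits your install, then point ExecuteScript's Module Directory property at it):

# put the commons-io jar somewhere NiFi can read
mkdir -p /opt/nifi/script-libs
cp commons-io-2.4.jar /opt/nifi/script-libs/
# then set ExecuteScript > Module Directory to /opt/nifi/script-libs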
Summary

That's it. You can build custom logging from the basic flow shown in the first diagram, and then quickly add new ones in just a few seconds. Because they are so easy and fast to build, you could easily build them as throwaways used only during the development of a piece of code or a project. On the other hand, this is NiFi, a first-class enterprise technology for data in motion. Surely custom logging has a place in your production environments.

Extensions

The following could easily be modified or extended:
- For the given Groovy code, it is easy to build rolling logic for your custom log files (it would be applied to all custom logs).
- Instead of appending to a local file, you could stream to a Hive table (see link below).
- You could build NiFi alerts against the outputted custom log files (probably best as another flow that tails the custom log and responds to its content).

References

http://hortonworks.com/apache/nifi/
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
https://community.hortonworks.com/articles/60868/enterprise-nifi-implementing-reusable-components-a.html
https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html
11-04-2016
12:18 PM
@swathi thukkaraju I have made a minor correction to my code example above. Each input param in sqoop_job.sh must be ${1}, ${2}, etc. instead of $1, $2, etc. So, use sqoop job --create incjob12 --import ${1} --table ${2} --incremental lastmodified -check-column ${3} --target-dir ${4} -m 1 --merge-key id and be sure to pass in "$MYSQL" in quotes because the value has spaces in it. (I just tested this and it works.)
11-03-2016
01:01 PM
@Sundar Lakshmanan Glad it helped. Happy Hadooping!
11-02-2016
01:17 PM
1 Kudo
You can do this by running cron-type scheduling in Oozie and setting the day-of-week field to 2-6. These posts show exactly how to do that: http://hortonworks.com/blog/new-in-hdp-2-more-powerful-scheduling-options-in-oozie/ https://community.hortonworks.com/questions/1295/how-to-schedule-an-oozie-job-to-run-at-8pm-edt.html
11-01-2016
06:53 PM
[answer #2] In this case I would put your sqoop command in a shell script and then pass parameters to that. It is similar to the question above: everyone uses the same shell script but passes different parameters to it, which are picked up by the sqoop command text in the script.

Example
Script name: sqoop_job.sh
To run the script: ./sqoop_job.sh "$MYSQL" st1 ts sqi
Script body: sqoop job --create incjob12 --import ${1} --table ${2} --incremental lastmodified -check-column ${3} --target-dir ${4} -m 1 --merge-key id (see the sketch below)

For more on shell scripting and passing parameters: http://linuxcommand.org/writing_shell_scripts.php
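A self-contained sketch of this approach, using only the values already shown in this thread (job name incjob12, table st1, check column ts, target dir sqi, and the $MYSQL connection options from the other answer in this thread):

#!/bin/bash
# sqoop_job.sh -- shared wrapper script; each user passes their own parameters
# usage: ./sqoop_job.sh "<connection options>" <table> <check-column> <target-dir>
# ${1} is left unquoted below so the connection options split back into separate arguments
sqoop job --create incjob12 --import ${1} --table ${2} --incremental lastmodified -check-column ${3} --target-dir ${4} -m 1 --merge-key id

Invocation would then look like:

export MYSQL="--connect jdbc:mysql://localhost/test --driver com.mysql.jdbc.Driver --username it1 --password hadoop"
./sqoop_job.sh "$MYSQL" st1 ts sqi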
11-01-2016
12:28 PM
I am not sure exactly what your requirements are, but here goes...

1. You can set these parameters as OS environment variables for each of your db connections. For example, MYSQL_CONN=xx, MYSQL_DRIVER=xx, MYSQL_UU=xx, MYSQL_PWD=xx, MYSQL_TARGDIR=xx, ORA_CONN=xx, ORA_DRIVER=xx, ORA_UU=xx, etc. Set these by using the export command, e.g. export MYSQL_CONN=xx. Then you simply call the db params you want on the command line, e.g. sqoop job --create incjob12 --import --connect $MYSQL_CONN --driver $MYSQL_DRIVER --username $MYSQL_UU --password $MYSQL_PWD ...

2. You can do the same thing but with all params related to a db set as a single OS environment variable, e.g. set MYSQL="--connect jdbc:mysql://localhost/test --driver com.mysql.jdbc.Driver --username it1 --password hadoop" and then run your sqoop job as sqoop job --create incjob12 --import "$MYSQL" --table st1 --incremental lastmodified -check-column ts --target-dir sqin -m 1 --merge-key id. Note the quotes in both setting and invoking the OS environment variable MYSQL. This is because there are spaces in the value.

If I am not understanding your requirements, let me know.
10-31-2016
09:53 PM
5 Kudos
This article will show you how to set variables per environment, including sensitive values, so you can promote a single change-managed template of your full flow from one environment to another without changing the contents of the template. It also shows how to build reusable components. (The SDLC model is at the bottom of the article.) https://community.hortonworks.com/articles/60868/enterprise-nifi-implementing-reusable-components-a.html The process is:
- In flow configurations, sensitive values should be configured as Expression Language references to OS environment variables that you set in each environment, e.g. ${MYSQL_PSSWRD} (a small sketch of this follows at the end of this post). Other environment-specific config values should similarly use Expression Language references; if these are not sensitive, they should be in a custom properties file.
- Developers finalize their flows and submit the template of the flow to version control, e.g. Git (and also submit custom property files).
- Template and custom property files are promoted to each environment just as source code typically is.
- Automation: deploying templates to environments can be done via the NiFi REST API integrated with other automation tools.
- Governance bodies decide which configurations can be changed in real time (e.g. ControlRate properties). These changes do not need to go through version control and can be made by authorized admins on the fly.

The article elaborates on Expression Language, OS environment variables, and custom property files. If you have specific questions stemming from the article, please continue as comments in this thread.

Note: To deal with all flows as one deployable unit, you can make them all part of a single process group and change-manage that as a single template. Otherwise you can manage each flow as a separate template. Or you can do something in between: form process groups as logical groups (e.g. team, project, product, line of business) and use these as deployable units.
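As a small sketch of the first step (the password value and NiFi path are placeholders; the variable just has to be present in the environment that launches NiFi so an Expression Language reference like ${MYSQL_PSSWRD} can resolve it):

# set the environment-specific secret, then (re)start NiFi from that environment
export MYSQL_PSSWRD='the-password-for-this-environment'
/opt/nifi/bin/nifi.sh restart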
10-31-2016
01:44 PM
2 Kudos
See the following on creating ORC tables (it is mostly a matter of using "stored as ORC"): http://hadooptutorial.info/hive-table-creation-commands/#Example_3_8211_External_Table_with_ORC_FileFomat_Snappy_Compressed https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL To use Sqoop to import into a Hive ORC table, see: https://community.hortonworks.com/questions/28060/can-sqoop-be-used-to-directly-import-data-into-an.html
10-29-2016
07:25 PM
1 Kudo
Pig runs MapReduce under the covers, and this list of files is the output of a MapReduce job. You should also notice a 0-byte (no contents) file named _SUCCESS at the top of the list. That is just a flag saying the job was a success.

Bottom line: when you point your job or table to the parent directory holding these files, it simply sees the union of all the files together. So you can think of the parent directory logically as the "file" holding the data. Thus, there is never a need to concatenate the files on Hadoop -- just point to the parent directory and treat it as the file. So if you make a Hive table, just point to the parent directory. If you load the data into a Pig script, just point to the parent directory. Etc.

If you want to pull the data to an edge node, use the command hdfs dfs -getmerge <hdfsParentDir> <localPathAndName> and it will combine all of the m-001, m-002 ... files into a single file (see the example below). If you want to pull it to your local machine, use Ambari File Views, open the parent directory, click "+ Select All" and then click "Concatenate". That will concatenate all into one file and download it from your browser.

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps.
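For example (the HDFS parent directory and local filename here are hypothetical):

# combine all part files under the job's output directory into one local file on the edge node
hdfs dfs -getmerge /user/me/pig-output /tmp/pig-output-merged.txt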
10-28-2016
06:43 PM
4 Kudos
You can. Please see this excellent answer for importing ORC https://community.hortonworks.com/questions/28060/can-sqoop-be-used-to-directly-import-data-into-an.html and this one for exporting ORC https://hadoopist.wordpress.com/2015/09/09/how-to-export-a-hive-orc-table-to-a-oracle-database/