Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3160 | 08-25-2017 03:09 PM |
 | 2021 | 08-22-2017 06:52 PM |
 | 3490 | 08-09-2017 01:10 PM |
 | 8181 | 08-04-2017 02:34 PM |
 | 8207 | 08-01-2017 11:35 AM |
11-14-2016
07:05 PM
The best solution seems to be ... after FetchFile:
- SplitText processor (into single lines)
- ExtractText processor (delete row): regex-match the rows you want to discard, and connect the unmatched relationship to the next processor.
- ReplaceText processor (drop column): regex-find a column and replace it with an empty value. Search Value e.g. (.*,){2} to match the third column; Replacement Value: ''. The delimiter needs special attention for the first and last columns, because there will be only a trailing or leading delimiter. (A quick way to test this kind of regex locally is sketched below.)
- ReplaceText processor (transform nulls): regex-find the nulls and replace them. Search Value: a regex that finds all nulls; Replacement Value: whatever you want to transform the nulls into.

From here you can merge lines if you need to batch, or keep the text split as single lines (records) if you want to stream. A bit challenging on the regex skills, but completely within the realm of regex operations. See the HCC article for a full working example: https://community.hortonworks.com/articles/66861/nifi-etl-removing-columns-filtering-rows-changing.html
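If you want to sanity-check that kind of column-dropping regex outside NiFi before putting it into the ReplaceText Search Value, a quick Python sketch could look like this (the pattern here is an illustrative variant, not necessarily the exact one from the article):

import re

line = "a,b,c,d"
# Capture the first two fields (with their trailing commas), then drop the
# third field and its comma; everything after it is left untouched.
print(re.sub(r'^((?:[^,]*,){2})[^,]*,?', r'\1', line))  # -> a,b,d

In ReplaceText the same idea becomes the Search Value / Replacement Value pair, with the capture group referenced as $1 (Java regex replacement syntax) rather than \1.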
... View more
11-14-2016
05:48 PM
2 Kudos
PigStorage
PigStorage needs to know the delimiter of your fields. The default delimiter is tab, which is used when you call PigStorage(). You can specify the delimiter, like you did when you used PigStorage(','). If your file is comma-delimited and you use PigStorage(), it will ignore the commas and see only one field (because it cannot find a tab) ... the commas just happen to be characters in a string. By correctly specifying PigStorage(','), it breaks each line into fields separated by the comma. https://pig.apache.org/docs/r0.9.1/func.html#pigstorage

Register
The link you mention (https://pig.apache.org/docs/r0.15.0/basic.html#register) is for registering UDFs. There are two types of functions in Pig: native functions and user-defined functions (UDFs). Native functions come with the Pig binaries and you do not need to do anything to call them. UDFs you build yourself into a jar file, and register them so the script can find them. Since PigStorage is a native function, you do not need to register it ... Pig will find it. (Thus the link is not relevant to your script.)

If this is what you were looking for, please let me know by accepting the answer; otherwise, let me know of any remaining gaps.
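As a rough analogy (in Python rather than Pig) of why the delimiter matters, compare splitting the same comma-delimited line on a tab vs. on a comma, mirroring PigStorage() vs. PigStorage(','):

line = "1,John,42"
print(line.split('\t'))  # no tab in the line -> a single field: ['1,John,42']
print(line.split(','))   # comma delimiter    -> three fields:  ['1', 'John', '42']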
... View more
11-11-2016
05:50 PM
@Gurpreet Singh I updated the previous answer with links to ExecuteScript using python scripts, and the important line: session.transfer(flowFile, REL_SUCCESS)
... View more
11-11-2016
04:09 PM
Map-reduce
This is a good high-level (easy) explanation: http://www.thegeekstuff.com/2014/05/map-reduce-algorithm/
To really understand it, you need to dive deep. For example, the mapper stage writes to local disk through a buffer which then spills to disk; this intermediate data is sent across the network to the reducer(s). To really understand map-reduce (so you can optimize performance), reading this book is a good way to go: http://shop.oreilly.com/product/0636920033448.do
You can write your own map-reduce programs, but they are typically generated for you when you run a Hive or Pig job (a tiny sketch of the mapper/reducer split follows below).

Tez
If you are running Hive or Pig queries, you should run them in Tez mode. Tez is an alternative processing engine to map-reduce and is much faster. See:
http://hortonworks.com/apache/tez/
http://www.slideshare.net/Hadoop_Summit/w-1205phall1saha
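To make the mapper/reducer split concrete, here is a minimal word-count sketch in the Hadoop Streaming style. It is only an illustration: in a real job the mapper and reducer run as separate programs and the framework performs the shuffle/sort between them.

import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key after the shuffle/sort;
    # here we sort locally to simulate that, then sum the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == '__main__':
    for word, total in reducer(list(mapper(sys.stdin))):
        print("%s\t%d" % (word, total))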
... View more
11-11-2016
02:17 PM
1 Kudo
For Tez, "tasks" represent map operations or reduce operations. A DAG is a full workflow (job) of vertices (the processing of tasks) and edges (data movement between vertices). See these links for a more detailed discussion:
http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/
https://community.hortonworks.com/questions/32164/question-on-tez-dag-task-and-pig-on-tez.html
https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
You can see the number of tasks in the console output, and also in the Ambari Tez view (where you can drill down for greater detail). See this for understanding the Ambari Tez view: https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.0/bk_ambari_views_guide/content/section_using_tez_view.html
... View more
11-10-2016
06:34 PM
@Gurpreet Singh If I am understanding the requirement correctly, you should route to REL_SUCCESS in the script if you find the file, and connect that relationship to the processor you want to kick off. Route to REL_FAILURE if the file is not found, and connect that to the processor you want for the fail condition. For a Python example, see: https://community.hortonworks.com/articles/35568/python-script-in-nifi.html
https://gist.github.com/mattyb149/89205fcbc6d0e15ba024 Note the line: session.transfer(flowFile, REL_SUCCESS)
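Putting those pieces together, a minimal Jython body for ExecuteScript might look like the sketch below. It assumes the path to check arrives in a flowfile attribute (called 'absolute.path' here purely for illustration; use whatever attribute your flow actually sets).

import os

flowFile = session.get()
if flowFile is not None:
    # 'absolute.path' is an assumed attribute name for this sketch.
    path = flowFile.getAttribute('absolute.path')
    if path is not None and os.path.exists(path):
        # File found: route to the processor wired to the success relationship.
        session.transfer(flowFile, REL_SUCCESS)
    else:
        # File not found: route to the processor handling the fail condition.
        session.transfer(flowFile, REL_FAILURE)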
... View more
11-10-2016
01:42 PM
Could you include a sample of two records? Please update your question with the sample.
... View more
11-10-2016
01:13 PM
2 Kudos
Your requirement is basically how NiFi works. When making a connection between two processors, you are asked (for processors where this is relevant) to make a SUCCESS or FAILURE connection. Thus, from one processor make a success connection to the downstream processor to be triggered on success, and make a separate failure connection to the one that should respond to failures (in your case, the PutEmail processor). This works the same for connections between processors as it does for connections between process groups. Specifically for ExecuteScript, in the script you route to success or failure as follows:
session.transfer(flowFile, REL_SUCCESS)
// or
session.transfer(flowFile, REL_FAILURE)
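If the failure you care about is the script itself hitting an error, a common pattern (sketched here in Jython; the processing step is a placeholder) is to wrap the work in a try/except and route accordingly:

flowFile = session.get()
if flowFile is not None:
    try:
        # ... your processing of flowFile goes here ...
        session.transfer(flowFile, REL_SUCCESS)
    except Exception:
        # Any error lands on the failure relationship, e.g. toward PutEmail.
        session.transfer(flowFile, REL_FAILURE)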
For links to getting started with NiFi, see:
http://hortonworks.com/apache/nifi
https://nifi.apache.org/docs/nifi-docs
http://hortonworks.com/hadoop-tutorial/learning-ropes-apache-nifi
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
For ExecuteScript specifically, see: http://funnifi.blogspot.com/2016/02/executescript-processor-hello-world.html
... View more
11-09-2016
03:57 PM
1 Kudo
HDP does not support Impala, thus it should not be installed. Hortonworks is committed to Hive LLAP, which provides Impala-like (in-memory) capabilities. It comes with the platform out of the box, but needs a few configurations to get it up and running. See:
http://hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/
http://hortonworks.com/hadoop-tutorial/interactive-sql-hadoop-hive-llap/
http://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
https://cwiki.apache.org/confluence/display/Hive/LLAP
... View more
11-09-2016
02:25 PM
1 Kudo
This post shows how to very quickly build separate logs for each processor (or however else you wish to customize logs): https://community.hortonworks.com/articles/65027/nifi-easy-custom-logging-of-diverse-sources-in-mer.html
... View more