Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3160 | 08-25-2017 03:09 PM |
 | 2021 | 08-22-2017 06:52 PM |
 | 3490 | 08-09-2017 01:10 PM |
 | 8181 | 08-04-2017 02:34 PM |
 | 8207 | 08-01-2017 11:35 AM |
11-14-2016
07:05 PM
The best solution seems to be ... after FetchFile:
- SplitText processor (into single lines)
- ExtractText processor (delete row): regex-match the rows you want to discard, and connect the unmatched relationship to the next processor.
- ReplaceText processor (drop column): regex-find a column and replace it with an empty value. Search Value e.g. (.*,){2} to match the third column; Replacement Value: ''. The delimiter needs special attention for the first and last columns, because there will be only a trailing or leading delimiter. (A quick way to test this kind of regex locally is sketched below.)
- ReplaceText processor (transform nulls): regex-find the nulls and replace them. Search Value: a regex that finds all nulls; Replacement Value: whatever you want to transform the nulls into.

From here you can merge lines if you need to batch, or keep the text split as single lines (records) if you want to stream. A bit challenging on the regex skills, but completely within the realm of regex operations. See the HCC article for a full working example: https://community.hortonworks.com/articles/66861/nifi-etl-removing-columns-filtering-rows-changing.html
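If you want to sanity-check that kind of column-dropping regex outside NiFi before putting it into the ReplaceText Search Value, a quick Python sketch could look like this (the pattern here is an illustrative variant, not necessarily the exact one from the article):

import re

line = "a,b,c,d"
# Capture the first two fields (with their trailing commas), then drop the
# third field and its comma; everything after it is left untouched.
print(re.sub(r'^((?:[^,]*,){2})[^,]*,?', r'\1', line))  # -> a,b,d

In ReplaceText the same idea becomes the Search Value / Replacement Value pair, with the capture group referenced as $1 (Java regex replacement syntax) rather than \1.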
... View more
11-14-2016
05:48 PM
2 Kudos
PigStorage
PigStorage needs to know the delimiter of your fields. The default delimiter is tab, which is used when you call PigStorage(). You can specify the delimiter, like you did when you used PigStorage(','). If your file is comma-delimited and you use PigStorage(), it will ignore the commas and see only one field (because it cannot find a tab) ... the commas just happen to be characters in a string. By correctly specifying PigStorage(','), it breaks each line into fields separated by the comma. https://pig.apache.org/docs/r0.9.1/func.html#pigstorage

Register
The link you mention (https://pig.apache.org/docs/r0.15.0/basic.html#register) is for registering UDFs. There are two types of functions in Pig: native functions and user-defined functions (UDFs). Native functions come with the Pig binaries and you do not need to do anything to call them. UDFs you build yourself into a jar file, and register them so the script can find them. Since PigStorage is a native function, you do not need to register it ... Pig will find it. (Thus the link is not relevant to your script.)

If this is what you were looking for, please let me know by accepting the answer; otherwise, let me know of any remaining gaps.
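As a rough analogy (in Python rather than Pig) of why the delimiter matters, compare splitting the same comma-delimited line on a tab vs. on a comma, mirroring PigStorage() vs. PigStorage(','):

line = "1,John,42"
print(line.split('\t'))  # no tab in the line -> a single field: ['1,John,42']
print(line.split(','))   # comma delimiter    -> three fields:  ['1', 'John', '42']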
... View more
11-11-2016
05:50 PM
@Gurpreet Singh I updated the previous answer with links to ExecuteScript using python scripts, and the important line: session.transfer(flowFile, REL_SUCCESS)
... View more
11-11-2016
04:09 PM
Map-reduce
This is a good high-level (easy) explanation: http://www.thegeekstuff.com/2014/05/map-reduce-algorithm/
To really understand it, you need to dive deep. For example, the mapper stage writes to local disk through a buffer which then spills to disk; this intermediate data is sent across the network to the reducer(s). To really understand map-reduce (so you can optimize performance), reading this book is a good way to go: http://shop.oreilly.com/product/0636920033448.do
You can write your own map-reduce programs, but they are typically generated for you when you run a Hive or Pig job (a tiny sketch of the mapper/reducer split follows below).

Tez
If you are running Hive or Pig queries, you should run them in Tez mode. Tez is an alternative processing engine to map-reduce and is much faster. See:
http://hortonworks.com/apache/tez/
http://www.slideshare.net/Hadoop_Summit/w-1205phall1saha
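To make the mapper/reducer split concrete, here is a minimal word-count sketch in the Hadoop Streaming style. It is only an illustration: in a real job the mapper and reducer run as separate programs and the framework performs the shuffle/sort between them.

import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key after the shuffle/sort;
    # here we sort locally to simulate that, then sum the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == '__main__':
    for word, total in reducer(list(mapper(sys.stdin))):
        print("%s\t%d" % (word, total))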
... View more
11-11-2016
02:17 PM
1 Kudo
For Tez, "tasks" represent map operations or reduce operations. A DAG is a full workflow (job) of vertices (the processing of tasks) and edges (data movement between vertices). See these links for a more detailed discussion:
http://hortonworks.com/blog/expressing-data-processing-in-apache-tez/
https://community.hortonworks.com/questions/32164/question-on-tez-dag-task-and-pig-on-tez.html
https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works
You can see the number of tasks in the console output, and also in the Ambari Tez view (where you can drill down for greater detail). See this for understanding the Ambari Tez view: https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.0/bk_ambari_views_guide/content/section_using_tez_view.html
... View more
11-10-2016
06:34 PM
@Gurpreet Singh If I am understanding the requirement correctly, you should route to REL_SUCCESS in the script if you find the file, and connect that relationship to the processor you want to kick off. Route to REL_FAILURE if the file is not found, and connect that to the processor you want for the fail condition. For a Python example, see: https://community.hortonworks.com/articles/35568/python-script-in-nifi.html
https://gist.github.com/mattyb149/89205fcbc6d0e15ba024 Note the line: session.transfer(flowFile, REL_SUCCESS)
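Putting those pieces together, a minimal Jython body for ExecuteScript might look like the sketch below. It assumes the path to check arrives in a flowfile attribute (called 'absolute.path' here purely for illustration; use whatever attribute your flow actually sets).

import os

flowFile = session.get()
if flowFile is not None:
    # 'absolute.path' is an assumed attribute name for this sketch.
    path = flowFile.getAttribute('absolute.path')
    if path is not None and os.path.exists(path):
        # File found: route to the processor wired to the success relationship.
        session.transfer(flowFile, REL_SUCCESS)
    else:
        # File not found: route to the processor handling the fail condition.
        session.transfer(flowFile, REL_FAILURE)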
... View more
11-10-2016
01:42 PM
Could you include a sample of two records? Please update your question with the sample.
... View more
11-10-2016
01:13 PM
2 Kudos
Your requirement is basically how NiFi works. When making a connection between two processors, you are asked (for processors where this is relevant) to make a SUCCESS or FAILURE connection. Thus, from one processor make a success connection to the downstream processor to be triggered on success, and make a separate failure connection to the one that should respond to failures (in your case, the PutEmail processor). This works the same for connections between processors as it does for connections between process groups. Specifically for ExecuteScript, in the script you route to success or failure as follows:
session.transfer(flowFile, REL_SUCCESS)
// or
session.transfer(flowFile, REL_FAILURE)
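If the failure you care about is the script itself hitting an error, a common pattern (sketched here in Jython; the processing step is a placeholder) is to wrap the work in a try/except and route accordingly:

flowFile = session.get()
if flowFile is not None:
    try:
        # ... your processing of flowFile goes here ...
        session.transfer(flowFile, REL_SUCCESS)
    except Exception:
        # Any error lands on the failure relationship, e.g. toward PutEmail.
        session.transfer(flowFile, REL_FAILURE)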
For links to getting started with NiFi, see:
http://hortonworks.com/apache/nifi
https://nifi.apache.org/docs/nifi-docs
http://hortonworks.com/hadoop-tutorial/learning-ropes-apache-nifi
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
For ExecuteScript specifically, see: http://funnifi.blogspot.com/2016/02/executescript-processor-hello-world.html
... View more
11-09-2016
03:57 PM
1 Kudo
HDP does not support Impala, thus it should not be installed. Hortonworks is committed to Hive LLAP, which provides Impala-like (in-memory) capabilities. It comes with the platform out of the box, but needs a few configurations to get it up and running. See:
http://hortonworks.com/blog/llap-enables-sub-second-sql-hadoop/
http://hortonworks.com/hadoop-tutorial/interactive-sql-hadoop-hive-llap/
http://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
https://cwiki.apache.org/confluence/display/Hive/LLAP
... View more
11-09-2016
02:25 PM
1 Kudo
This post shows how to very quickly build separate logs for each processor (or however else you wish to customize logs): https://community.hortonworks.com/articles/65027/nifi-easy-custom-logging-of-diverse-sources-in-mer.html
... View more