Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 11116 | 04-15-2020 05:01 PM |
|  | 7019 | 10-15-2019 08:12 PM |
|  | 3061 | 10-12-2019 08:29 PM |
|  | 11238 | 09-21-2019 10:04 AM |
|  | 4189 | 09-19-2019 07:11 AM |
10-15-2017
03:38 PM
Hi @xav webmaster, we can do this by using the RouteOnContent processor. But in your case the problem is that siteA and siteC share "jan" and "usa" as content checks, so you need to come up with a unique way to route content to siteA or siteC.

By using the RouteOnContent processor:-
This processor gives the same output as RouteText, but it has no Ignore Case property, so we need to prepare a regex that matches january, January, jaNuary, and so on. The regex below ignores all the cases mentioned above:
.*[Jj][Aa][Nn][Uu][Aa][Rr][Yy].*
Configs:-
1. Change the Match Requirement property to "content must contain match".
2. Add the properties below:
jan and cloudy (.*January.*Cloudy) //checks whether the content has january and cloudy in it
Example:- 2017, michigan, January, rainy, 20, eeuu, Cloudy
jan and usa (.*January.*USA) //checks whether the content has january and usa in it
Example:- 2017, michigan, January, rainy, 20, eeuu, USA
jan or feb and usa (January|February).*(USA) //checks whether the content has jan or feb and usa
Example:-
2017, michigan, February, rainy, 20, eeuu, USA
2017, michigan, January, rainy, 20, eeuu, USA
In this case both lines satisfy the regex.
Processor Configs:-

In addition:- how does the RouteText processor work? If your content has 50 lines in it and you want to route it, this processor compares each line with your properties and routes the matching lines to those relationships, i.e. one input file (multiple lines of content) and multiple outputs based on the relations specified in the processor configuration.
Example input:-
2017, michigan, January, rainy, 20, eeuu, Cloudy
2017, michigan, January, rainy, 20, eeuu, USA
2017, michigan, February, rainy, 20, eeuu, USA
Output:-
jan and usa relation gets
2017, michigan, January, rainy, 20, eeuu, USA
jan and cloudy gets
2017, michigan, January, rainy, 20, eeuu, Cloudy
jan or feb and usa gets
2017, michigan, January, rainy, 20, eeuu, USA
2017, michigan, February, rainy, 20, eeuu, USA

(or) If the flowfile content has only one line in it, you can use the RouteText processor and it works the same as RouteOnContent as described above.
Processor Configurations:-
1. Change the Matching Strategy property to "Contains Regular Expression" and Ignore Case to true //it ignores UPPER or lower case
2. Add the properties as follows:
jan and cloudy (.*January.*Cloudy) //checks whether the content has january and cloudy in it
Example:- 2017, michigan, January, rainy, 20, eeuu, Cloudy
jan and usa (.*January.*USA) //checks whether the content has january and usa in it
Example:- 2017, michigan, January, rainy, 20, eeuu, USA
jan or feb and usa (January|February).*(USA) //checks whether the content has jan or feb and usa
Example:-
2017, michigan, February, rainy, 20, eeuu, USA
2017, michigan, January, rainy, 20, eeuu, USA
In this case both lines satisfy the regex.
Processor configs:-

If your flowfile content is one line at a time, you can use either the RouteText (or) RouteOnContent processor; the results from both are the same, so choose whichever best fits your case. If your content has more than one line at a time, RouteText routes the matching lines to different relations, (or) use a SplitText processor to split each line into its own flowfile and then use either RouteText (or) RouteOnContent. If your content has more than one line at a time and you want to route the whole content based on some property, use RouteOnContent; it compares the whole content (not line by line as RouteText does) and redirects the whole content to the matching relationships.
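RouteText and RouteOnContent both evaluate Java regular expressions, so you can dry-run the routing rules outside NiFi. Below is a minimal Java sketch (not NiFi code; the class and variable names are made up for illustration) that checks the three example lines against the same property regexes, plus a case-insensitive pattern equivalent to the bracketed [Jj][Aa]... form:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class RouteRegexDemo {
    public static void main(String[] args) {
        // Same property regexes as in the processor configuration above
        Map<String, Pattern> routes = new LinkedHashMap<>();
        routes.put("jan and cloudy", Pattern.compile(".*January.*Cloudy"));
        routes.put("jan and usa", Pattern.compile(".*January.*USA"));
        routes.put("jan or feb and usa", Pattern.compile("(January|February).*(USA)"));
        // Case-insensitive variant, equivalent to .*[Jj][Aa][Nn][Uu][Aa][Rr][Yy].*
        routes.put("any january", Pattern.compile(".*january.*", Pattern.CASE_INSENSITIVE));

        List<String> lines = List.of(
                "2017, michigan, January, rainy, 20, eeuu, Cloudy",
                "2017, michigan, January, rainy, 20, eeuu, USA",
                "2017, michigan, February, rainy, 20, eeuu, USA");

        // A line goes to every relationship whose regex it satisfies
        for (String line : lines) {
            for (Map.Entry<String, Pattern> route : routes.entrySet()) {
                if (route.getValue().matcher(line).find()) {
                    System.out.println("[" + route.getKey() + "] <- " + line);
                }
            }
        }
    }
}
```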
10-15-2017
02:13 PM
@Jonathan Bell Can you add a Validation Query to the connection pool? The validation query is used to validate connections before returning them: when a connection is invalid, it gets dropped and a new valid connection is returned. Note!! Using validation might have some performance penalty.
Query:- select CURRENT_TIMESTAMP
Connection pool Configs:-
This validation query takes care of invalid connections, dropping them and establishing new ones, so you don't have to disable and re-enable the connection pool yourself.
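NiFi's DBCPConnectionPool service sits on top of Apache Commons DBCP, so the effect of the Validation Query can be sketched with a standalone DBCP example. This is only an illustration under that assumption; the driver class, JDBC URL, and credentials below are placeholders, not values from the question:

```java
import java.sql.Connection;
import java.sql.SQLException;
import org.apache.commons.dbcp2.BasicDataSource;

public class ValidatedPoolSketch {
    public static void main(String[] args) throws SQLException {
        BasicDataSource pool = new BasicDataSource();
        pool.setDriverClassName("org.postgresql.Driver");    // placeholder driver
        pool.setUrl("jdbc:postgresql://db-host:5432/mydb");  // placeholder URL
        pool.setUsername("app");                              // placeholder credentials
        pool.setPassword("secret");

        // Same idea as the Validation Query on the connection pool service:
        // each borrowed connection is tested first, and broken ones are discarded.
        pool.setValidationQuery("select CURRENT_TIMESTAMP");
        pool.setTestOnBorrow(true);

        try (Connection conn = pool.getConnection()) {
            // If a pooled connection had gone stale, the pool has already replaced it
            System.out.println("Got a validated connection: " + conn.isValid(2));
        }
        pool.close();
    }
}
```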
10-14-2017
11:24 PM
1 Kudo
Hi @Ramya Jayathirtha, in Hive, if you run a simple query like select * from table, no map reduce job is going to run, because we are just dumping the data.
Hive# select * from foo;
+---------+-----------+----------+--+
| foo.id | foo.name | foo.age |
+---------+-----------+----------+--+
| 1 | a | 10 |
| 2 | a | 10 |
| 3 | b | 10 |
| 4 | c | 20 |
+---------+-----------+----------+--+
4 rows selected (0.116 seconds)
You can use explain by adding it in front of your query; it displays how the query is going to be executed by the execution engine and how many map reduce phases will run for the query.
Hive# explain select * from foo;
+-------------------------------------------------------+--+
| Explain |
+-------------------------------------------------------+--+
| Plan not optimized by CBO. |
| |
| Stage-0 |
| Fetch Operator |
| limit:-1 |
| Select Operator [SEL_5652] |
| outputColumnNames:["_col0","_col1","_col2"] |
| TableScan [TS_5651] |
| alias:foo |
| |
+-------------------------------------------------------+--+
Whenever you do aggregations, a reducer phase is executed along with the map phase.
Hive# select count(*) from foo group by name;
INFO : Map 1: 0/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 1/1 Reducer 2: 0/1
INFO : Map 1: 1/1 Reducer 2: 0(+1)/1
INFO : Map 1: 1/1 Reducer 2: 1/1
+------+--+
| _c0 |
+------+--+
| 2 |
| 1 |
| 1 |
+------+--+
3 rows selected (13.709 seconds)
If you add explain in front of the above query, it displays:
Hive# explain select count(*) from foo group by name;
Reducer 2 <- Map 1 (SIMPLE_EDGE)
As you can see, there is a reducer phase along with the map phase. We can add another reducer phase to the above query by adding an order by clause to it:
Hive# select count(*) cnt from foo group by name order by cnt;
INFO : Map 1: 0/1 Reducer 2: 0/2 Reducer 3: 0/1
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 0/1 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 0(+1)/1 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 0(+1)/1
INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 1/1
+------+--+
| cnt |
+------+--+
| 1 |
| 1 |
| 2 |
+------+--+
You can see two reducer phases are run, because after aggregating we order the results.
Map 1 phase:- loads the data from HDFS.
Reducer 2:- does the aggregation.
Reducer 3:- after aggregation, orders the results in ascending order.
If you run explain on the above query:
Hive# explain select count(*) cnt from foo group by name order by cnt;
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
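If you want to capture that EXPLAIN output programmatically instead of from the shell, a rough sketch with the Hive JDBC driver might look like the following. This is an add-on illustration, not part of the original answer: the HiveServer2 URL and credentials are placeholders, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExplainSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL; adjust host, port and database for your cluster
        String url = "jdbc:hive2://hiveserver2-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Same query as above: group by adds Reducer 2, order by adds Reducer 3
            ResultSet plan = stmt.executeQuery(
                    "explain select count(*) cnt from foo group by name order by cnt");
            while (plan.next()) {
                System.out.println(plan.getString(1)); // each row is one line of the plan
            }
        }
    }
}
```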
10-14-2017
02:27 AM
1 Kudo
@Putta Challa Can you try using the EvaluateXQuery processor with Destination set to flowfile-attribute and the properties below added:-
data //DATA (extracts all the DATA nodes and keeps them as attributes on the flowfile)
columns //COLUMNS (extracts all the COLUMNS nodes and keeps them as attributes on the flowfile)
all_data string-join((for $x in //DATA return $x/text()), '09') (gets all the DATA nodes and separates them with 09)
columns_data string-join((for $y in (for $x in /COMPS return string-join(($x/DATA/text() , $x/COLUMNS/text()), '09')) return $y), '09') (joins the DATA and COLUMNS node values into one and keeps them in the columns_data attribute)
Then use a ReplaceText processor to create the new flowfile.
ReplaceText configs:- change the Replacement Value to
${data.1} //we have 2 DATA nodes; this uses the data.1 attribute value
${data.2} //data.2 attribute value
${columns} //columns node value
${all_data} //it includes all the data values with 09 separator.
${columns_data} //it includes all the data and columns values with the 09 separator
Output:-
value11 value12 value13 value14 value15
value21 value22 value23 value24 value25
Column1Column2Column3Column4Column5
value11 value12 value13 value14 value1509value21 value22 value23 value24 value25
value11 value12 value13 value14 value1509value21 value22 value23 value24 value2509Column1Column2Column3Column4Column5
Sample Flow:-
Method 2:- If you just want to write COLUMNS and DATA to a new file, that is easy; we can achieve that with a ReplaceText processor and these properties.
Change the Search Value to
[\s\S]{1,}<COLUMNS>(.*)<\/COLUMNS>[\r\n]+<DATA>(.*)<\/DATA>[\r\n]+<DATA>(.*)<\/DATA>[\s\S]{1,}
ReplaceText Search Value Config:-
and the Replacement Value to
$1
$2
$3
Here we are putting the captured groups into the Replacement Value; the processor replaces the flowfile content with the new content given in the Replacement Value property. So this ReplaceText processor gets an input file like:
<?xml version="1.0" encoding="UTF-8" ?>
<COMPS ReplyCode="0" ReplyText="Operation Successful">
<COUNT Records="258"/>
<DELIMITER value="09"/>
<COLUMNS>Column1Column2Column3Column4Column5 </COLUMNS>
<DATA>value11 value12 value13 value14 value15</DATA>
<DATA>value21 value22 value23 value24 value25</DATA>
</COMPS>
Output:-
Column1Column2Column3Column4Column5
value11 value12 value13 value14 value15
value21 value22 value23 value24 value25
You can use either way, whichever best fits your case :).
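ReplaceText evaluates Java regular expressions, so Method 2 can be sanity-checked outside NiFi. A minimal sketch, assuming the sample XML above, applying the same Search Value and Replacement Value:

```java
public class ReplaceTextRegexDemo {
    public static void main(String[] args) {
        String xml = String.join("\n",
                "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>",
                "<COMPS ReplyCode=\"0\" ReplyText=\"Operation Successful\">",
                "<COUNT Records=\"258\"/>",
                "<DELIMITER value=\"09\"/>",
                "<COLUMNS>Column1Column2Column3Column4Column5 </COLUMNS>",
                "<DATA>value11 value12 value13 value14 value15</DATA>",
                "<DATA>value21 value22 value23 value24 value25</DATA>",
                "</COMPS>");

        // Same Search Value as the ReplaceText configuration above
        String search = "[\\s\\S]{1,}<COLUMNS>(.*)</COLUMNS>[\\r\\n]+"
                + "<DATA>(.*)</DATA>[\\r\\n]+<DATA>(.*)</DATA>[\\s\\S]{1,}";

        // Replacement Value: the three captured groups, one per line
        String replaced = xml.replaceAll(search, "$1\n$2\n$3");

        // Prints the COLUMNS line followed by the two DATA lines
        System.out.println(replaced);
    }
}
```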
10-13-2017
12:08 AM
Hi @Eric Lloyd, if the TailFile processor is configured with the Tailing Mode property set to Multiple files and the Recursive Lookup property set to True, and say the Run Schedule is 10 sec (not necessarily), then the first time it runs on all nodes it tails the files available in those directories and stores the state as the file timestamp (you can check the state by right-clicking on the processor --> View state). When the processor runs again after 10 sec, it checks the files recursively; if there is any change in the state of the files, it pulls the new files and updates the state in the processor.
Example:- I have a test.log file
bash# ll
-rwxrwxrwx 1 nifi nifi 5 Oct 12 18:43 test.log
If you check the state in NiFi, you will see that NiFi converted the file timestamp, i.e. Oct 12 18:43, to a Unix timestamp in milliseconds and stored it in the processor. When it runs again, it compares the state value stored in the processor with the file's timestamp: if the values differ, it tails that file again and updates the state with the new timestamp; if the values are the same, it won't tail the file. In the same way, NiFi looks recursively in all directories; if any file's timestamp has changed, it pulls that file and updates the state. Now, taking your case: if only 234.2/foo.log is updating and 123.1/foo.log is not, then the processor will only fetch 234.2/foo.log; it won't fetch 123.1/foo.log because it is not updated. If a new directory gets created (or) logs get written to a new file, it doesn't matter, because we are recursively looking for new files created after the state stored in the processor, and it won't duplicate files that were fetched before. NiFi will take care of newly created files and directories.
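The comparison NiFi does can be pictured with a small sketch like the one below. This is not the actual TailFile implementation, just an illustration of keeping a per-file timestamp as state and only re-reading files whose timestamp has moved; the directory and file names are placeholders.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class TailStateSketch {
    // file path -> last seen modification time in epoch millis (the "state")
    private final Map<Path, Long> state = new HashMap<>();

    // Walks the base directory recursively and reports files whose timestamp changed
    public void poll(Path baseDir) throws IOException {
        try (Stream<Path> files = Files.walk(baseDir)) {
            files.filter(p -> p.getFileName().toString().equals("foo.log"))
                 .forEach(p -> {
                     try {
                         long modified = Files.getLastModifiedTime(p).toMillis();
                         Long seen = state.get(p);
                         if (seen == null || seen != modified) {
                             System.out.println("changed, tailing again: " + p);
                             state.put(p, modified);   // update the stored state
                         } else {
                             System.out.println("unchanged, skipping: " + p);
                         }
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }

    public static void main(String[] args) throws IOException {
        TailStateSketch sketch = new TailStateSketch();
        sketch.poll(Paths.get("versions"));   // e.g. versions/123.1/foo.log, versions/234.2/foo.log
        sketch.poll(Paths.get("versions"));   // second run only reports files that changed
    }
}
```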
10-12-2017
06:05 PM
@Eric Lloyd, for this case we cannot use wildcards like [*], as this processor won't accept that kind of regex. Change the Files to Tail property to
test[1-2]/[\d|a-z.*]{1,}/test.log
Expression explanation:-
test[1-2] -- look for test1 or test2
[\d|a-z.*]{1,} -- matches directory names made of digits or letters one or more times, and lists all those directories recursively.
Configs:-
For your case:- the Files to Tail property should be something like
versions/[\d|a-z.*]{1,}/<your-log-file-name>
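To double-check what that expression accepts, here is a small Java sketch testing the test[1-2] pattern against a few made-up relative paths (the versions/ pattern for your case behaves the same way):

```java
import java.util.List;
import java.util.regex.Pattern;

public class FilesToTailPatternDemo {
    public static void main(String[] args) {
        // Same Files to Tail expression as above: test1 or test2, then a
        // sub-directory of digits/letters/dots, then test.log
        Pattern filesToTail = Pattern.compile("test[1-2]/[\\d|a-z.*]{1,}/test.log");

        List<String> candidates = List.of(
                "test1/123.1/test.log",   // matches
                "test2/abc/test.log",     // matches
                "test3/123.1/test.log",   // no match: test3 is outside test[1-2]
                "test1/123.1/other.log"); // no match: filename differs

        for (String path : candidates) {
            System.out.println(path + " -> " + filesToTail.matcher(path).matches());
        }
    }
}
```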
10-12-2017
01:07 PM
1 Kudo
@Simon Jespersen
Can you try using the methods below to change the filename?
1. By using the replaceAll string manipulation, adding the filename property as
${filename:replaceAll('.*\_([A-Za-z]{3,5})\_([0-9]{8}).*xxxxxxx.json','$2')}.json
2. Another way is using the getDelimitedField manipulation, adding the filename property as
${filename:getDelimitedField(3, '_'):substring(0,8)}.json
Both ways give the same output; choose whichever best fits your case.
Output:-
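Both expressions boil down to plain Java-style string operations, so you can dry-run them on a sample name. The filename below is hypothetical (the real one in the question is partly masked as xxxxxxx), chosen only so that both methods can be shown giving the same result:

```java
public class FilenameRenameDemo {
    public static void main(String[] args) {
        // Hypothetical incoming filename, shaped like <prefix>_<code>_<yyyyMMdd>_xxxxxxx.json
        String filename = "report_ABC_20171012_xxxxxxx.json";

        // Method 1: replaceAll keeps only capture group 2 (the 8-digit date)
        String byReplaceAll =
                filename.replaceAll(".*_([A-Za-z]{3,5})_([0-9]{8}).*xxxxxxx.json", "$2") + ".json";

        // Method 2: take the 3rd '_'-delimited field and keep its first 8 characters
        String byDelimitedField = filename.split("_")[2].substring(0, 8) + ".json";

        System.out.println(byReplaceAll);      // 20171012.json
        System.out.println(byDelimitedField);  // 20171012.json
    }
}
```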
10-12-2017
12:37 PM
@chris herssens After the ConvertAvroToJSON processor, use a SplitJson processor with the configuration below: if your data is a JSON array, SplitJson splits the array so that every flowfile is a single record. Then use a JoltTransformJSON processor to flatten the JSON record.
Flow:- ConsumeKafka --> ConvertAvroToJSON --> SplitJson --> JoltTransformJSON
10-12-2017
12:15 AM
Hi @Eric Lloyd, can you try configuring the processor as follows? What I tried is:- I created 2 directories in my /tmp, test1 and test2, and did
echo "test">/tmp/test1/test.log
echo "test">/tmp/test2/test.log
In the processor I configured the Files to tail property as test[1-2]/test.log, so it looks in both the test1 and test2 directories; the Rolling filename strategy (not required) as test.log, since my file names are always test.log; the Base directory as /tmp; the Rolling strategy as Fixed name; and Recursive lookup as true. This processor picks up new data and stores the state for each directory.
10-11-2017
11:39 PM
@Sumit Sharma, use an UpdateAttribute processor before the PutFile processor with the configuration below. Add a new property to the processor by clicking the + sign:
filename as ${UUID()}
This replaces the filename of the flowfile with a UUID, which is a unique value every time, so it won't overwrite the file in your directory.
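As a quick illustration of why ${UUID()} gives a unique name every time (this is plain java.util.UUID, not NiFi code):

```java
import java.util.UUID;

public class UuidFilenameDemo {
    public static void main(String[] args) {
        // Two flowfiles arriving with the same original name still get distinct filenames
        String first  = UUID.randomUUID().toString();
        String second = UUID.randomUUID().toString();
        System.out.println(first);                  // e.g. 3f2b6c2e-0d5a-4f8f-9a4e-1c2d3e4f5a6b
        System.out.println(second);                 // a different value every run
        System.out.println(first.equals(second));   // false
    }
}
```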