Member since 06-08-2017
1049 Posts
518 Kudos Received
312 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11126 | 04-15-2020 05:01 PM |
| | 7028 | 10-15-2019 08:12 PM |
| | 3074 | 10-12-2019 08:29 PM |
| | 11261 | 09-21-2019 10:04 AM |
| | 4191 | 09-19-2019 07:11 AM |
11-01-2017
01:51 AM
1 Kudo
@dhieru singh Yes, you can use the view name as the table name in the QueryDatabaseTable processor, and it's better to configure Maximum-value Column(s) so the processor pulls only the records that were added (or) modified since the last run. If no max-value columns are mentioned in the configuration, the processor has nothing to track incrementally and will pull the full result set on every run, which can have a performance impact. You can view the state of the processor by right-clicking it and clicking View state; if you want to clear the state, click Clear state in that dialog.
QueryDatabaseTable configs:- I changed Max Rows Per Flow File to 100K, so the processor fetches 100K records per flowfile, (or) you can leave it at the default of 0 to get all the records in one flowfile.
Connection pool configs:- the connection pool in the screenshot above is an example for SQL Server, as you can see from the highlighted text. If you don't mention any database in the connection URL, it connects to the default database on the source.
jdbc:sqlserver://<ip-address-server>:<port> //this connection string doesn't mention a database name, so it connects to the default database in SQL Server.
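To make the incremental pull concrete, here is a rough sketch in plain Python rather than NiFi code; the table name, column name, and in-memory SQLite database are made-up stand-ins for your view and max-value column:

```python
# Minimal sketch (not NiFi code) of how a max-value column turns a full pull
# into an incremental pull. Table/column names are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_view (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO my_view VALUES (?, ?)",
                 [(1, "2017-10-30"), (2, "2017-10-31"), (3, "2017-11-01")])

last_seen = None  # NiFi keeps this in processor state; here it's just a variable

def incremental_fetch(max_value_column="updated_at"):
    """Pull only rows newer than the stored max value, then advance the state."""
    global last_seen
    if last_seen is None:
        rows = conn.execute("SELECT * FROM my_view").fetchall()
    else:
        rows = conn.execute(
            f"SELECT * FROM my_view WHERE {max_value_column} > ?", (last_seen,)
        ).fetchall()
    if rows:
        last_seen = max(r[1] for r in rows)  # what "View state" would show
    return rows

print(incremental_fetch())   # first run: all 3 rows
print(incremental_fetch())   # second run: nothing new, empty list
```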
10-31-2017
07:18 PM
1 Kudo
@Hadoop User, the MergeContent Minimum Group Size depends on your input file size. In the MergeContent processor, change the Correlation Attribute Name property to filename //it will bin all the chunks that have the same filename and merge them together.
Minimum Number of Entries //the minimum number of flowfiles to include in a bundle; it needs to be at least equal to the number of chunk files you get after the SplitText processor. Maximum Number of Entries //the maximum number of flowfiles to include in a bundle. Minimum Group Size //the minimum size of the bundle; this should be at least your file size, otherwise some of your data will not be merged.
Max Bin Age //the maximum age of a bin that will trigger the bin to be complete, i.e. after that many minutes the processor flushes out whatever flowfiles are waiting in front of it. In the screenshot above I have the Correlation Attribute Name property set to filename, which means all the chunks that have the same filename will be grouped as one. The processor waits for a minimum of 2 files to merge and a maximum of 1000 files, and it also checks the min and max group size properties. If your flow satisfies these properties, the MergeContent processor won't have any files waiting in front of it. If your flow does not meet the configuration above, then you need to use the Max Bin Age property to flush out all the files that are waiting in front of the processor. As you can see in my config I gave 1 minute, so the processor waits 1 minute and, if it doesn't find any matching correlation attributes, flushes the files out; in your case define the value as per your requirements.
For your reference, Ex1:- let's say your file size is 100 MB and after SplitText we have 1000 chunks. Then your MergeContent configuration looks like
Minimum Number of Entries 1
Maximum Number of Entries 1000
Minimum Group Size 100 MB //at least equal to your file size
case1:- if one flowfile is 100 MB, the Maximum Number of Entries property is ignored: min entries is 1 and min group size is 100 MB, the minimum requirements are satisfied, so the processor merges that file.
case2:- if 1000 flowfiles are 10 MB each, the Minimum Group Size property is ignored: max entries is 1000, the maximum requirement is satisfied, so the processor merges those files. The 1000 chunks are then merged into one file.
Ex2:- let's say your file size is 95 MB and after SplitText we have 900 chunks. The challenge in this case is that the processor with the configuration above will not merge the 900 chunks, because the bin hasn't reached the minimum group size of 100 MB (we only have 95 MB), yet we still need that file merged. In this case your MergeContent configuration looks like
Minimum Number of Entries 1
Maximum Number of Entries 1000 //equal to the number of chunk files
Minimum Group Size 100 MB //at least equal to your file size
case1 and case2 behave the same as in Ex1.
Max Bin Age 1 minute //we need to add Max Bin Age; this property helps when files are waiting in front of the processor: after 1 minute it flushes those files out and merges them according to the filename correlation attribute.
By analyzing your GetFile, SplitText and ReplaceText processors (size, count), you can configure the MergeContent processor accordingly.
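As a rough, non-authoritative sketch of how those four properties interact, here is the bin-completion decision in plain Python; the thresholds mirror the Ex1/Ex2 settings, and the function and its arguments are made up for illustration (this is not NiFi's actual code):

```python
# Rough simulation of MergeContent bin completion, just to illustrate the
# interplay of the four properties discussed above. Not NiFi code.
MIN_ENTRIES = 1
MAX_ENTRIES = 1000
MIN_GROUP_SIZE_MB = 100
MAX_BIN_AGE_SEC = 60

def bin_is_complete(entry_count, total_size_mb, bin_age_sec):
    # Bin completes when it reaches the maximums...
    if entry_count >= MAX_ENTRIES:
        return True                      # Ex1 case2: 1000 chunks reached
    # ...or when it satisfies all the minimums...
    if entry_count >= MIN_ENTRIES and total_size_mb >= MIN_GROUP_SIZE_MB:
        return True                      # Ex1 case1: one 100 MB flowfile
    # ...or when Max Bin Age forces a flush (Ex2: 900 chunks, only 95 MB).
    if bin_age_sec >= MAX_BIN_AGE_SEC:
        return True
    return False

print(bin_is_complete(1, 100, 0))     # True  (min entries + min group size met)
print(bin_is_complete(1000, 50, 0))   # True  (max entries reached)
print(bin_is_complete(900, 95, 10))   # False (stuck until the bin ages out)
print(bin_is_complete(900, 95, 60))   # True  (flushed by Max Bin Age)
```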
10-31-2017
03:06 PM
@Hadoop User Instead of replacing text in one 100 MB file with the ReplaceText processor, another way is to use SplitText, replace the text in those small chunks, and then merge the chunks back into one file using the MergeContent processor. Flow explanation:-
1.GetFile
2.SplitText (splits relation) //split the text file into your required number of lines
3.ReplaceText (success relation) //replace the newlines in each split so it becomes one line
4.MergeContent (merged relation) //group all the splits back into one file
In this process the input file is split into small chunks, and then the newlines in each chunk are replaced so the content becomes one line. Flow:-
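As a tiny illustration of why splitting first and replacing per chunk gives the same result as replacing the whole file at once, here is a plain-Python sketch with made-up content and chunk size:

```python
# Plain-Python illustration of the split -> replace -> merge idea above.
# The sample text and chunk size are made up; NiFi does this per flowfile.
text = "line1\nline2\nline3\nline4\nline5\nline6\n"

# SplitText: break the file into chunks of 2 lines each
lines = text.splitlines()
chunks = ["\n".join(lines[i:i + 2]) + "\n" for i in range(0, len(lines), 2)]

# ReplaceText on each small chunk: turn newlines into spaces
replaced = [chunk.replace("\n", " ") for chunk in chunks]

# MergeContent: concatenate the chunks back into one file
merged = "".join(replaced)

# Same result as replacing newlines in the whole 100 MB file at once,
# but each ReplaceText invocation only had to buffer a small chunk.
assert merged == text.replace("\n", " ")
print(merged)
```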
10-31-2017
01:53 PM
@dhieru singh The join aggregate function works only on attribute values and concatenates those values with the specified delimiter. That means if you have an abc attribute with the value hello and an xyz attribute with the value world, ${allAttributes("abc", "xyz"):join(" now ")} results in hello now world, i.e. all the attribute values get concatenated with now as the delimiter. If you want to add an attribute instead, use the EL below: temp_${now():format("yyyy-MM-dd-HH-mm-ss")}+${random():mod(10):plus(1)} Without the join function this results in something like temp_2017-10-31-09-48-10+8
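Purely for illustration, here is roughly what those two expressions evaluate to, simulated in plain Python; the attribute map is hypothetical and NiFi's Expression Language does all of this for you:

```python
# Plain-Python approximation of the two NiFi Expression Language examples above.
# The attribute dictionary is hypothetical.
from datetime import datetime
import random

attributes = {"abc": "hello", "xyz": "world"}

# ${allAttributes("abc", "xyz"):join(" now ")}
joined = " now ".join(attributes[name] for name in ("abc", "xyz"))
print(joined)  # hello now world

# temp_${now():format("yyyy-MM-dd-HH-mm-ss")}+${random():mod(10):plus(1)}
stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
suffix = random.randrange(10**9) % 10 + 1           # mod(10):plus(1) -> 1..10
print(f"temp_{stamp}+{suffix}")                      # e.g. temp_2017-10-31-09-48-10+8
```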
10-30-2017
07:40 PM
@dhieru singh Yes, you are right: all those flowfiles are now held by the MergeContent processor, and in addition the processor needs to keep track of the grouping of flowfiles based on the user-defined merge strategy.
10-30-2017
07:10 PM
@Gurinderbeer Singh Since we are casting the column's data type to varchar() or decimal(), the ExecuteSQL processor converts it to a string type. But if you know the column's data, then once you import the data into the directory you can create the Hive table with whatever column data type you need: float, double, decimal, string, etc.
10-30-2017
05:52 PM
@dhieru singh That is because your MergeContent processor is running, which means it is working on those flowfiles. If you want to list those flowfiles, stop the MergeContent processor; only then can you view them in NiFi. If you don't want to see the queue in red, click on the success relationship, go to the Settings tab, and set the queue's back pressure thresholds (object count and data size) as big as you need; by default these are 10,000 flowfiles or 1 GB.
10-30-2017
05:38 PM
@Hadoop User You need to change the Maximum Buffer Size in your ReplaceText processor, as by default it is 1 MB; when the flowfile size is more than 1 MB it will be routed to the failure relationship. ReplaceText configs:- Search Value property as (.*)\n and Replacement Value as $1. For testing I set the buffer size to 10 MB, but you can change the size. **Keep in mind that a larger buffer size can lead to out-of-memory issues.**
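For illustration, the same search/replace pair reproduced with Python's re module on a made-up snippet (Python writes the back-reference as \1 where NiFi/Java uses $1):

```python
# Reproduces the ReplaceText pair above on a small, made-up snippet:
# Search Value (.*)\n with Replacement Value $1 drops each trailing newline,
# collapsing a multi-line file into a single line.
import re

content = "col1,col2\nval1,val2\nval3,val4\n"
flattened = re.sub(r"(.*)\n", r"\1", content)   # \1 is Python's spelling of $1
print(flattened)  # col1,col2val1,val2val3,val4
```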
10-30-2017
05:00 PM
1 Kudo
@Hadoop User
Use the ListFile processor and run it on a cron schedule every minute; this processor stores state and won't return any warning if there is no new file. Then you can use the FetchFile processor to pull the files listed by ListFile. These processors won't delete the file from your directory once the fetch is done (unlike the GetFile processor), so if you want to delete those files from the directory, use an ExecuteStreamCommand processor and write a shell script that gets the filename from the flowfile attributes; pass that attribute to your script (a sketch of an equivalent script is shown below). Flow:-
1.ListFile //list all the files from the directory.
2.FetchFile //fetch the listed file.
3.ExecuteStreamCommand //script to delete the file from the directory. Refer to the link below for how to pass attributes to the ExecuteStreamCommand processor script. https://pierrevillard.com/2016/03/09/transform-data-with-apache-nifi/
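The post suggests a shell script; as an equivalent sketch, here is a small Python script that ExecuteStreamCommand could invoke, assuming you pass the full path (${path}/${filename}) as the command argument. The script name and argument wiring are assumptions; see the linked article for the attribute-passing details:

```python
#!/usr/bin/env python
# Hypothetical delete script for ExecuteStreamCommand: it assumes the full
# path (${path}/${filename}) is passed as the first command argument.
import os
import sys

def main():
    if len(sys.argv) < 2:
        sys.stderr.write("usage: delete_fetched_file.py <absolute-path-to-file>\n")
        return 1
    target = sys.argv[1]
    if os.path.isfile(target):
        os.remove(target)            # remove the file FetchFile already read
        print("deleted %s" % target)
        return 0
    sys.stderr.write("file not found: %s\n" % target)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```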
10-30-2017
02:33 PM
@sally sally Yes, you can do this several ways using NiFi processors: 1.By using the GetHDFS processor (pure NiFi processors). 2.By using the ListHDFS processor (pure NiFi processors). 3.Run a script, add the attributes to the flowfile, and use them in the FetchHDFS processor. Method 1:- using the GetHDFS processor:- for testing I have these 4 files in the folder2 directory and I want to fetch only the files whose names start with 2011: hadoop fs -ls /user/yashu/folder2/
Found 4 items
-rw-r--r-- 3 hdfs 27 2017-10-30 09:16 /user/yashu/folder2/2011-01-01.1
-rw-r--r-- 3 hdfs 359 2017-10-20 08:47 /user/yashu/folder2/hbase.txt
-rw-r--r-- 3 hdfs 24 2017-10-09 21:45 /user/yashu/folder2/sam.txt
-rw-r--r-- 3 hdfs 12 2017-10-09 21:45 /user/yashu/folder2/sam1.txt
1. Use the GetHDFS processor and change the Keep Source File property to true (the default is false). //set it to true if you want to keep the source file in the directory, (or) keep it false if you want the file deleted after fetching.
2. Give the path of your directory.
3. In the File Filter Regex property, give a regex that matches your required filenames. Ex:- I need only files starting with 2011, so I gave the regex 2011.* and the processor now fetches only the /user/yashu/folder2/2011-01-01.1 file from the directory.
Method 2:- using the ListHDFS processor:- configure your directory path in the ListHDFS processor and it will list all the files in that directory. We cannot filter out the required files in the ListHDFS processor itself, but every flowfile coming from ListHDFS has a filename attribute associated with it. We can make use of that filename attribute with a RouteOnAttribute processor.
RouteOnAttribute:- add a new property in RouteOnAttribute; this processor then works as a file filter for the flowfiles.
Property:- requiredfilenames ${filename:matches('2011.*')}
This property matches the filenames and routes a flowfile only if it satisfies the expression above. All the other filenames (sam.txt, sam1.txt, ...etc) are not routed; only the 2011 filename goes to the requiredfilenames relationship (a small regex check is sketched below). Flow:-
Method 3:- run a script:- you can run the script and then use some processors (ExtractText, ...etc) to extract the filename and path from the result, and use those attributes in the FetchHDFS processor.
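As a quick, illustrative check (plain Python, using the filenames from the listing above) of which names the 2011.* filter matches; as I understand it, both the File Filter Regex and ${filename:matches('2011.*')} require the whole filename to match the pattern:

```python
# Quick check of the 2011.* filter against the filenames listed above.
# Assumes the whole name must match, as with filename:matches() in NiFi EL.
import re

filenames = ["2011-01-01.1", "hbase.txt", "sam.txt", "sam1.txt"]
pattern = re.compile(r"2011.*")

for name in filenames:
    matched = pattern.fullmatch(name) is not None
    print(f"{name}: {'routed/fetched' if matched else 'skipped'}")
# Only 2011-01-01.1 is routed/fetched; the other three are skipped.
```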