Member since
06-08-2017
1049 Posts
518 Kudos Received
312 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11227 | 04-15-2020 05:01 PM |
| | 7131 | 10-15-2019 08:12 PM |
| | 3114 | 10-12-2019 08:29 PM |
| | 11499 | 09-21-2019 10:04 AM |
| | 4343 | 09-19-2019 07:11 AM |
11-19-2018
10:03 PM
@Wei Wu The ListS3 processor doesn't allow any upstream connections. If you want to fetch files from S3 without ListS3, you need to put the filenames on the flowfile as attributes and then use the FetchS3Object processor to pull the actual content from the S3 bucket. To get all the filenames in the bucket, use some kind of REST API call or a shell script to list them, extract each filename into a flowfile attribute, and feed that connection into FetchS3Object. Refer to this link to get more context on this kind of use case.
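As an illustration of the listing step (outside NiFi), one option is to parse `aws s3 ls` output with a small script; the listing lines below are hypothetical sample data, not real bucket contents:

```python
def parse_s3_listing(listing: str) -> list[str]:
    """Extract object keys (filenames) from `aws s3 ls` style output.

    Each line looks like: '2018-11-19 21:03:11   1024 path/to/file.csv'
    (date, time, size, key). Keys may contain spaces, so split on
    whitespace at most three times and keep the remainder as the key.
    """
    keys = []
    for line in listing.strip().splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) == 4:
            keys.append(parts[3])
    return keys

sample = """\
2018-11-19 21:03:11       1024 data/orders.csv
2018-11-19 21:05:42       2048 data/customers.csv
"""
print(parse_s3_listing(sample))  # ['data/orders.csv', 'data/customers.csv']
```

Each extracted key would then be set as a flowfile attribute that FetchS3Object reads as its Object Key.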
11-19-2018
09:56 PM
@Justen Starting from NiFi 1.8.0 the filename is the same as the flowfile UUID, so you can search on the flowfile UUID to find a specific filename. If you are on a version prior to NiFi 1.8, you can search by filename with the following steps instead of clicking the "i" icon: click the search icon at the top right to open the Search Events box, enter the desired filename there, and click Search to see only the provenance events for that file.
11-19-2018
09:43 PM
@Henrik Olsen Based on the Number Of Records To Analyze property value, NiFi analyzes that many records to determine the type of each field, or, if the flowfile has fewer records, it is limited by the number of records in the flowfile. For example, if you set the property to 1 million and a flowfile actually contains 1 million records, then all of them are considered; otherwise the processor only analyzes what the flowfile holds. I think there are no true null values in those columns, which is why the InferAvroSchema processor does not add null as a default type for them (empty strings are not treated as null values for string-typed fields).
11-19-2018
09:27 PM
@Jacob Paul I believe your flowfiles have a source-date1 attribute with a value like 20181119112100. Change your UpdateAttribute property values to: source-date as ${source-date1:substring(0,8)} and source-time as ${source-date1:substring(8,14)} (the end index is exclusive, so 8,14 captures the full HHmmss portion). UpdateAttribute then adds these attributes to all outgoing flowfiles. In addition, you can perform the same kind of operation without extracting attributes by using the QueryRecord processor: configure/enable Record Reader/Writer controller services and use Apache Calcite's SUBSTRING function to create source-date and source-time columns in the flowfile.
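As a quick sanity check of those offsets (NiFi's substring uses Java-style start-inclusive, end-exclusive indices), plain string slicing gives the same result:

```python
ts = "20181119112100"  # yyyyMMddHHmmss, as in the example attribute

source_date = ts[0:8]   # equivalent to ${source-date1:substring(0,8)}
source_time = ts[8:14]  # equivalent to ${source-date1:substring(8,14)}

print(source_date)  # 20181119
print(source_time)  # 112100
```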
11-15-2018
10:14 PM
1 Kudo
@ravi kargam As a workaround, instead of the replaceRegex function use the replace function with the configs below.

UpdateRecord configs:
Replacement Value Strategy: Record Path Value
/employeeName: replace(/employeeName, '[', '(')

(or) if you are updating only one column value:
Replacement Value Strategy: Literal Value
/employeeName: ${field.value:replaceAll('\\[','(')}
11-15-2018
05:17 PM
@n c Once you whitelist the parameter in Ambari (typically by adding it to hive.security.authorization.sqlstd.confwhitelist.append), you will be able to set the parameter in the Hive CLI.
11-15-2018
03:12 AM
2 Kudos
@Henrik Olsen This exact case was introduced in the NiFi 1.6 version; the jira addressing this bug is NIFI-4883. Starting from NiFi 1.6 you can use one record writer for invalid records and a different record writer for valid records.
11-15-2018
03:00 AM
@Julio Gazeta I think this thread has the same issue: hitting maximum back pressure on the queue. The same fix described here: https://community.hortonworks.com/questions/227489/apache-nifi-distribution-trouble-in-cluster-spark.html will be applicable to this thread as well.
11-15-2018
02:03 AM
1 Kudo
@n c "So the month i.e. "10" is actually appearing as part of the table data. Is that correct?" Yes, that is correct: when we create a partitioned table, all partition columns appear at the end of the column list. Partitions boost query performance when we use the partition column in our WHERE clause.

Example: if you want to count the records in mth=10, then:
select count(*) from test_par_tbl where mth=10;
This query won't do a full table scan, because the predicate scans only the mth=10 partition and returns the result. When dealing with hundreds of millions of rows, partitioning is an optimization technique that boosts query performance by avoiding full table scans.

2. Even without the partition field in the WHERE clause you can still run the query below, but it will do a full table scan:
select count(*) from test_par_tbl where month(create_dt)=10;
Both queries give you the same result, but taking performance into consideration on big datasets, the first query runs more efficiently.

"Is it possible to partition the table as above and not have the partition column/value as part of the table data?" This is not possible: if the partition column is not part of the table data, Hive will do a full table scan over the entire dataset. If you still want to take the partition column out of the results, create a view on top of the partitioned table that excludes that column.
11-12-2018
11:52 PM
@Varun Yadav I don't think we can upload multiple templates at one time, but you can keep all the templates in one folder, read the filenames, and pass each filename (using a loop) to a curl API call that uploads the template to the NiFi canvas.