Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11126 | 04-15-2020 05:01 PM |
| | 7028 | 10-15-2019 08:12 PM |
| | 3074 | 10-12-2019 08:29 PM |
| | 11260 | 09-21-2019 10:04 AM |
| | 4191 | 09-19-2019 07:11 AM |
10-30-2017
12:51 PM
@sally sally Yes, you can use the FetchHDFS processor, but the flowfile needs ${path} and ${filename} attributes associated with it; FetchHDFS reads those attributes and then works as expected. To add them, use UpdateAttribute, ExtractText, or a similar processor to set the ${path} and ${filename} attributes on the flowfile.
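As a minimal sketch (the directory and file name below are hypothetical, not from this thread): in UpdateAttribute you would add two properties, path = /user/nifi/incoming and filename = data.csv, and FetchHDFS would then resolve ${path}/${filename}. You can confirm the target exists from the shell:

```bash
# Hypothetical path/filename matching the ${path} and ${filename} attributes
# set in UpdateAttribute; FetchHDFS would fetch this exact HDFS file.
hdfs dfs -ls /user/nifi/incoming/data.csv
```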
10-30-2017
03:00 AM
1 Kudo
@Shakeel Ahmad
Use the ExecuteStreamCommand processor instead of the ExecuteProcess processor, because ExecuteProcess does not allow any incoming connections while ExecuteStreamCommand accepts them. Connect the output stream relationship of one ExecuteStreamCommand processor to the next one, so that the next ExecuteStreamCommand triggers only when the first has completed; after the 4th process completes, add a PutEmail notification. ExecuteStreamCommand configs:- Flow:- I have added 2 ExecuteStreamCommand processors here; you need to add 4 of them and then PutEmail. To trigger the first ExecuteStreamCommand processor, use a GenerateFlowFile processor and schedule it to run periodically. Alternatively, instead of using GenerateFlowFile to trigger the flow, you can use ExecuteProcess to trigger it and connect its success relationship to the first ExecuteStreamCommand processor.
Flow:- ExecuteProcess (success) (1st script) --> ExecuteStreamCommand (output stream) (2nd script) --> ExecuteStreamCommand (output stream) (3rd script) --> ExecuteStreamCommand (output stream) (4th script) --> PutEmail
Since you want to trigger these scripts one by one, you can use either approach to put an email after executing the 4th script.
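For illustration only (the script path and contents are assumptions, not from the original post), each ExecuteStreamCommand processor's Command Path would point at a small shell script like the one below; the command's stdout is emitted on the output stream relationship, which you connect to the next ExecuteStreamCommand in the chain:

```bash
#!/bin/bash
# Hypothetical /scripts/step1.sh run by the first ExecuteStreamCommand processor.
# Its stdout becomes the flowfile routed to "output stream", which is wired to
# the next ExecuteStreamCommand (step2.sh), and so on until PutEmail.
set -e
echo "step 1 started at $(date)"
# ... the real work for this step goes here ...
echo "step 1 finished"
```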
10-30-2017
12:19 AM
2 Kudos
@Gurinderbeer Singh To resolve this error, find out the column or columns that are causing the issue: 1. Cast those columns. 2. Use the full cast select query in the ExecuteSQL processor.
Instead of select * from <table-name>, use a cast select statement:
select col1,col2,cast(col3 as decimal(n,m)) from <table-name> //give values for n (the total number of digits, i.e. precision) and m (the number of digits to the right of the decimal point, i.e. scale). Ex:- decimal(10,2) //10 total digits, 2 digits to the right of the decimal point.
(or)
select col1,col2,cast(col3 as varchar(n)) from <table-name> //if you are not sure about the decimal values, cast as varchar and specify a value for n (a varying-length column of length n). Ex:- varchar(30), varchar(20)...
ExecuteSQL Configs:-
10-29-2017
06:53 PM
2 Kudos
@Sushil Ks Yes, that's expected: if the table has ACID properties enabled, there can be a lot of delta files (3645 here) in the HDFS directory. You can check the files by using
bash# hadoop fs -count -v -t <table-location>
Each mapper loads one file, which is why 3645 mappers are launched. If there are a lot of delta files in the directory, you need to run major or minor compactions to reduce the number of mappers launched. A compaction takes a set of existing delta files and rewrites them to a single delta file.
Types of compactions in Hive:-
1. Minor compaction:- A ‘minor’ compaction takes all the delta files and rewrites them to a single delta file. This compaction won't take many resources.
hive# alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'minor';
Example:- Here par_buk is the table, dat is the partition column bucketed into 10 buckets, with 1 base file and 3 delta files.
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 4 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724389
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724390_314724390
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724391_314724391
hive# alter table par_buk partition(dat='2017-10-09_12') compact 'minor'; //minor compaction takes all the delta files and rewrites them to a single delta file
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 2 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxrwxrwx - hdfs 0 2017-10-29 14:20 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724391
As you can see, all the delta files were rewritten to a single delta file by the minor compaction.
2. Major compaction:- A ‘major’ compaction takes one or more delta files (as the minor compaction does) plus the base file for the bucket and rewrites them into a new base file per bucket. A major compaction is more expensive but more effective. It can take minutes to hours and can consume a lot of disk, network, memory and CPU resources, so it should be invoked carefully.
hive# alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'major';
Example:-
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 2 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxrwxrwx - hdfs 0 2017-10-29 14:20 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724391
hive# alter table par_buk partition(dat='2017-10-09_12') compact 'major'; //major compaction takes all the delta files and base files and rewrites them to a single new base file
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 1 items
drwxrwxrwx - hdfs 0 2017-10-29 14:34 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724391
As you can see, the major compaction rewrote the base file and delta files to a new base file per bucket.
If you want to see the status of compactions, you can use
hive# show compactions;
Once you run compactions, all delta files are rewritten to a single file, so far fewer mappers are launched. These compactions help you significantly increase query performance.
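As a small hedged sketch reusing the example partition above (the grep pipeline and hive -e invocation are illustrations, not part of the original answer), you can count the delta directories that drive the mapper count and watch the requested compaction move through its states:

```bash
# Count the delta_* directories in the partition; roughly one mapper is launched per file.
hdfs dfs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/ | grep -c 'delta_'

# Show compaction requests and their state (initiated / working / ready for cleaning).
hive -e "show compactions;"
```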
10-26-2017
05:16 PM
2 Kudos
@Yahya Najjar, Can you make sure you are referring to the correct controller service id, the one that has the validation query in it? I think you have added a dynamic property in the ExecuteScript processor with a controller service id in it, and you are using that same controller service for your fetch and load. In your question the controller service id is 7e50134b-1000-115f-d673-ccc9fbd344fe, but in the above screenshot the controller service id (588fa4bo...etc) is different. I think that is the reason why you are still facing issues..!!
10-26-2017
12:14 PM
@Yahya Najjar
Can you add a validation query in the connection pool? The validation query is used to validate connections before returning them: when a connection is invalid, it gets dropped and a new valid connection is returned. Query: select CURRENT_TIMESTAMP() This validation query takes care of invalid connections, dropping them and re-establishing connections, so the connection pool never ends up with disabled connections. Note:- Using a validation query might have some performance penalty.
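A hedged illustration (host, user, and database type are assumptions; the thread does not say which database the pool points at): if the pool targets MySQL, you can run the validation query once by hand to confirm the database accepts it before setting it on the DBCPConnectionPool:

```bash
# Run the validation query manually against a hypothetical MySQL host to make
# sure it succeeds before configuring it as the pool's Validation query.
mysql -h dbhost -u dbuser -p -e "select CURRENT_TIMESTAMP()"
```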
10-25-2017
06:40 PM
@dhieru singh
Yes, we can control this using the Scheduling tab of the processor. Set the Scheduling Strategy to either Timer driven or CRON driven so that the processor runs only at the specified time.
10-25-2017
03:41 AM
@PJ Yeah, it might be that case, because if you have a large number of records it will take a lot of time to convert the ORC data to CSV format. Comparing the two approaches, executing the query with insert overwrite directory performs much faster with no issues; we can also keep whatever delimiter we need and we don't need to worry about the size of the data.
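A minimal sketch of the insert overwrite directory approach mentioned above (the output path, table name, and delimiter are assumptions, not from the original thread):

```bash
# Export query results to a delimited text directory in HDFS instead of
# converting the ORC files to CSV outside Hive; pick whatever delimiter you need.
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/orc_export'
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         SELECT * FROM sample_orc_table"
```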
10-24-2017
09:58 PM
@dhieru singh In the processor's Scheduling tab, set the Scheduling Strategy to CRON driven and change the Run Schedule cron expression to 0 1 9 1/1 * ? * . The schedule follows your local timezone, so if you want to run at CST time you need to adjust the cron schedule for CST. Example:- If I need to run the processor at 9:01 AM CST and my local time is EST, then I need to schedule the processor to run at 10:01 AM EST, because EST is 1 hour ahead of CST. The expression would be 0 1 10 1/1 * ? * . You can build or evaluate cron expressions using the link below: http://www.cronmaker.com/
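As a quick hedged check of the timezone math (the zone names are the standard tz identifiers for Central and Eastern time, not something from the original post), you can compare the two clocks on the NiFi host to work out how far to shift the hour field:

```bash
# Print the current time in Central and Eastern time to confirm the one-hour
# offset used when shifting the cron expression's hour field.
TZ=America/Chicago date
TZ=America/New_York date
```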
10-24-2017
08:34 PM
@Karl Fredrickson, what I mean by 4:11 is the file's creation timestamp in the directory. For example:- bash# hdfs dfs -ls /user/yashu/test_fac/
Found 1 items
-rwxr-xr-x 3 hdfs hdfs 8 2017-10-24 04:11 /user/yashu/test_fac/000000_0
In this example the 000000_0 file was created at 2017-10-24 04:11 (timestamp), but the processor runs at 4:20, which means the 000000_0 file above is going to be listed in the 4:20 run. If the last modified date is earlier than 4:00 but someone put the files there at 4:11? Then ListSFTP won't create flowfiles, because it only pulls new files created after the timestamp stored in its state.
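As a small hedged sketch (the path is the example file from above; the -stat usage is an illustration, not from the original answer), you can print just the modification timestamp that list-type processors compare against the value kept in their state:

```bash
# Print the modification time and name of the example file; List* processors
# only emit flowfiles for entries newer than the timestamp stored in their state.
hdfs dfs -stat "%y %n" /user/yashu/test_fac/000000_0
```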