Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11126 | 04-15-2020 05:01 PM |
| | 7028 | 10-15-2019 08:12 PM |
| | 3074 | 10-12-2019 08:29 PM |
| | 11260 | 09-21-2019 10:04 AM |
| | 4191 | 09-19-2019 07:11 AM |
10-30-2017
12:51 PM
@sally sally Yes, you can use the FetchHDFS processor, but the flowfile needs ${path} and ${filename} attributes associated with it; FetchHDFS reads those attributes and then works as expected. To add them, use UpdateAttribute, ExtractText, or a similar processor to set the ${path} and ${filename} attributes on the flowfile.
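As a minimal sketch (the directory and file name below are hypothetical, not from this thread): in UpdateAttribute you would add two properties, path = /user/nifi/incoming and filename = data.csv, and FetchHDFS would then resolve ${path}/${filename}. You can confirm the target exists from the shell:

```bash
# Hypothetical path/filename matching the ${path} and ${filename} attributes
# set in UpdateAttribute; FetchHDFS would fetch this exact HDFS file.
hdfs dfs -ls /user/nifi/incoming/data.csv
```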
10-30-2017
03:00 AM
1 Kudo
@Shakeel Ahmad
Use the ExecuteStreamCommand processor instead of the ExecuteProcess processor, because ExecuteProcess does not allow any incoming connections while ExecuteStreamCommand accepts them. Connect the output stream relationship of one ExecuteStreamCommand processor to the next one, so that the next ExecuteStreamCommand triggers only when the first has completed; after the 4th process completes, add a PutEmail notification. ExecuteStreamCommand configs:- Flow:- I have added 2 ExecuteStreamCommand processors here; you need to add 4 of them and then PutEmail. To trigger the first ExecuteStreamCommand processor, use a GenerateFlowFile processor and schedule it to run periodically. Alternatively, instead of using GenerateFlowFile to trigger the flow, you can use ExecuteProcess to trigger it and connect its success relationship to the first ExecuteStreamCommand processor.
Flow:- ExecuteProcess (success) (1st script) --> ExecuteStreamCommand (output stream) (2nd script) --> ExecuteStreamCommand (output stream) (3rd script) --> ExecuteStreamCommand (output stream) (4th script) --> PutEmail
Since you want to trigger these scripts one by one, you can use either approach to put an email after executing the 4th script.
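For illustration only (the script path and contents are assumptions, not from the original post), each ExecuteStreamCommand processor's Command Path would point at a small shell script like the one below; the command's stdout is emitted on the output stream relationship, which you connect to the next ExecuteStreamCommand in the chain:

```bash
#!/bin/bash
# Hypothetical /scripts/step1.sh run by the first ExecuteStreamCommand processor.
# Its stdout becomes the flowfile routed to "output stream", which is wired to
# the next ExecuteStreamCommand (step2.sh), and so on until PutEmail.
set -e
echo "step 1 started at $(date)"
# ... the real work for this step goes here ...
echo "step 1 finished"
```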
10-30-2017
12:19 AM
2 Kudos
@Gurinderbeer Singh To resolve this error, find out the column or columns that are causing the issue: 1. Cast those columns. 2. Use the full cast select query in the ExecuteSQL processor.
Instead of select * from <table-name>, use a cast select statement:
select col1,col2,cast(col3 as decimal(n,m)) from <table-name> //give values for n (the total number of digits, i.e. precision) and m (the number of digits to the right of the decimal point, i.e. scale). Ex:- decimal(10,2) //10 total digits, 2 digits to the right of the decimal point.
(or)
select col1,col2,cast(col3 as varchar(n)) from <table-name> //if you are not sure about the decimal values, cast as varchar and specify a value for n (a varying-length column of length n). Ex:- varchar(30), varchar(20)...
ExecuteSQL Configs:-
10-29-2017
06:53 PM
2 Kudos
@Sushil Ks Yes, that's expected: if the table has ACID properties enabled, there can be a lot of delta files (3645 here) in the HDFS directory. You can check the files by using
bash# hadoop fs -count -v -t <table-location>
Each mapper loads one file, which is why 3645 mappers are launched. If there are a lot of delta files in the directory, you need to run major or minor compactions to reduce the number of mappers launched. A compaction takes a set of existing delta files and rewrites them to a single delta file.
Types of compactions in Hive:-
1. Minor compaction:- A ‘minor’ compaction takes all the delta files and rewrites them to a single delta file. This compaction won't take many resources.
hive# alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'minor';
Example:- Here par_buk is the table, dat is the partition column bucketed into 10 buckets, with 1 base file and 3 delta files.
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 4 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724389
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724390_314724390
drwxr-xr-x - hdfs 0 2017-10-29 14:19 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724391_314724391
hive# alter table par_buk partition(dat='2017-10-09_12') compact 'minor'; //minor compaction takes all the delta files and rewrites them to a single delta file
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 2 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxrwxrwx - hdfs 0 2017-10-29 14:20 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724391
As you can see, all the delta files were rewritten to a single delta file by the minor compaction.
2. Major compaction:- A ‘major’ compaction takes one or more delta files (as the minor compaction does) plus the base file for the bucket and rewrites them into a new base file per bucket. A major compaction is more expensive but more effective. It can take minutes to hours and can consume a lot of disk, network, memory and CPU resources, so it should be invoked carefully.
hive# alter table <table-name> partition(<partition-name>,<nested-partition-name>,..) compact 'major';
Example:-
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 2 items
drwxrwxrwx - hdfs 0 2017-10-29 14:14 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724388
drwxrwxrwx - hdfs 0 2017-10-29 14:20 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/delta_314724389_314724391
hive# alter table par_buk partition(dat='2017-10-09_12') compact 'major'; //major compaction takes all the delta files and base files and rewrites them to a single new base file
bash# hadoop fs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/
Found 1 items
drwxrwxrwx - hdfs 0 2017-10-29 14:34 /apps/hive/warehouse/par_buk/dat=2017-10-09_12/base_314724391
As you can see, the major compaction rewrote the base file and delta files to a new base file per bucket.
If you want to see the status of compactions, you can use
hive# show compactions;
Once you run compactions, all delta files are rewritten to a single file, so far fewer mappers are launched. These compactions help you significantly increase query performance.
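As a small hedged sketch reusing the example partition above (the grep pipeline and hive -e invocation are illustrations, not part of the original answer), you can count the delta directories that drive the mapper count and watch the requested compaction move through its states:

```bash
# Count the delta_* directories in the partition; roughly one mapper is launched per file.
hdfs dfs -ls /apps/hive/warehouse/par_buk/dat=2017-10-09_12/ | grep -c 'delta_'

# Show compaction requests and their state (initiated / working / ready for cleaning).
hive -e "show compactions;"
```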
10-26-2017
05:16 PM
2 Kudos
@Yahya Najjar, Can you make sure you are referring to the correct controller service id, the one that has the validation query in it? I think you have added a dynamic property in the ExecuteScript processor with a controller service id in it, and you are using that same controller service for your fetch and load. In your question the controller service id is 7e50134b-1000-115f-d673-ccc9fbd344fe, but in the above screenshot the controller service id (588fa4bo...etc) is different. I think that is the reason why you are still facing issues..!!
10-26-2017
12:14 PM
@Yahya Najjar
Can you add a validation query in the connection pool? The validation query is used to validate connections before returning them: when a connection is invalid, it gets dropped and a new valid connection is returned. Query: select CURRENT_TIMESTAMP() This validation query takes care of invalid connections, dropping them and re-establishing connections, so the connection pool never ends up with disabled connections. Note:- Using a validation query might have some performance penalty.
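A hedged illustration (host, user, and database type are assumptions; the thread does not say which database the pool points at): if the pool targets MySQL, you can run the validation query once by hand to confirm the database accepts it before setting it on the DBCPConnectionPool:

```bash
# Run the validation query manually against a hypothetical MySQL host to make
# sure it succeeds before configuring it as the pool's Validation query.
mysql -h dbhost -u dbuser -p -e "select CURRENT_TIMESTAMP()"
```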
10-25-2017
06:40 PM
@dhieru singh
Yes, we can control this using the Scheduling tab of the processor. Set the Scheduling Strategy to either Timer driven or CRON driven so that the processor runs only at the specified time.
10-25-2017
03:41 AM
@PJ Yeah, it might be that case, because if you have a large number of records it will take a lot of time to convert the ORC data to CSV format. Comparing the two approaches, executing the query with insert overwrite directory performs much faster with no issues; we can also keep whatever delimiter we need and we don't need to worry about the size of the data.
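A minimal sketch of the insert overwrite directory approach mentioned above (the output path, table name, and delimiter are assumptions, not from the original thread):

```bash
# Export query results to a delimited text directory in HDFS instead of
# converting the ORC files to CSV outside Hive; pick whatever delimiter you need.
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/orc_export'
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         SELECT * FROM sample_orc_table"
```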
10-24-2017
09:58 PM
@dhieru singh In the processor's Scheduling tab, set the Scheduling Strategy to CRON driven and change the Run Schedule cron expression to 0 1 9 1/1 * ? * . The schedule follows your local timezone, so if you want to run at CST time you need to adjust the cron schedule for CST. Example:- If I need to run the processor at 9:01 AM CST and my local time is EST, then I need to schedule the processor to run at 10:01 AM EST, because EST is 1 hour ahead of CST. The expression would be 0 1 10 1/1 * ? * . You can build or evaluate cron expressions using the link below: http://www.cronmaker.com/
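As a quick hedged check of the timezone math (the zone names are the standard tz identifiers for Central and Eastern time, not something from the original post), you can compare the two clocks on the NiFi host to work out how far to shift the hour field:

```bash
# Print the current time in Central and Eastern time to confirm the one-hour
# offset used when shifting the cron expression's hour field.
TZ=America/Chicago date
TZ=America/New_York date
```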
10-24-2017
08:34 PM
@Karl Fredrickson, what I mean by 4:11 is the file's creation timestamp in the directory. For example:- bash# hdfs dfs -ls /user/yashu/test_fac/
Found 1 items
-rwxr-xr-x 3 hdfs hdfs 8 2017-10-24 04:11 /user/yashu/test_fac/000000_0
In this example the 000000_0 file was created at 2017-10-24 04:11 (timestamp), but the processor runs at 4:20, which means the 000000_0 file above is going to be listed in the 4:20 run. If the last modified date is earlier than 4:00 but someone put the files there at 4:11? Then ListSFTP won't create flowfiles, because it only pulls new files created after the timestamp stored in its state.
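As a small hedged sketch (the path is the example file from above; the -stat usage is an illustration, not from the original answer), you can print just the modification timestamp that list-type processors compare against the value kept in their state:

```bash
# Print the modification time and name of the example file; List* processors
# only emit flowfiles for entries newer than the timestamp stored in their state.
hdfs dfs -stat "%y %n" /user/yashu/test_fac/000000_0
```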