Member since: 07-25-2016
Posts: 55
Kudos Received: 28
Solutions: 1
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4107 | 07-26-2016 12:31 AM
12-20-2016
07:40 AM
Thanks @mgilman, that was helpful. For the ReportingTask, how frequently should I run it? Let's say I run it every 10 seconds and we receive 100 error/warn/info bulletins every minute (just for example): do you think we might lose messages, or are we guaranteed to receive all bulletins?
12-19-2016
05:02 PM
Thanks mgilman: Yes, that is what I am trying to achieve: a ReportingTask that fetches events from the BulletinRepository. The issue is that it always returns at most 5, so I cannot report all errors (assume you have more than 5 errors for a processor, etc.). Which API should I use to extract all bulletins? Essentially, how do I "exfil the bulletins as they occur"? It would be great if you could elaborate, thanks.
12-18-2016
08:52 PM
@Bryan Bende: I wrote a ReportingTask to retrieve messages from the BulletinRepository and report Error/Warn/Info metrics; however, the repository always returns at most 5 messages per component, no matter how I use the API 😞 (https://community.hortonworks.com/questions/72411/nifi-bulletinrepository-api-returns-maximum-5-bull.html#answer-72488). Do you think this might be by design? Would you recommend the NiFi REST API, or should I look for another solution (start monitoring the actual logs, etc.)?
12-18-2016
08:46 PM
OK, I verified that I am actually still getting a total of 5 messages, no matter how many times I call it; apologies! So now I need to rethink how I monitor errors. Do you think I would be able to get all error messages through the REST API? Or should I monitor the logs for errors?
12-18-2016
08:17 PM
1 Kudo
UPDATE: THE FOLLOWING SOLUTION DOESN'T WORK: we always get at most 5 messages. Thanks @Timothy Spann. Here is what I found out: there is a bulletin id maintained for each bulletin message, and it is always increasing. By using .after(id) I am able to fetch all messages in repeated calls:

/**
 * Retrieve bulletins from the BulletinRepository.
 * These bulletin messages are used to generate system health metrics (Errors/Warns/Info).
 *
 * @param context            the reporting context that supplies the repository
 * @param componentType      the type of component to query (e.g. PROCESSOR)
 * @param previousBulletinId only bulletins with an id after this value are fetched
 * @param maxBulletins       upper bound on the number of bulletins to collect
 * @return the bulletins found
 */
public static List<Bulletin> findBulletins(ReportingContext context, ComponentType componentType,
        long previousBulletinId, final int maxBulletins) {
    final List<Bulletin> bulletinsList = new ArrayList<>();
    final BulletinRepository repository = context.getBulletinRepository();
    int bulletinsFound;
    do {
        bulletinsFound = 0;
        final BulletinQuery query = new BulletinQuery.Builder()
                .sourceType(componentType)
                .after(previousBulletinId)
                .build();
        final List<Bulletin> bulletinsThisQuery = repository.findBulletins(query);
        if (bulletinsThisQuery != null && !bulletinsThisQuery.isEmpty()) {
            bulletinsFound = bulletinsThisQuery.size();
            previousBulletinId = bulletinsThisQuery.get(0).getId(); // remember the newest bulletin id seen
            bulletinsList.addAll(bulletinsThisQuery);
        }
    } while (bulletinsFound > 0 && bulletinsList.size() < maxBulletins);
    return bulletinsList;
}
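For what it's worth, the chaining idea itself (remember the highest bulletin id seen and query after it) can be illustrated outside NiFi with a stub store. This is only a sketch of the pagination pattern under assumed oldest-first, 5-per-page ordering, not the NiFi API, and as the update above notes it does not get around the repository's 5-message cap:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Stand-alone sketch of the "query after the last seen id" pattern from the
 * snippet above. The stub store, oldest-first ordering, and 5-item page cap
 * are assumptions for illustration; this is not the NiFi BulletinRepository API.
 */
public class AfterIdPagination {

    static final int PAGE_LIMIT = 5; // assumed per-query cap, mimicking the observed behavior

    /** Returns up to PAGE_LIMIT ids strictly greater than afterId, oldest first. */
    static List<Long> pageAfter(List<Long> store, long afterId) {
        List<Long> page = new ArrayList<>();
        for (long id : store) { // store is ordered oldest -> newest
            if (id > afterId) {
                page.add(id);
                if (page.size() == PAGE_LIMIT) {
                    break;
                }
            }
        }
        return page;
    }

    /** Drains everything after previousId by advancing to the highest id of each page. */
    static List<Long> drainAfter(List<Long> store, long previousId) {
        List<Long> all = new ArrayList<>();
        List<Long> page;
        while (!(page = pageAfter(store, previousId)).isEmpty()) {
            previousId = page.get(page.size() - 1); // highest id in an oldest-first page
            all.addAll(page);
        }
        return all;
    }

    public static void main(String[] args) {
        List<Long> store = new ArrayList<>();
        for (long id = 1; id <= 12; id++) {
            store.add(id); // bulletin ids only ever increase
        }
        System.out.println(drainAfter(store, 0L).size()); // prints 12
    }
}
```

The key design point is that the cursor must advance to the highest id in each page; advancing to any other element either re-fetches or skips bulletins.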
12-18-2016
01:12 AM
1 Kudo
Hi, I have written a ReportingTask service for NiFi, where I use BulletinRepository.findBulletins(queryProcessor) to retrieve all bulletins and report metrics on the number of errors/warns/infos. The issue is that I receive at most 5 bulletins/messages per component, even though more messages are viewable through the Bulletin UI. This is how I construct and execute my query:

BulletinRepository repository = context.getBulletinRepository();
final BulletinQuery queryProcessor = new BulletinQuery.Builder()
        .sourceType(ComponentType.PROCESSOR)
        .limit(500)
        .build();
bulletinsList.addAll(repository.findBulletins(queryProcessor));

How do I get ALL the messages from the bulletin repository rather than just 5? Thanks, Obaid
Labels: Apache NiFi
11-29-2016
06:37 AM
Following what Bryan Bende mentioned (in the case of a cluster), you need to make sure all cluster nodes are part of the policy. In my case, I created a new group 'Cluster' and added all the nodes to it. Then I added this group to a process group (for the 'view the data' and 'modify the data' policies).
11-15-2016
06:17 AM
Awesome, thanks for the pointers,
11-15-2016
06:13 AM
@Joshua Adeleke: Yes, I was able to send email alerts through PutEmail (some time back); however, I don't use it for alerts (actually I am still looking for a better solution). Current approach: I implemented a ReportingTask to send metrics to InfluxDB (particularly success/failure connection metrics) and use Kapacitor for alerts (you could use any other system to monitor the metrics). So, for example, you could issue alerts if flowfiles land on any processor's failure relationship (this doesn't work all the time, since some processors don't have failure relationships). For a better approach, though, check out Bryan's comments above!
11-15-2016
12:13 AM
Thanks a lot @Bryan Bende for sharing your thoughts. - Could you also recommend how we should monitor dataflows to detect all failures? (Other than PutEmail, would you also recommend monitoring the NiFi logs, or do you think PutEmail is a good enough solution?) - Another idea I wanted to discuss/share: write a reporting task that reports failures/errors to a configured email/Slack channel, etc. This way you would not need to hook up PutEmail to each processor (if you have many processors, connecting them all to PutEmail makes the flow look complicated), and by default you could get alerts for any failure without configuring/changing flows. Any thoughts? Thanks again, Obaid
11-11-2016
07:32 PM
Thanks, that was very helpful,
11-11-2016
07:13 PM
1 Kudo
Hi, I am using QueryDatabaseTable (on NiFi 1.0) to fetch rows from a MySQL DB table. To reduce redundant executions, I have scheduled it on the primary node. - However, doing so prevents me from scheduling this processor at a fixed time, since there is no option to do so (like the CRON option, where you can specify a fixed time, say 9 AM every day). I want to schedule it at a fixed time, like 9 AM every day. Is that possible? Any workaround? Thanks, Obaid
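For reference, where NiFi's CRON-driven scheduling strategy is available, it takes a Quartz-style expression (fields: seconds, minutes, hours, day of month, month, day of week); "9 AM every day" would be written as:

```
0 0 9 * * ?
```

Whether that strategy can be combined with primary-node-only execution depends on the NiFi version in use, which is exactly the limitation this question is about.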
Labels: Apache NiFi
11-11-2016
05:29 PM
1 Kudo
Hi All, I tried using ExecuteSQL to select rows from my MySQL table: CREATE TABLE table_name (
domain_id mediumint(8) UNSIGNED NOT NULL,
....
run_date timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
) And get following error (This error is caused by domain_id column, i.e I get the error only if I include domain_id in the select query, otherwise it works just fine): 09:24:25 PST ERROR fa363402-bbc0-1802-ffff-ffff96aa31f4 <host>:9090
ExecuteSQL[id=fa363402-bbc0-1802-ffff-ffff96aa31f4] ExecuteSQL[id=fa363402-bbc0-1802-ffff-ffff96aa31f4] failed to process session due to org.apache.avro.file.DataFileWriter$AppendWriteException: org.apache.avro.UnresolvedUnionException: Not in union ["null","long"]: 141419: org.apache.avro.file.DataFileWriter$AppendWriteException: org.apache.avro.UnresolvedUnionException: Not in union ["null","long"]: 141419
Is this a bug or I might not be using this correctly? Is there any workaround for this issue? Thanks Obaid
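Not a fix for the root cause, but a workaround sometimes suggested for unsigned MySQL columns in ExecuteSQL (an assumption here, not verified against this schema) is to cast the column in the query so the JDBC driver hands back a type the Avro writer can fit into the union:

```sql
-- hypothetical rewrite of the failing query; table_name/domain_id taken from the DDL above
SELECT CAST(domain_id AS SIGNED) AS domain_id, run_date
FROM table_name;
```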
Labels: Apache NiFi
11-11-2016
12:24 AM
Hi all, we have been using Jenkins for scheduling jobs; it is easy to schedule a job (or jobs, define dependencies, etc.) and keep track of each run, i.e. if a job fails you get an alert. So for operations teams, Jenkins is an easy platform for managing and tracking jobs. I have the following questions: 1. For scheduling jobs, which is the better tool: Jenkins or NiFi? 2. How can you operationalize a dataflow the way you can in Jenkins? Meaning, if any individual dataflow fails, the operations team gets an alert, so they have complete visibility into each run. 3. Can I (or should I; does it sound reasonable?) use Jenkins to launch dataflows on NiFi, just to give the operations team a single UI for tracking all jobs? 4. How can we track the status of each dataflow run on NiFi? Thanks, Obaid
Labels: Apache NiFi
11-09-2016
04:20 PM
Hi all, I have just started playing around with Apache Ambari. The goal is to install HDF 2.x. I have one question: I have an existing ZooKeeper cluster which is already used by other applications, and I want to use this cluster for HDF. Is there a way to tell Ambari to use an existing ZooKeeper cluster (without Ambari installing anything, since the ZooKeeper cluster is already up and running)? Thanks, Obaid
Labels:
11-04-2016
07:42 PM
Hi all, I have a CSV file which contains 3 empty lines at the end. Is there a way to remove these from the end of the file? I mean, the file has multiple rows and I don't split the file; I was wondering if I could remove the empty lines without splitting. Thanks, Obaid
Labels: Apache NiFi
11-03-2016
01:35 PM
sure, no problem
11-03-2016
01:33 PM
Great, thanks for your response. Do you think there is a relationship between cores and RAM, meaning if you have X cores then you should have a corresponding amount of RAM; is there any dependency or good practice? We can think in terms of minimum requirements, assuming we will be running a lot of lightweight flows (batch, scheduled). More cores will let us run more flows, so I am just wondering whether 32 GB RAM will be enough for 20 cores if I go for HDF 2.x. Say in the future all 20 cores become busy; would RAM then be an issue? Thanks
11-01-2016
10:31 PM
Hi all, - This seems like an obvious question, so forgive me if it is redundant: what hardware configuration would be suitable for setting up HDF 2.x on VMs for an 8-node cluster? - I found an old document which does help: link - It seems like NiFi might need more cores vs. RAM. My current setup of 12 GB/node and 6 cores/node is not working (note: the master has 6 GB RAM, which seems like a bottleneck). - After going through the link, I am thinking of the following, but am not sure whether it is optimal: 24 cores vs. 20 GB RAM vs. 250-500 GB disk. Does that seem like an optimal configuration (considering the ratio: more cores vs. more RAM)? To give more context, I currently don't have any specific throughput requirements and am using NiFi for some batch jobs/log processing, etc.; however, I do want a stable cluster setup which we could also use in the future if usage increases. Thanks, Obaid
Labels:
10-29-2016
06:20 PM
Thanks a lot @Attila Kanto for the detailed response. Let me ask another cost-related question, which is an important factor in deciding which technology to use: how would you compare EMR vs. Cloudbreak (or Hortonworks Data Cloud) in terms of cost? Obaid
10-24-2016
11:58 PM
Hi, I just recently came across Hortonworks Data Cloud: http://hortonworks.github.io/hdp-aws/. I was curious whether it can also launch an HDF cluster (basically a NiFi cluster); if not, is there a plan to add support for it? Thanks, Obaid
Labels: Apache NiFi, Hortonworks Cloudbreak
10-24-2016
11:22 PM
Thanks @Dominika B for sharing the link; it seems interesting. I have a very basic question: Amazon EMR lets you launch managed Hadoop and Spark clusters, so what would be the benefit of using Hortonworks Data Cloud over just using EMR? Thanks, Obaid
10-23-2016
08:04 AM
3 Kudos
Hi all, I am a newbie to HDP and Cloudbreak. I want to move some of our on-site Hadoop clusters/jobs to AWS. Two solutions that I have come across are Cloudbreak and EMR, but I am not sure which one to use. I wanted to know which technology to use for launching Hadoop jobs on AWS. Pros and cons of either approach would be really helpful (in terms of cost, ease of use, monitoring, metrics, latency, etc.). One apparent cost-optimization feature I am interested in is launching the cluster whenever a job (or jobs) needs to run, and killing the cluster/nodes whenever there are no more jobs to execute. Thanks, Obaid
Labels: Apache Ambari, Hortonworks Cloudbreak
09-27-2016
08:28 PM
1 Kudo
Hi all, I have a scenario where I want to trigger a flow to start processing whenever there is some data available on S3 to process. In such a scenario, all processors would be event-driven (except for the trigger) and only run if there is data to process (or we somehow trigger them to start processing). Scenario: - Whenever a file (or files) lands on S3, launch SQL queries (create table, copy data, etc.). So, what would be a good way to define such a trigger? Thanks
Labels: Apache NiFi
09-22-2016
05:34 PM
Thanks a lot @Matt Burgess and @Pierre Villard for a quick response,
09-22-2016
05:25 PM
Hi, I have a scenario where I want to ignore flowfiles if an attribute contains an invalid value (e.g. the filename attribute holds the name of a directory rather than a filename). Is there a way to completely discard/ignore a FlowFile based on an attribute value? Thanks, Obaid
Labels: Apache NiFi
09-21-2016
07:58 PM
3 Kudos
Hi, I have a JSON message 'store' which contains an array of 'books'. I want to calculate the sum/average of all book prices. Is there a way to do this in NiFi? I have explored JsonPath and JOLT, but so far I haven't found a way. Thanks. Input: { "store": {
"books": [
{ "category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{ "category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{ "category": "fiction",
"author": "Herman Melville",
"title": "Moby Dick",
"isbn": "0-553-21311-3",
"price": 8.99
},
{ "category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
]
}
}
Output: Sum of all prices: 53.92
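As a quick sanity check of the expected total outside NiFi, a throwaway Java sketch that pulls the price fields out of the sample with a regex (fine for this fixed input, not a general JSON approach) reproduces 53.92:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sanity check of the expected output above: sum every "price" value in the
 * sample. A regex is enough for this fixed sample; a real flow should use a
 * proper JSON parser instead.
 */
public class PriceSum {

    public static double sumPrices(String json) {
        Matcher m = Pattern.compile("\"price\"\\s*:\\s*([0-9.]+)").matcher(json);
        double sum = 0.0;
        while (m.find()) {
            sum += Double.parseDouble(m.group(1)); // add each matched price
        }
        return sum;
    }

    public static void main(String[] args) {
        String sample = "{ \"books\": [ {\"price\": 8.95}, {\"price\": 12.99},"
                + " {\"price\": 8.99}, {\"price\": 22.99} ] }";
        System.out.printf("Sum of all prices : %.2f%n", sumPrices(sample)); // Sum of all prices : 53.92
    }
}
```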
Labels: Apache NiFi
09-04-2016
08:22 PM
@Sam Hjelmfelt So far I have not been able to find a feasible way to send alerts through a NiFi cluster, and I am curious how I should deploy alerts in a production NiFi cluster. Thanks
09-03-2016
09:10 AM
Thanks @Sam Hjelmfelt for your reply. Yes, if the data lands on the primary node, PutEmail works as expected. However, if the data lands on a slave node, no email is generated and the flowfiles get stuck in the connection forever (i.e. the slave nodes are not able to talk to the primary node). The following is an example flow (the template is attached, please check it out): in the dataflow below, we generate flowfiles, run MergeContent (every 20 seconds), and then pass the result on to two PutEmail processors in parallel. The first PutEmail runs on the primary node, whereas the second PutEmail processor runs on all slave nodes (timer-driven). For PutEmail on the primary, it seems that of 9 generated files only 1 got processed, whereas 8 got stuck in the connection (it seems the slave nodes cannot talk to the primary). The second PutEmail worked just fine, i.e. it processed all 9 flowfiles. So, is there a way to generate 1 email alert if a processor fails in a cluster? PS: putemaillimitations.xml
09-02-2016
11:00 PM
2 Kudos
Hi, I am trying to use PutEmail in my workflow to send an email alert whenever something fails. I have 8 slave nodes and my dataflow runs on all slaves (not just the primary node). The issue is that I get multiple emails if one processor has errors. I think this is because we have 8 slaves, so PutEmail runs on all 8 of them and I therefore get multiple emails. - Is there a way to ensure that we always get 1 email instead of 8? Thanks, Obaid
Labels: Apache NiFi