Member since: 09-25-2015
Posts: 112
Kudos Received: 37
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2036 | 12-21-2016 09:31 AM |
03-30-2024
06:56 PM
1 Kudo
If you're looking for more of an integration-testing approach to flow testing, I have built a framework that empowers flow developers to test flows without writing a single line of code: it deploys the target flow on a NiFi server running in a Docker container, runs the tests, generates a test report, and deletes the test flow. Medium article: https://medium.com/@dharmachand/comprehensive-guide-to-nifi-flow-testing-with-nipyapi-e44a61975be9 GitHub repo: https://github.com/dharmachand/nifi-testing Regards, Chand
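For a sense of what the framework automates, here is a minimal sketch of the deploy/run/tear-down loop using nipyapi (the library the article is built on). The host URL and process-group name are hypothetical, and the real framework's API may differ:

```python
# A minimal sketch of the deploy -> test -> tear down loop described above,
# using nipyapi. Host URL and process-group name are hypothetical examples.
import nipyapi

nipyapi.config.nifi_config.host = "http://localhost:8080/nifi-api"  # assumed NiFi-in-Docker endpoint

# Locate the flow under test (assumes it has already been deployed,
# e.g. from a template or a registry bucket).
pg = nipyapi.canvas.get_process_group("TargetFlow")  # hypothetical process-group name

# Start the flow, let it run, then assert on whatever the test needs
# (queue counts, output content, provenance events, ...).
nipyapi.canvas.schedule_process_group(pg.id, scheduled=True)
# ... run assertions here ...
nipyapi.canvas.schedule_process_group(pg.id, scheduled=False)

# Clean up so the test leaves no trace on the server.
nipyapi.canvas.delete_process_group(pg, force=True)
```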
02-23-2018
07:52 AM
Hey, I'm pretty confused about which storage format suits which type of data. You said "Parquet is well suited for data warehouse kind of solutions where aggregations are required on certain columns over a huge set of data", but I think that's true for ORC too. And as @owen said, ORC keeps indexes at 3 levels (versus 2 levels in Parquet), so shouldn't ORC be faster than Parquet for aggregations?
01-22-2019
04:27 PM
1 Kudo
This is an old question, but just thought of replying: you can do df.groupBy().pivot("pivotcolname").agg(...). Notice that the groupBy clause is empty.
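A minimal runnable PySpark illustration of that pattern; the column names and data below are made up for the example:

```python
# Pivoting a whole DataFrame into a single row: an empty groupBy() makes
# every distinct "pivotcolname" value a column, aggregated over all rows.
# Column names and data here are made-up examples.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("pivot-example").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)],
    ["pivotcolname", "value"],
)

# One output row; one column per distinct pivotcolname value.
df.groupBy().pivot("pivotcolname").agg(F.sum("value")).show()
# +---+---+
# |  a|  b|
# +---+---+
# |  4|  2|
# +---+---+
```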
02-13-2017
07:21 PM
1 Kudo
There's also the documentation here: https://hortonworks.github.io/hdp-aws/s3-spark/
12-21-2016
09:31 AM
@Mridul M Thanks for the reply. Actually, I was using an old version of spark-testing-base and didn't know there was a 2.0.2 version. Holden Karau pointed me to Maven Central, where I found the latest version of spark-testing-base. The GitHub README of spark-testing-base mentioned 1.6, so I assumed the latest version was 1.6. But now it's sorted, and Spark 2.0.2 DataFrame testing works for me. In case anyone needs this: it also needs the Hive dependency. Since Spark 2.0 testing is still new and I couldn't find forum threads on it, I'm posting this in the hope of saving time for other developers 🙂
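The post doesn't include its actual test setup, so as an illustration only, here is a hedged sketch of a DataFrame test in plain PySpark with Hive support enabled (spark-testing-base's DataFrame helpers back their session with Hive, which is why the extra Hive dependency is needed). This is not the spark-testing-base API itself:

```python
# A minimal pytest-style DataFrame test. enableHiveSupport() mirrors the
# Hive dependency mentioned above; everything else is a made-up example.
from pyspark.sql import SparkSession

def test_dataframe_transformation():
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("df-test")
        .enableHiveSupport()  # requires the Spark Hive artifact on the classpath
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
        result = df.filter(df.id > 1).collect()
        assert [row.label for row in result] == ["b"]
    finally:
        spark.stop()
```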
12-08-2016
11:02 AM
If it solves your question, please mark it as accepted so that it shows as resolved in the overview. Thanks
08-31-2016
12:44 PM
2 Kudos
The NiFi UI has a Summary page available from the top-right menu, as well as per-processor stats available by right-clicking a processor and selecting Status History. Both of those views show things like bytes read/written, flow files in/out, etc. If you want to track this information somewhere else, you can implement a custom ReportingTask to push it to an external system. This is how Ambari is able to display the stats/graphs; there is an AmbariReportingTask: https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-ambari-bundle/nifi-ambari-reporting-task/src/main/java/org/apache/nifi/reporting/ambari/AmbariReportingTask.java
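If a full custom ReportingTask is more than you need, the same counters can also be polled over NiFi's REST API. A hedged sketch follows; the host and processor id are hypothetical, and the JSON field names should be verified against your NiFi version's ProcessorStatusSnapshot DTO:

```python
# A hedged sketch of polling the read/written and flowfile in/out counters
# shown in Status History via the REST API. Host and id are placeholders.
import requests

NIFI = "http://localhost:8080/nifi-api"   # assumed NiFi endpoint
PROCESSOR_ID = "your-processor-id"        # hypothetical processor id

resp = requests.get(f"{NIFI}/flow/processors/{PROCESSOR_ID}/status")
resp.raise_for_status()
snapshot = resp.json()["processorStatus"]["aggregateSnapshot"]

# Field names assumed from the processor status DTO; check your version.
print(snapshot["bytesRead"], snapshot["bytesWritten"])
print(snapshot["flowFilesIn"], snapshot["flowFilesOut"])
```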
02-23-2017
05:57 AM
@Bryan Thanks, Bryan. I will try this and let you know whether it works.
08-16-2016
01:44 PM
3 Kudos
I think it depends on what you mean by "schedule it to run every hour"... NiFi itself is always running, and individual processors are scheduled according to their needs. Every processor supports timer-based or cron-based scheduling, so using either of those you can set a source processor to run every hour. You could also use the REST API to start and stop processors as needed; anything you can do in the UI can be done through the REST API. For best practices for upgrading NiFi, see this wiki page: https://cwiki.apache.org/confluence/display/NIFI/Upgrading+NiFi For deploying changes to production there are a couple of approaches; one of them is based around templates: https://github.com/aperepel/nifi-api-deploy Some people also just move flow.xml.gz from one environment to another, but this assumes you have parameterized everything that differs between environments.
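As a hedged illustration of the REST-API route, here is a minimal sketch using the nipyapi client library (a wrapper over the same REST endpoints); the host URL and processor name are hypothetical:

```python
# Starting and stopping a processor over the REST API via nipyapi.
# Host and processor name are made-up examples.
import nipyapi

nipyapi.config.nifi_config.host = "http://localhost:8080/nifi-api"  # assumed endpoint

proc = nipyapi.canvas.get_processor("GetFile-hourly")  # hypothetical processor name

nipyapi.canvas.schedule_processor(proc, scheduled=True)   # equivalent to "start" in the UI
# ... let it run, or gate this on your own external trigger ...
nipyapi.canvas.schedule_processor(proc, scheduled=False)  # equivalent to "stop"
```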
08-16-2016
01:59 PM
2 Kudos
In general, concurrent tasks is the number of threads calling onTrigger for an instance of a processor. In a cluster, if you set concurrent tasks to 4, then it is 4 threads on each node of your cluster. I am not as familiar with all the ins and outs of the Kafka processors, but GetKafka does something like this:

```java
// Create one Kafka message stream per concurrent task.
int concurrentTaskToUse = context.getMaxConcurrentTasks();
final Map<String, Integer> topicCountMap = new HashMap<>(1);
topicCountMap.put(topic, concurrentTaskToUse);
final Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
        consumer.createMessageStreams(topicCountMap);
```

The consumer is from the Kafka 0.8 client, so it is creating a message stream for each concurrent task. When the processor is triggered, it takes one of those message streams and consumes a message, and since multiple concurrent tasks are triggering the processor, it consumes from each of those streams in parallel. As far as rebalancing, I think the Kafka client handles that transparently to NiFi, but I am not totally sure. Messages that have already been pulled into a NiFi node will stay there until the node is back up and processing.