Member since: 09-25-2015
Posts: 112
Kudos Received: 37
Solutions: 1

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1985 | 12-21-2016 09:31 AM |
03-30-2024
06:56 PM
1 Kudo
If you're looking for more of an integration-testing approach to flow testing, I have built a framework that lets flow developers test flows without writing a single line of code: it deploys the target flow on a NiFi server running in a Docker container, runs the tests, generates a test report, and deletes the test flow.
Medium article: https://medium.com/@dharmachand/comprehensive-guide-to-nifi-flow-testing-with-nipyapi-e44a61975be9
GitHub repo: https://github.com/dharmachand/nifi-testing
Regards, Chand
02-23-2018
07:52 AM
Hey, I am pretty confused about which storage format is suited for which type of data. You said "Parquet is well suited for data warehouse kind of solutions where aggregations are required on certain column over a huge set of data," but I think that is true for ORC too. And as @owen said, ORC contains indexes at 3 levels (2 levels in Parquet), so shouldn't ORC be faster than Parquet for aggregations?
01-22-2019
04:27 PM
1 Kudo
This is an old question, but I just thought of replying: you can do df.groupBy().pivot("pivotcolname").agg(...). Notice that the groupBy clause is empty.
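For illustration, here is a minimal runnable sketch in Java (the column names "pivotcolname" and "value", the JSON input path, and the sum aggregation are all assumed for the example, not taken from the original question):

```java
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PivotWithoutGroupBy {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pivot-without-groupby")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical input with columns "pivotcolname" and "value".
        Dataset<Row> df = spark.read().json("input.json");

        // An empty groupBy() collapses the whole DataFrame into a single row,
        // producing one column per distinct value of "pivotcolname".
        Dataset<Row> pivoted = df.groupBy()
                .pivot("pivotcolname")
                .agg(sum("value"));

        pivoted.show();
    }
}
```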
02-13-2017
07:21 PM
1 Kudo
There's also the documentation here: https://hortonworks.github.io/hdp-aws/s3-spark/
12-21-2016
09:31 AM
@Mridul M Thanks for the reply. Actually, I was using an old version of spark-testing-base and didn't know there was a 2.0.2 version. Holden Karau pointed me to Maven Central, where I found the latest version of spark-testing-base. The GitHub README of spark-testing-base mentioned 1.6, so I assumed the latest version was 1.6. But now it's sorted, and Spark 2.0.2 DataFrame testing works for me. In case people need this: it also requires the Hive dependency. Since this is new and I couldn't find forums on Spark 2.0 testing, I am posting this; it might save time for other developers 🙂
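For anyone landing here, a minimal sketch of what such a test can look like, assuming spark-testing-base's Java helper JavaDataFrameSuiteBase and JUnit (class and method names follow its README; the queries are illustrative):

```java
import com.holdenkarau.spark.testing.JavaDataFrameSuiteBase;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.junit.Test;

// JavaDataFrameSuiteBase manages the Spark session for the test and brings in
// Hive support, which is why the Hive dependency mentioned above is needed.
public class DataFrameEqualityTest extends JavaDataFrameSuiteBase {

    @Test
    public void identicalDataFramesAreEqual() {
        Dataset<Row> expected = sqlContext().sql("SELECT 1 AS id, 'a' AS name");
        Dataset<Row> actual = sqlContext().sql("SELECT 1 AS id, 'a' AS name");

        // Fails the test if the schemas or the rows differ.
        assertDataFrameEquals(expected, actual);
    }
}
```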
12-08-2016
11:02 AM
If it solves your question, please mark it as accepted so that it shows as resolved in the overview. Thanks
08-31-2016
12:44 PM
2 Kudos
The NiFi UI has a Summary page available from the top-right menu, as well as stats on each processor available by right-clicking and selecting Status History. Both of those views show things like bytes read/written, flow files in/out, etc. If you want to track this information somewhere else, you can implement a custom ReportingTask to send it wherever you need. This is how Ambari is able to display the stats/graphs; there is an AmbariReportingTask: https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-ambari-bundle/nifi-ambari-reporting-task/src/main/java/org/apache/nifi/reporting/ambari/AmbariReportingTask.java
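For a feel of the reporting API, here is a hedged sketch, not the Ambari task itself; it assumes NiFi's AbstractReportingTask and the controller-level ProcessGroupStatus exposed through EventAccess:

```java
import org.apache.nifi.controller.status.ProcessGroupStatus;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.ReportingContext;

// Minimal sketch: log a few controller-level stats on each scheduled run.
// A real ReportingTask would push these to an external system instead.
public class StatsLoggingReportingTask extends AbstractReportingTask {

    @Override
    public void onTrigger(final ReportingContext context) {
        final ProcessGroupStatus status =
                context.getEventAccess().getControllerStatus();

        getLogger().info("bytesRead={}, bytesWritten={}, flowFilesReceived={}",
                new Object[] {
                        status.getBytesRead(),
                        status.getBytesWritten(),
                        status.getFlowFilesReceived()
                });
    }
}
```

A custom task like this gets packaged as a NAR and added from the controller settings, the same way the Ambari task is installed.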
02-23-2017
05:57 AM
@Bryan Thanks, Bryan. I will try this and let you know whether it works.
08-16-2016
01:44 PM
3 Kudos
I think it depends on what you mean by "schedule it to run every hour"... NiFi itself would always be running, and different processors can be scheduled to run according to their needs. Every processor supports timer-based or cron-based scheduling, so using either of those you can set a source processor to run every hour. You could also use the REST API to start and stop processors as needed; anything you can do in the UI can be done through the REST API (a sketch follows below).
For best practices for upgrading NiFi, see this wiki page: https://cwiki.apache.org/confluence/display/NIFI/Upgrading+NiFi
For deploying changes to production there are a couple of approaches; one of them is based around templates: https://github.com/aperepel/nifi-api-deploy
Some people also just move the flow.xml.gz from one environment to another, but this assumes you have parameterized everything that differs between environments.
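As a rough sketch of starting a processor over the REST API from Java (the /run-status endpoint and JSON body below follow newer NiFi 1.x releases and may differ in older versions; the host, processor ID, and revision version are placeholders):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// PUT a new run state for a processor by ID via NiFi's REST API.
public class ProcessorRunStatus {
    public static void main(String[] args) throws Exception {
        String processorId = "abc-123"; // placeholder; use a real processor ID
        // In practice, GET the processor first and echo back its current revision.
        String body = "{\"revision\":{\"version\":0},\"state\":\"RUNNING\"}";

        URL url = new URL("http://localhost:8080/nifi-api/processors/"
                + processorId + "/run-status");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```

Sending "STOPPED" instead of "RUNNING" stops the processor the same way.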
08-16-2016
01:59 PM
2 Kudos
In general, concurrent tasks is the number of threads calling onTrigger for an instance of a processor. In a cluster, if you set concurrent tasks to 4, then it is 4 threads on each node of your cluster. I am not as familiar with all the ins and outs of the Kafka processors, but GetKafka does something like this:

```java
// Request one Kafka message-stream per concurrent task configured on the processor.
int concurrentTaskToUse = context.getMaxConcurrentTasks();
final Map<String, Integer> topicCountMap = new HashMap<>(1);
topicCountMap.put(topic, concurrentTaskToUse);
final Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
        consumer.createMessageStreams(topicCountMap);
```

The consumer is from the Kafka 0.8 client, so it is creating a message-stream for each concurrent task. Then when the processor is triggered, it takes one of those message-streams and consumes a message, and since multiple concurrent tasks are triggering the processor, it is consuming from each of those streams in parallel. As far as rebalancing goes, I think the Kafka client does that transparently to NiFi, but I am not totally sure. Messages that have already been pulled into a NiFi node will stay there until the node is back up and processing.