Member since: 05-17-2016
Posts: 190
Kudos Received: 46
Solutions: 11
02-07-2018
02:52 PM
@Felix Albani Thank you for your feedback... I have made the correction.
05-11-2017
03:26 PM
Thanks @Matt Burgess. I wanted to be sure whether using "replace" on the template was a dirty fix.
07-31-2019
07:21 PM
@amcbarnett : I am trying to aggregate data using "state, count(distinct val) group by state", but I want only the non-null values of val (a string column).
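For reference, a minimal Hive-style sketch of the query I am describing, using hypothetical table and column names (my_table, state, val):

SELECT state, COUNT(DISTINCT val) AS distinct_vals
FROM my_table
WHERE val IS NOT NULL   -- drop NULL vals before grouping
GROUP BY state;

Note that COUNT(DISTINCT ...) already skips NULLs in standard SQL; the explicit WHERE clause just makes the intent clear, and could be extended (e.g. AND val != '') if empty strings should be excluded as well.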
02-21-2017
02:04 PM
Thanks @Andy LoPresto. This helps.
02-13-2017
05:57 PM
1 Kudo
For IO
The throughput or latency one can expect to see varies greatly depending on how the system is configured. Given that there are pluggable approaches to most of the major NiFi subsystems, performance depends on the implementation. But, for something concrete and broadly applicable, consider the out-of-the-box default implementations. These are all persistent with guaranteed delivery, and they achieve this using local disk. So, being conservative, assume roughly a 50 MB per second read/write rate on modest disks or RAID volumes within a typical server. For a large class of dataflows, NiFi should then be able to efficiently reach 100 MB per second or more of throughput, because linear growth is expected for each physical partition and content repository added to NiFi. This will bottleneck at some point on the FlowFile repository and provenance repository. We plan to provide a benchmarking and performance test template to include in the build, which allows users to easily test their system, to identify where the bottlenecks are, and at which point they might become a factor. This template should also make it easy for system administrators to make changes and to verify the impact.
For CPU
The Flow Controller acts as the engine dictating when a particular processor is given a thread to execute. Processors are written to return the thread as soon as they are done executing a task. The Flow Controller can be given a configuration value indicating the available threads for the various thread pools it maintains. The ideal number of threads to use depends on the host system's resources in terms of number of cores, whether that system is running other services as well, and the nature of the processing in the flow. For typical IO-heavy flows, it is reasonable to make many dozens of threads available.
For RAM
NiFi lives within the JVM and is thus limited to the memory space afforded to it by the JVM. JVM garbage collection becomes a very important factor, both in restricting the total practical heap size and in optimizing how well the application runs over time. NiFi jobs can also be I/O intensive when reading the same content regularly, so configure a large enough disk to optimize performance.
See:
https://community.hortonworks.com/questions/22685/capacity-planning-for-nifi-cluster.html
https://community.hortonworks.com/questions/4098/nifi-sizing.html
https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices
https://community.hortonworks.com/content/kbentry/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
https://community.hortonworks.com/content/kbentry/9785/nifihdf-dataflow-optimization-part-2-of-2.html
http://apache-nifi.1125220.n5.nabble.com/Nifi-Benchmark-Performance-tests-td1099.html
http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.1.1/bk_dataflow-overview/content/performance-expectations-and-characteristics-of-nifi.html
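To make the RAM point concrete: NiFi's JVM heap is set in conf/bootstrap.conf. A minimal sketch, assuming the stock argument numbering (the 4g value is illustrative, not a recommendation; check which java.arg.N entries your file already uses):

# conf/bootstrap.conf -- JVM heap settings for NiFi
java.arg.2=-Xms4g    # initial heap size (ships as 512m by default)
java.arg.3=-Xmx4g    # maximum heap size; matching Xms and Xmx avoids resize pauses

A larger heap is not automatically better here: past a point, garbage collection pauses grow and offset the gains, which is the trade-off described above.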
10-18-2018
01:27 AM
It was an access issue on the buckets. Setting the right permissions on the bucket fixed it.
01-30-2017
06:35 PM
@Bryan Bende : Thanks for pointing out the Jira.
02-01-2017
03:21 AM
1 Kudo
@Vaibhav Kumar
The recommendations from my colleagues are valid: you have strings in the header row of your CSV documents. You can certainly filter by some known entity, but there is a more advanced version of the CSV Pig loader called CSVExcelStorage. It is part of the Piggybank library that comes bundled with HDP, hence the register command below. You can pass different control parameters to it. The Mortar blog is an excellent source of information on working with Pig: http://help.mortardata.com/technologies/pig/csv
grunt> register /usr/hdp/current/pig-client/piggybank.jar;
grunt> a = load 'BJsales.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (Num:int,time:int,BJsales:float);
grunt> describe a;
a: {Num: int,time: int,BJsales: float}
grunt> b = limit a 5;
grunt> dump b;
Output:
(1,1,200.1)
(2,2,199.5)
(3,3,199.4)
(4,4,198.9)
(5,5,199.0)
Notice I am not filtering any relation; I'm telling the loader to skip the header outright. It saves a few keystrokes and doesn't waste any cycles processing anything extra.
12-15-2016
06:16 PM
2 Kudos
Thanks @Karthik Narayanan. I was able to resolve the issue. Before diving into the solutions, I should make the statement below: with NiFi 1.0 and 1.1,
LZO compression cannot be achieved using the PutHDFS processor. The only supported compressions are the ones listed in the compression codec drop-down. With the LZO-related classes present in core-site.xml, the NiFi processor fails to run. The suggestion from the previous HCC post was to remove those classes, but they needed to be retained so that NiFi's copy and HDP's copy of core-site.xml stay in sync.
NiFi 1.0
I created the hadoop-lzo jar by building it from source, added it to the NiFi lib directory, and restarted NiFi. This resolved the issue, and I am able to proceed using PutHDFS without it erroring out.
NiFi 1.1
Configure the processor's additional classpath to point to the jar file. No restart required.
Note: this does not provide LZO compression; it just allows the processor to run without error even when the LZO classes are present in core-site.xml.
UNSATISFIED LINK ERROR WITH SNAPPY
I also had an issue with the Snappy compression codec in NiFi. I was able to resolve it by setting the path to the .so file. This did not work on the ambari-vagrant boxes, but I was able to get it working on an OpenStack cloud instance. The issue on the VirtualBox VM could be systemic.
To resolve the link error, I copied the .so files from the HDP cluster and recreated the links. And as @Karthik Narayanan suggested, I added the Java library path pointing to the directory containing the .so files. Below is the list of .so files and links.
And below is the bootstrap configuration change.
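A sketch of the kind of bootstrap.conf entry involved, assuming the .so files were copied to /opt/hadoop/lib/native (a hypothetical path) and that java.arg.15 is not already taken in your conf/bootstrap.conf:

# conf/bootstrap.conf -- point the JVM at the directory containing libsnappy.so and its symlinks
java.arg.15=-Djava.library.path=/opt/hadoop/lib/native

A NiFi restart is needed for bootstrap.conf changes to take effect.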
11-18-2016
01:19 PM
Thanks @Matt Burgess. Currently I am handling this with JavaScript, a similar approach to what you described. I wanted to confirm there is no other way. For simpler structures, I managed to extract the key values using regex, but for deeply nested keys, I was forced to use ExecuteScript.