Member since
01-14-2019
144
Posts
48
Kudos Received
17
Solutions
My Accepted Solutions
Title | Views | Posted
--- | --- | ---
  | 907 | 10-05-2018 01:28 PM
  | 780 | 07-23-2018 12:16 PM
  | 1101 | 07-23-2018 12:13 PM
  | 6269 | 06-25-2018 03:01 PM
  | 3360 | 06-20-2018 12:15 PM
06-25-2018
07:42 PM
1 Kudo
You can use a regular expression to isolate the header line (for example, search the entire content, anchor at the beginning with "^", and stop at the first newline) and replace it with what you'd like to have. You'll have to test out different regex strings to see what works for you, but this should get you started.
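A minimal sketch of that idea in Python (the sample content, header text, and replacement string are all hypothetical; in NiFi you'd put the same regex and replacement into a ReplaceText processor):

```python
import re

# Hypothetical sample content: a header line followed by data rows.
content = "old_header_a,old_header_b\n1,2\n3,4\n"

# Match everything from the start of the content up to (but not
# including) the first newline, and replace it with the new header.
fixed = re.sub(r"^[^\n]*", "col_a,col_b", content, count=1)
print(fixed)
```

The `count=1` keeps the substitution from touching anything past the first match.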
06-25-2018
03:01 PM
@RAUI In Ambari, you can go to the NiFi service and select More Actions->Restart All. This will restart all the nodes of NiFi in your cluster at the same time.
06-21-2018
11:50 AM
I think you've seen this blog post already but just in case you haven't: https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka You'll want to understand whether the bottleneck is on the Kafka side or on the NiFi side so you can understand where to appropriately tune. How many NiFi nodes do you have? How many Kafka nodes? How many partitions for your Kafka topic? The blog post above goes into detail on how to match partitions with NiFi nodes and concurrent tasks as well.
06-21-2018
11:34 AM
How many rows of data do you have? Can you test this issue out with a new set of tables with a few rows in them? Can you try removing the rest of the columns to remove any extra variables?
06-20-2018
01:36 PM
What datatype is mdse_item_i originally? Can you paste the output here for when you get the 7 distinct values versus the 10 distinct values? I'd like to see what the difference is.
06-20-2018
12:15 PM
2 Kudos
Is your data on HDFS? If so, you would use the GetHDFS processor to load your file into a FlowFile. If your data is on your local NiFi node, you would use the GetFile processor instead.

Next, if you want to split by newline, you could use the SplitText processor to split your file into multiple FlowFiles. If you only want to split on your '#@' and '#$', you can use the SplitContent processor. That processor splits on a sequence of text characters (set the 'Byte Sequence Format' to 'text'), so you can put in '#@' to split on. I'm not sure exactly how you'd like to divide your data, but that should give you a starting point. You can chain multiple SplitContent processors together to split on multiple character sequences. Ultimately, your one file on disk will be converted into multiple FlowFiles in NiFi.

Take a look at the SplitContent processor for more info: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.SplitContent/index.html
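Outside of NiFi, the chained-split idea looks roughly like this (the sample data and separators here are made up for illustration):

```python
# Hypothetical file content mixing two record separators.
data = "rec1#@rec2#@rec3#$rec4"

# First split on '#@' (like one SplitContent processor), then split
# each resulting piece on '#$' (like a second, chained SplitContent).
flowfiles = []
for part in data.split("#@"):
    flowfiles.extend(part.split("#$"))

print(flowfiles)  # ['rec1', 'rec2', 'rec3', 'rec4']
```

Each element of the final list corresponds to one FlowFile coming out of the second processor.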
06-19-2018
11:07 PM
1 Kudo
Here it is off of the NiFi Git repository - the code as well as everything that makes the project is open source so feel free to use it. https://git-wip-us.apache.org/repos/asf?p=nifi-site.git;a=tree;f=src/images;h=4319258d1204c08c31497c4494f46ddfd0a09e2f;hb=HEAD
06-19-2018
10:56 PM
1 Kudo
Yes, if what you are asking is to add an extra piece of data to a NiFi FlowFile then you can do that. What I am not sure of is the format of the data in your FlowFile - is it JSON, CSV, something else? If it is a human-readable format, you can use the ReplaceText processor to add more data into your FlowFile content. You'll need to modify your destination table schema and add another column to it assuming you're using Hive to read the data. The ReplaceText processor accepts statements in NiFi expression language so you'll want to read up on that to find out how to best find your string location and then insert text into it. https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
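As a rough illustration of the kind of edit ReplaceText can make (the CSV content and the appended value below are hypothetical; in NiFi you'd configure the regex and replacement on the processor rather than write Python):

```python
import re

# Hypothetical CSV FlowFile content; append a constant extra column
# to every line, the way a ReplaceText processor could with a
# line-oriented regex replacement.
content = "a,1\nb,2\nc,3"

# With re.MULTILINE, "$" matches at the end of each line.
updated = re.sub(r"$", ",new_value", content, flags=re.MULTILINE)
print(updated)
```

If you add a column this way, remember the point above: the Hive table schema needs a matching extra column.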
06-18-2018
02:45 PM
From the `top` documentation:

> %CPU -- CPU Usage: The percentage of your CPU that is being used by the process. By default, top displays this as a percentage of a single CPU. On multi-core systems, you can have percentages that are greater than 100%. For example, if 3 cores are at 60% use, top will show a CPU use of 180%.

Yes, MapReduce utilizes more than one core on your machine. It is parallelized at the node level as well as at the process level to take advantage of as many cores as possible. The processing of each row of data is independent of all other rows, so the data can be split up in as many ways as you have processing capacity.
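The multi-core arithmetic from that excerpt, spelled out:

```python
# top sums per-core usage across cores, so 3 cores each at 60%
# show up as a single 180% figure for the process.
cores = 3
per_core_pct = 60
total_pct = cores * per_core_pct
print(total_pct)  # 180
```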
06-18-2018
02:13 PM
It looks like the application you've written uses almost 500 MB of driver memory. If your goal is to utilize all of the CPU cores your nodes provide, you'll have to either change the way your application works (to reduce the driver RAM) or reduce the executor memory so that you can use all of the threads your cluster offers.
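A quick way to see the trade-off (all of the node and executor numbers below are made-up assumptions, not values from your cluster):

```python
# Hypothetical sizing: how many executors fit on one node, and
# whether RAM or cores is the binding constraint.
node_ram_gb = 64
node_cores = 16
executor_ram_gb = 8      # e.g. spark.executor.memory
executor_cores = 2       # e.g. spark.executor.cores

executors_by_ram = node_ram_gb // executor_ram_gb    # 8
executors_by_cores = node_cores // executor_cores    # 8
executors_per_node = min(executors_by_ram, executors_by_cores)

print(executors_per_node * executor_cores)  # 16 cores in use

# Shrinking executor memory raises executors_by_ram, which helps
# whenever RAM (not cores) is what limits executors_per_node.
```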