Member since: 09-29-2015
Posts: 871
Kudos Received: 723
Solutions: 255
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3348 | 12-03-2018 02:26 PM
 | 2302 | 10-16-2018 01:37 PM
 | 3615 | 10-03-2018 06:34 PM
 | 2392 | 09-05-2018 07:44 PM
 | 1814 | 09-05-2018 07:31 PM
05-12-2016
05:14 PM
8 Kudos
Processors can be renamed on the first tab of the processor's configuration window. You could name each processor after its table to easily identify it when looking at the flow.
05-03-2016
05:11 PM
3 Kudos
HDF (NiFi) moves data around as flow files, and each flow file is made up of metadata attributes and content, where the content is just bytes. There is no internal data format in which NiFi knows about "fields", so it can't provide a generic way to encrypt fields. It provides out-of-the-box processors called EncryptContent and DecryptContent, which can encrypt and decrypt the entire content of a flow file. If you have a known data format like CSV, JSON, etc., and want to encrypt individual fields within that content, it would likely require a custom processor to interpret that format and apply the encryption. It may be possible to do this with the ExecuteScript processor, but a custom Java processor would definitely be possible; a sketch of the per-field logic is below.
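To make the custom-processor route concrete, here is a minimal, hypothetical sketch of the per-field encryption logic such a processor could apply after parsing a field value out of the content. The FieldEncryptor class is made up for this example, and it uses Java's default AES transformation (ECB) purely for brevity; a real implementation should use an authenticated mode such as AES/GCM with a random IV.

    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class FieldEncryptor {

        private final SecretKeySpec key;

        public FieldEncryptor(byte[] rawKey) {
            // AES requires a 16-, 24-, or 32-byte key.
            this.key = new SecretKeySpec(rawKey, "AES");
        }

        // Encrypts a single parsed-out field value, returning it Base64-encoded
        // so it can be written back into a text format like CSV or JSON.
        public String encryptField(String value) throws Exception {
            // Default AES transformation (ECB/PKCS5Padding); for brevity only.
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte[] encrypted = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(encrypted);
        }

        // Reverses encryptField, for the matching decrypt processor.
        public String decryptField(String encoded) throws Exception {
            Cipher cipher = Cipher.getInstance("AES");
            cipher.init(Cipher.DECRYPT_MODE, key);
            byte[] decrypted = cipher.doFinal(Base64.getDecoder().decode(encoded));
            return new String(decrypted, StandardCharsets.UTF_8);
        }
    }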
04-29-2016
03:08 PM
12 Kudos
Apache NiFi provides several processors for receiving data over a network connection, such as ListenTCP, ListenSyslog, and ListenUDP. These processors are implemented using a common approach and a shared framework for receiving and processing messages. When one of these processors is scheduled to run, a separate thread is started to listen for incoming network connections. When a connection is accepted, another thread is created to read from that connection; we will call this the channel reader (the original thread continues listening for additional connections). The channel reader reads data from the connection as fast as possible, placing each received message into a blocking queue. The blocking queue is a reference shared with the processor: each time the processor executes, it polls the queue for one or more messages and writes them to a flow file.

There are several competing activities taking place:

- The data producer is writing data to the socket, where it is held in the socket's buffer, while at the same time the channel reader is reading the data held in that buffer. If data is written faster than it is read, the buffer will eventually fill up and incoming data will be dropped.
- The channel reader is pushing messages into the message queue, while the processor concurrently pulls messages off the queue. If messages are pushed in faster than the processor can pull them out, the queue will reach maximum capacity, causing the channel reader to block while waiting for space. While blocked, the channel reader is not reading from the connection, which means the socket buffer can start filling up and cause data to be dropped.

A simplified sketch of this reader/queue hand-off is shown below.
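This is only a minimal illustration of the pattern, not NiFi's actual implementation: it assumes line-delimited text messages, and the class and method names are made up. The real framework handles message demarcation, buffering, and error handling far more carefully.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class ChannelReaderSketch {

        // Bounded queue shared between the channel reader and the processor,
        // analogous to Max Size of Message Queue.
        private final BlockingQueue<String> messageQueue = new LinkedBlockingQueue<>(10_000);

        // Runs on its own thread for each accepted connection.
        void readChannel(Socket socket) throws Exception {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
                String message;
                while ((message = reader.readLine()) != null) {
                    // put() blocks when the queue is full, which is exactly the
                    // point at which the socket buffer can begin to fill up.
                    messageQueue.put(message);
                }
            }
        }

        // Called on each processor execution: drains up to maxBatchSize
        // messages, which would then be written to a single flow file.
        List<String> pollMessages(int maxBatchSize) {
            List<String> batch = new ArrayList<>();
            messageQueue.drainTo(batch, maxBatchSize);
            return batch;
        }
    }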
To achieve optimal performance, these processors expose several properties to tune these competing activities:
- Max Batch Size – controls how many messages are pulled from the message queue and written to a single flow file during one execution of the processor. The default value is 1, which means one message per flow file. A single message per flow file is useful for downstream parsing and routing, but it is the worst-performing scenario. Increasing the batch size drastically reduces the number of I/O operations performed and will likely provide the greatest overall performance improvement.
- Max Size of Message Queue – controls the size of the blocking queue used between the channel reader and the processor. In cases where large bursts of data are expected, and enough memory is available to the JVM, this value can be increased to allow more room for internal buffering. Keep in mind that this queue is held in memory; setting it too large could adversely affect the overall memory of the NiFi instance, causing problems for the whole flow and not just this processor.
- Max Size of Socket Buffer – attempts to tell the operating system what size to use for the socket buffer. Increasing this value provides additional breathing room for bursts of data or momentary pauses while reading. In some cases the buffer size must also be configured at the operating-system level in order to take effect; the processor will log a warning if it was unable to set the buffer to the desired size.
- Concurrent Tasks – a property on the Scheduling tab of every processor. Here, increasing concurrent tasks means pulling messages off the message queue more quickly, and thus more quickly freeing up room for the channel reader to push in more messages. With a larger Max Batch Size, each concurrent task batches together that many messages in a single execution.

Adjusting the above values appropriately should provide the ability to tune for high-throughput scenarios. A good approach is to start by increasing Max Batch Size from 1 to 1000 and observe performance. From there, a slight increase to Max Size of Socket Buffer, and increasing Concurrent Tasks from 1 to 2, should provide additional improvements.
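To put the batching effect in rough, illustrative numbers (not a benchmark): at 10,000 incoming messages per second, a Max Batch Size of 1 produces 10,000 flow files per second, while a Max Batch Size of 1000 produces only about 10 flow files per second for the same data, cutting the number of flow files (and the associated repository I/O) by three orders of magnitude.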
04-28-2016
03:04 PM
3 Kudos
It does not need to be on the source server; the processor listens on the configured port for incoming connections. A syslog server can be configured to forward messages to the host and port that the processor is listening on. Once connections are accepted, it reads as fast as possible. The main properties to consider tuning are Max Batch Size, Max Size of Socket Buffer, and Max Size of Message Queue, but the right values depend a lot on your environment and the amount of data. An example forwarding rule is shown below.
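As one concrete example, an rsyslog forwarding rule like the following would send all messages to a ListenSyslog processor; the hostname and port here are placeholders for whatever the processor is configured with:

    # "@@" forwards over TCP; a single "@" would forward over UDP.
    # nifi-host and 6514 are placeholders.
    *.* @@nifi-host:6514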
04-28-2016
02:18 AM
1 Kudo
I don't think anyone has done anything yet, but I've wanted to for a while. It should be straightforward to wrap the Jedis client in some processors; the core calls are about as simple as the sketch below.
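For a sense of how little client-side code is involved, here is a minimal, hypothetical example of the Jedis calls a put-style Redis processor might wrap. The processor concept, key, and value are made up; the host and port are placeholders.

    import redis.clients.jedis.Jedis;

    public class RedisSketch {
        public static void main(String[] args) {
            // Connect to a Redis instance (host/port are placeholders).
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // A put-style processor might map a flow file attribute to the
                // key and the flow file content to the value.
                jedis.set("flowfile-key", "flowfile-content");
                System.out.println(jedis.get("flowfile-key"));
            }
        }
    }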
04-20-2016
03:10 PM
@Anoop Nair adding FetchHBase to the nifi-hbase-bundle sounds like the right approach. The HBase processors rely on a Controller Service (HBaseClientService) to interact with HBase, which hides the HBase client from the processors. There is currently a scan method in HBaseClientService:

    public void scan(final String tableName, final Collection<Column> columns,
            final String filterExpression, final long minTime,
            final ResultHandler handler) throws IOException;

Technically you could probably use this with a filter expression limiting to the row id you want, but it would be a lot more efficient to create a new method in that interface that takes the table, row id, and columns to return; a possible shape for that method is sketched below.
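To be clear, this signature is only an illustration of the idea, not an existing method in the interface:

    // Hypothetical addition to HBaseClientService: fetch a single row by id,
    // optionally restricted to the given columns, handing results to the
    // existing ResultHandler callback.
    void fetch(final String tableName, final byte[] rowId,
            final Collection<Column> columns, final ResultHandler handler)
            throws IOException;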
04-20-2016
02:27 PM
@Anoop Nair if you do get something working, we would be happy to help you contribute it back to NiFi if you would like to, although it is perfectly fine to maintain it outside of NiFi as well. Let us know if we can help.
04-19-2016
06:37 PM
Or are you getting the row id, column family, and column qualifier from Kafka, and you want to fetch a single cell?
04-19-2016
06:30 PM
I created this JIRA: https://issues.apache.org/jira/browse/NIFI-1784 How would you want the data for the row to be represented in a NiFi flow file? A JSON document where the key/value pairs are column qualifiers and values, like the example below?
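For instance, with purely made-up qualifiers and values, one row might come out as:

    {
      "name": "John Doe",
      "email": "jdoe@example.com"
    }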
04-18-2016
06:54 PM
1 Kudo
There is a GetHBase processor described here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.hbase.GetHBase/index.html It is intended to do an incremental retrieval from an HBase table. Does this work for your use-case, or are you looking for something different?