Member since: 09-29-2015
Posts: 871
Kudos Received: 720
Solutions: 255
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1517 | 12-03-2018 02:26 PM
 | 990 | 10-16-2018 01:37 PM
 | 1795 | 10-03-2018 06:34 PM
 | 1134 | 09-05-2018 07:44 PM
 | 793 | 09-05-2018 07:31 PM
06-28-2016
03:10 PM
4 Kudos
There is a GetHBase processor made to incrementally extract data from an HBase table by keeping track of the last timestamp seen and finding cells with a timestamp greater than that. There is an open JIRA to create another processor, likely called ScanHBase, that would not be based on timestamps and would allow more general extraction.
06-28-2016
02:49 PM
Both of the above answers are correct. Just to provide a full picture of what is happening... GetFile picks up the file from the directory and brings it into NiFi's content repository, which as milind pointed out is by default located under {nifi_install_dir}/content_repository. This directory is not meant to be used by the user; it is for NiFi's internal purposes. The FlowFile is then transferred to LogAttribute, which logs information, and I assume if that is the end of your flow then you must have marked the success relationship on LogAttribute as auto-terminated. At this point the FlowFile is removed from NiFi and the content in the content repository will eventually be removed. NiFi is not meant to be a storage system where you bring data in and then leave it there; your flow would have to send the data somewhere after GetFile.
06-24-2016
01:20 PM
4 Kudos
Yes, NiFi supports LDAP authentication. It is described in the admin guide here: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#lightweight-directory-access-protocol-ldap
06-22-2016
01:23 PM
@Shishir Saxena Unfortunately I can't think of too many good workarounds using out-of-the-box functionality, but it depends on what you are doing with the Avro after ExecuteSQL. Are you landing it in HDFS, or are you converting it to some other format and doing something else?
06-20-2016
09:24 PM
4 Kudos
Hello, The GetHBase processor was created to be a source processor that incrementally extracts data from an HBase table, using the timestamps of the cells to find new data. Unfortunately, because of that it does not support being triggered by incoming FlowFiles, which means it can't be used with HandleHttpRequest/Response. We had a similar request to what you would like to do and created this JIRA ticket: https://issues.apache.org/jira/browse/NIFI-1784 If the ScanHBase processor described in the ticket were implemented, then you could have a flow like HandleHttpRequest -> ScanHBase -> HandleHttpResponse
-Bryan
06-20-2016
03:36 PM
2 Kudos
Looks like the Twitter API puts hashtags into the following JSON:

    "entities": {
        "hashtags": [],
        "urls": [],
        "user_mentions": []
    }

You could use EvaluateJsonPath to extract the value of the hashtags into FlowFile attributes, and then use RouteOnAttribute to route the ones matching your tag to a PutFile processor. This blog shows an example of extracting values from the Twitter JSON and making routing decisions: https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and
06-17-2016
05:08 PM
1 Kudo
https://issues.apache.org/jira/browse/NIFI-2048
06-17-2016
05:00 PM
Actually, I was just reading this: https://docs.oracle.com/cd/B19306_01/java.102/b14188/datamap.htm One of the rows says:

    DEC, DECIMAL, NUMBER, NUMERIC -> oracle.sql.NUMBER -> java.math.BigDecimal

So if that is true, then Oracle is returning BigDecimal for your NUMBER columns, and NiFi's code is turning BigDecimal and BigInteger into Strings. Based on this, it looks like there may be a way to represent BigDecimal in Avro: https://issues.apache.org/jira/browse/AVRO-1402 https://avro.apache.org/docs/1.7.7/spec.html#Logical+Types We should look into using the decimal logical type in NiFi.
06-17-2016
04:42 PM
I'm wondering if it is related to the Oracle driver... the code in NiFi that I linked to above shows how we convert the fields to Avro. It starts by getting the value from the ResultSet like this:

    final Object value = rs.getObject(i);

Then it checks what type the value is; in the case of a number it does this:

    else if (value instanceof Number || value instanceof Boolean) {
        rec.put(i - 1, value);
    }

So if the Oracle driver returns any sub-class of Number, then that is what is given to Avro. I think a good test would be to write a simple Java program using JDBC and the same driver you are using with NiFi, query your table, and see what getObject(i) returns from the ResultSet for your number columns. If it doesn't return Numbers then that is the problem; if it does, then something else is happening after that.
06-16-2016
08:07 PM
3 Kudos
It should be retaining other types besides strings... What are the types of your columns that are being converted to strings? It looks like it should be retaining BINARY, VARBINARY, LONGVARBINARY, ARRAY, BLOB, CLOB, byte, number, and boolean. BigInteger and BigDecimal are converted to strings because Avro doesn't have a numeric type that supports them. For reference, the code that performs the conversion is here: https://github.com/apache/nifi/blob/e4b7e47836edf47042973e604005058c28eed23b/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/JdbcCommon.java#L100
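To make that branching concrete, here is a small self-contained sketch approximating the decision logic (a simplification for illustration, not the actual JdbcCommon code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class AvroTypeSketch {
    // Approximation of how the value from ResultSet.getObject(i) is handed to Avro:
    // BigDecimal/BigInteger become Strings (Avro has no numeric type for them),
    // other Numbers and Booleans pass through, everything else is stringified.
    static Object toAvroValue(Object value) {
        if (value == null) {
            return null;
        } else if (value instanceof BigDecimal || value instanceof BigInteger) {
            return value.toString();
        } else if (value instanceof Number || value instanceof Boolean) {
            return value;
        } else {
            return value.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(toAvroValue(new BigDecimal("42.5")).getClass().getSimpleName()); // String
        System.out.println(toAvroValue(42L).getClass().getSimpleName());                    // Long
        System.out.println(toAvroValue(Boolean.TRUE).getClass().getSimpleName());           // Boolean
    }
}
```

This shows why an Oracle driver that reports NUMBER columns as BigDecimal would end up with string values in the Avro output.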
06-15-2016
02:23 PM
1 Kudo
NiFi processors have their own transaction concept called a session. Each processor starts a session, performs one or more operations, and then commits or rolls back the session. You can see the general idea in the AbstractProcessor that many processors extend: https://github.com/apache/nifi/blob/master/nifi-api/src/main/java/org/apache/nifi/processor/AbstractProcessor.java
06-02-2016
03:18 PM
1 Kudo
Can you retry all these tests, but during the second cat, instead of "cat 'record02'", cat something longer like "cat 'record123456789'"? I'd like to see if tracking the file size is the issue, because record01 and record02 would be the same file size.
05-25-2016
12:52 PM
This most likely means there is another JAR you need to add... If you look at the pom file for the hadoop-azure JAR: http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.0/hadoop-azure-2.7.0.pom You can see all the dependencies it needs: <dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.microsoft.azure</groupId>
<artifactId>azure-storage</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<scope>compile</scope>
</dependency>
My guess would be the azure-storage JAR is missing. This becomes a slippery slope though, because then azure-storage might have transitive dependencies as well.
05-25-2016
12:47 PM
1 Kudo
Generally all processors are executing within a single OS process started by a single user. The only case I can think of where one processor could execute at a higher level would be when using the ExecuteProcess/ExecuteStreamCommand processors... the command can be "sudo" and the args can be the command to execute. This assumes the user that started NiFi has sudo privileges.
05-24-2016
01:58 PM
2 Kudos
The DistributedMapCache is a NiFi concept used to store information for later retrieval, either by the current processor or by another processor. There are two components: the DistributedMapCacheServer, which runs on one node if you are in a cluster, and the DistributedMapCacheClientService, which runs on all nodes if in a cluster and communicates with the server. Both of these are Controller Services, configured in NiFi through the controller section in the top right toolbar. Processors use the client service to store and retrieve data from the cache server. In this case, DetectDuplicate uses the cache to store information about what it has seen and determine if something is a duplicate.
05-20-2016
05:05 PM
1 Kudo
I'm not totally sure if this is the problem, but given that NiFi has NARs with isolated class loading, adding something to the classpath usually isn't as simple as dropping it in the lib directory. The hadoop libraries NAR would be unpacked to this location: work/nar/extensions/nifi-hadoop-libraries-nar-<VERSION>.nar-unpacked/META-INF/bundled-dependencies/ You could try putting the hadoop-azure.jar there, keeping in mind that if the work directory was removed, NiFi would unpack the original NAR again without your added jar. Some have had success creating a custom version of the hadoop libraries NAR to switch to other libraries: https://github.com/bbukacek/nifi-hadoop-libraries-bundle Right now Apache NiFi is based on Apache Hadoop 2.6.2.
05-19-2016
11:46 AM
I'm not 100% sure how LZO works, but in a lot of cases the codec ends up needing a native library. On a Unix system you would set LD_LIBRARY_PATH to include the location of the .so files for the LZO codec, or put them in the JAVA_HOME/jre/lib/native directory. You could do something like:

    export LD_LIBRARY_PATH=/usr/hdp/2.2.0.0-1084/hadoop/lib/native
    bin/nifi.sh start

That should let PutHDFS know about the appropriate libraries.
05-18-2016
08:11 PM
1 Kudo
Have you seen the Kerberos section of the NiFi admin guide? https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#kerberos_login_identity_provide
05-18-2016
01:19 PM
Can you try what Matt suggested above, removing "io.compression.codecs" from core-site.xml? I agree with him that this is likely related to the compression codecs; you can see in the stack trace that the relevant lines are:

    org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2058) ~[na:na]
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128) ~[na:na]
    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175) ~[na:na]
    at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.getCompressionCodec(AbstractHadoopProcessor.java:375) ~[na:na]
    at org.apache.nifi.processors.hadoop.PutHDFS.onTrigger(PutHDFS.java:220) ~[na:na]
05-18-2016
01:14 AM
1 Kudo
@bschofield Another idea for transferring large files over a high-latency network might be the following... On the sending side, use a SegmentContent processor to break a large FlowFile into many smaller segments, followed by a PostHTTP processor with Concurrent Tasks increased higher than 1. This lets the sending side better utilize the network by concurrently sending segments. On the receiving side, use a ListenHTTP processor to receive the segmented FlowFiles, followed by a MergeContent processor with a Merge Strategy of Defragment. The Defragment mode will merge all the segments back together to recreate the original FlowFile.
05-17-2016
05:01 PM
What version of NiFi is this?
05-12-2016
05:14 PM
8 Kudos
Processors can be renamed on the first tab of the processor's configuration window. You could name each processor with the table name to easily see it when looking at the flow.
05-03-2016
05:11 PM
3 Kudos
HDF (NiFi) moves data around as flow files, and each flow file is made up of metadata attributes and content, where the content is just bytes. There is no internal data format where NiFi knows about "fields", so it can't provide a generic way to encrypt fields. It provides out-of-the-box processors called EncryptContent and DecryptContent which can encrypt and decrypt the entire content of a FlowFile. If you have a known data format like CSV, JSON, etc., and want to encrypt individual fields within that content, it would likely require a custom processor to interpret that format and apply the encryption. It may be possible to do this with the ExecuteScript processor, but a custom Java processor would definitely be possible.
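As a rough illustration of what the core of such a custom processor could do for a single field, here is a self-contained sketch using the JDK's own crypto classes (the field value, cipher mode, and key handling are purely illustrative; a real implementation should use an authenticated mode and proper key management):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class FieldEncryptSketch {
    // Encrypt a single field value; a real processor would parse the CSV/JSON,
    // apply this per field, and re-serialize the content.
    static String encryptField(String value, SecretKey key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding"); // illustrative only; prefer an authenticated mode like GCM
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(encrypted);
    }

    static String decryptField(String value, SecretKey key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, key);
        byte[] decrypted = cipher.doFinal(Base64.getDecoder().decode(value));
        return new String(decrypted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        String cipherText = encryptField("123-45-6789", key);
        System.out.println(decryptField(cipherText, key)); // prints 123-45-6789
    }
}
```

The same per-field transform could live inside an ExecuteScript body or a custom processor's onTrigger method.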
04-29-2016
03:08 PM
12 Kudos
Apache NiFi provides several processors for receiving data over a network connection, such as ListenTCP, ListenSyslog, and ListenUDP. These processors are implemented using a common approach and a shared framework for receiving and processing messages.

When one of these processors is scheduled to run, a separate thread is started to listen for incoming network connections. When a connection is accepted, another thread is created to read from that connection; we will call this the channel reader (the original thread continues listening for additional connections). The channel reader reads data from the connection as fast as possible, placing each received message into a blocking queue. The blocking queue is a shared reference with the processor. Each time the processor executes, it polls the queue for one or more messages and writes them to a flow file.

There are several competing activities taking place:

- The data producer is writing data to the socket, where it is held in the socket's buffer, while at the same time the channel reader is reading the data held in that buffer. If data is written faster than it is read, the buffer will eventually fill up and incoming data will be dropped.
- The channel reader is pushing messages into the message queue, while the processor concurrently pulls messages off the queue. If messages are pushed into the queue faster than the processor can pull them out, the queue will reach maximum capacity, causing the channel reader to block while waiting for space. While the channel reader is blocked it is not reading from the connection, which means the socket buffer can start filling up and cause data to be dropped.

In order to achieve optimal performance, these processors expose several properties to tune these competing activities:

- Max Batch Size – Controls how many messages will be pulled from the message queue and written to a single flow file during one execution of the processor. The default value is 1, which means one message per flow file. A single message per flow file is useful for downstream parsing and routing, but provides the worst-performing scenario. Increasing the batch size will drastically reduce the number of I/O operations performed, and will likely provide the greatest overall performance improvement.
- Max Size of Message Queue – Controls the size of the blocking queue used between the channel reader and the processor. In cases where large bursts of data are expected, and enough memory is available to the JVM, this value can be increased to allow more room for internal buffering. It is important to consider that this queue is held in memory; setting it too large could adversely affect the overall memory of the NiFi instance, causing problems for the whole flow and not just this processor.
- Max Size of Socket Buffer – Attempts to tell the operating system what size to use for the socket buffer. Increasing this value provides additional breathing room for bursts of data, or momentary pauses while reading data. In some cases, configuration may also need to be done at the operating system level in order for the socket buffer size to take effect. The processor will provide a warning if it was not able to set the buffer size to the desired value.
- Concurrent Tasks – A property on the Scheduling tab of every processor. In this case, increasing the concurrent tasks means messages are pulled off the message queue more quickly, thus more quickly freeing up room for the channel reader to push more messages in. When using a larger Max Batch Size, each concurrent task is able to batch together that many messages in a single execution.

Adjusting the above values appropriately should provide the ability to tune for high-throughput scenarios. A good approach would be to start by increasing the Max Batch Size from 1 to 1000 and then observe performance. From there, a slight increase to the Max Size of Socket Buffer, and increasing Concurrent Tasks from 1 to 2, should provide additional improvements.
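The reader/processor handoff described above is essentially a bounded blocking queue with batched draining; the following is a minimal self-contained sketch of that pattern (class and variable names are illustrative, not NiFi's actual internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ListenSketch {
    // Processor side: drain up to maxBatchSize messages per execution,
    // writing each batch to a single flow file; returns how many flow files result.
    static int drainToFlowFiles(BlockingQueue<String> messageQueue, int maxBatchSize) {
        List<String> batch = new ArrayList<>(maxBatchSize);
        int flowFiles = 0;
        while (messageQueue.drainTo(batch, maxBatchSize) > 0) {
            flowFiles++;   // one flow file per batch instead of one per message
            batch.clear();
        }
        return flowFiles;
    }

    public static void main(String[] args) throws InterruptedException {
        // "Max Size of Message Queue": the channel reader blocks when this fills up.
        BlockingQueue<String> messageQueue = new LinkedBlockingQueue<>(10000);

        // Channel reader side: put() blocks if the queue is full, which is what
        // eventually lets the socket buffer fill and drop data.
        for (int i = 0; i < 2500; i++) {
            messageQueue.put("message-" + i);
        }

        // With Max Batch Size = 1000, 2500 messages become 3 flow files.
        System.out.println(drainToFlowFiles(messageQueue, 1000)); // prints 3
    }
}
```

This illustrates why raising Max Batch Size reduces I/O so dramatically: the number of flow files written drops roughly by a factor of the batch size.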
04-28-2016
03:04 PM
3 Kudos
It does not need to be on the source server; the processor is listening on the configured port for incoming connections. A syslog server can be configured to forward messages to the host and port that the processor is listening on. Once it accepts connections it will be reading as fast as possible. The main properties to consider customizing are Max Batch Size, Max Size of Socket Buffer, and Max Size of Message Queue, but this depends a lot on your environment and the amount of data.
04-28-2016
02:18 AM
1 Kudo
I don't think anyone has done anything yet, but I've wanted to for a while. It should be straightforward to wrap the Jedis client in some processors.
04-20-2016
03:10 PM
@Anoop Nair adding FetchHBase to the nifi-hbase-bundle sounds like the right approach. The HBase processors rely on a Controller Service (HBaseClientService) to interact with HBase, which hides the HBase client from the processors. There is currently a scan method in the HBaseClientService:

    public void scan(final String tableName, final Collection<Column> columns, final String filterExpression,
            final long minTime, final ResultHandler handler) throws IOException {

Technically you could probably use this with a filter expression limiting to the row id you want, but it would be a lot more efficient to create a new method in that interface that takes the table, row id, and columns to return.
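As a sketch of what such a new method could look like, here is a simplified, hypothetical version of the interface (type and method names are stand-ins, not NiFi's actual API), with a trivial in-memory implementation to show the call pattern:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;

public class FetchHBaseSketch {
    // Hypothetical, simplified stand-ins for NiFi's Column/ResultHandler types.
    interface ResultHandler {
        void handle(String rowId, Map<String, String> cells);
    }

    interface HBaseClient {
        // Existing scan-style access (signature simplified from HBaseClientService).
        void scan(String tableName, Collection<String> columns, String filterExpression,
                  long minTime, ResultHandler handler);

        // Proposed addition: fetch a single row by id instead of scanning with a filter.
        void fetch(String tableName, String rowId, Collection<String> columns,
                   ResultHandler handler);
    }

    public static void main(String[] args) {
        // Trivial in-memory stand-in, just to show how a processor would call fetch().
        HBaseClient client = new HBaseClient() {
            public void scan(String t, Collection<String> c, String f, long m, ResultHandler h) { }
            public void fetch(String t, String rowId, Collection<String> c, ResultHandler h) {
                h.handle(rowId, Map.of("cf:qual", "value-for-" + rowId));
            }
        };
        client.fetch("users", "row-1", List.of("cf:qual"),
                (rowId, cells) -> System.out.println(rowId + " -> " + cells));
    }
}
```

A FetchHBase processor would call the fetch method with values taken from incoming FlowFile attributes, avoiding the cost of a filtered scan.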
04-20-2016
02:27 PM
@Anoop Nair if you do get something working we would be happy to help you contribute it back to NiFi if you would like to, although it is perfectly fine to maintain it outside of NiFi as well. Let us know if we can help.
04-19-2016
06:37 PM
Or are you getting the row id, col fam, col qual coming from kafka, and you want to fetch a single cell?
04-19-2016
06:30 PM
I created this JIRA: https://issues.apache.org/jira/browse/NIFI-1784 How would you want the data for the row to be represented in a NiFi flow file? JSON document where the key/value pairs are column qualifiers / values?
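As one purely illustrative possibility, the row could become a flat JSON object mapping column qualifiers to values; here is a sketch using plain string building just to show the shape (a real implementation would use a JSON library and handle escaping and typing):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowToJsonSketch {
    // Build a flat JSON object from column-qualifier/value pairs.
    static String toJson(Map<String, String> cells) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : cells.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"").append(e.getValue()).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("cf:name", "alice");
        cells.put("cf:city", "nyc");
        System.out.println(toJson(cells)); // {"cf:name":"alice","cf:city":"nyc"}
    }
}
```

The resulting JSON string would become the content of the outgoing FlowFile, with the row id perhaps carried as a FlowFile attribute.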