Member since 09-29-2015
871 Posts
721 Kudos Received
255 Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1772 | 12-03-2018 02:26 PM
 | 1187 | 10-16-2018 01:37 PM
 | 2207 | 10-03-2018 06:34 PM
 | 1326 | 09-05-2018 07:44 PM
 | 970 | 09-05-2018 07:31 PM
12-07-2018
04:08 PM
When you are running in a cluster, there should be a clustered state manager defined in conf/state-management.xml, which will likely be the ZooKeeper cluster state provider. If you are using a correctly set up ZooKeeper quorum, typically 3 nodes, then you really shouldn't need to back up anything since the information would be replicated on all 3 ZK instances.
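For reference, the provider typically looks like this in conf/state-management.xml (a sketch; the Connect String value is a placeholder for your own quorum):

```xml
<!-- ZooKeeper cluster provider in conf/state-management.xml; the
     Connect String below is a placeholder for your own quorum -->
<cluster-provider>
    <id>zk-provider</id>
    <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
    <property name="Connect String">zk1:2181,zk2:2181,zk3:2181</property>
    <property name="Root Node">/nifi</property>
    <property name="Session Timeout">10 seconds</property>
    <property name="Access Control">Open</property>
</cluster-provider>
```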
12-05-2018
09:46 PM
Is there any pattern you can find to reproduce this? If you restart NiFi Registry after this happens, is it working ok again, or does it have an issue on start up?
12-03-2018
02:26 PM
4 Kudos
NiFi Registry is a separate application and works the same whether NiFi is clustered or standalone. You can run NiFi Registry on one of the NiFi nodes or on a completely separate node; either way you just tell your NiFi cluster the location of NiFi Registry.
11-29-2018
06:12 PM
Thanks for the insight, that makes total sense. Now that you mention provenance, I'm wondering if NiFi should automatically be putting the bundle info of the component into the provenance events, since that is important information to know when looking at the history.
11-29-2018
05:33 PM
If there is a good reason for a processor to know the bundle information, then I think we can provide it through one of the context APIs. I'm curious to know what the use case is for the processor needing to know this; not saying we shouldn't add it, just want to understand.
11-27-2018
05:22 PM
1 Kudo
The processor has a pool of consumers that is created when the processor is started, so you would have to use the REST API to stop the processor, change the value of the topic names property, and then start the processor again.
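A rough sketch of that stop / update / start sequence with Python against the REST API (the base URL, processor id, and the "topic" property name are placeholders; check the property name your ConsumeKafka version actually uses):

```python
# Hedged sketch: stop a processor, update a property, and start it again
# via the NiFi REST API. Base URL, processor id, and property name are
# placeholders for your environment.
import requests

base = "http://localhost:8080/nifi-api"
proc_id = "00000000-0000-0000-0000-000000000000"  # placeholder id

def update_processor(component):
    """PUT an update, re-fetching the entity first to get a fresh revision."""
    entity = requests.get(f"{base}/processors/{proc_id}").json()
    component["id"] = proc_id
    return requests.put(f"{base}/processors/{proc_id}",
                        json={"revision": entity["revision"],
                              "component": component})

update_processor({"state": "STOPPED"})  # stop; threads may take a moment to finish
update_processor({"config": {"properties": {"topic": "new-topic-name"}}})
update_processor({"state": "RUNNING"})  # start again with the new topic
```

Since stopping is asynchronous, in practice you may want to poll the processor until it actually reports STOPPED before changing the property.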
11-01-2018
06:00 PM
FetchHDFS is to fetch a file, and PutHDFS is to write a file... so if there is a directory with no files then there is nothing to fetch or put.
10-17-2018
01:45 PM
1 Kudo
Your whole flow will perform better if you have flow files with many records, rather than 1 record per flow file. PutDatabaseRecord will set auto-commit to false on the connection, then start executing the statements, and if any failure happens it will call rollback on the connection.
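To illustrate the all-or-nothing behavior this gives you, here is a minimal sketch using Python's sqlite3 as a stand-in for the target database (PutDatabaseRecord does the equivalent over JDBC):

```python
# Minimal sketch of the transactional pattern: execute every statement in
# one transaction, commit if they all succeed, roll back on any failure.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")

records = [(1, "a"), (2, "b"), (3, "c")]  # one flow file, many records
try:
    # sqlite3 opens an implicit transaction here (auto-commit is off)
    conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", records)
    conn.commit()    # all records land together
except sqlite3.Error:
    conn.rollback()  # any failure undoes the whole batch
    raise
```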
10-16-2018
02:37 PM
Yes, it is up to the processor to decide what to log; the framework doesn't really know what the processor is doing. In this case, the stuff that you would see in the log would be all the calls to the logger here: https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/GetHTTP.java#L492-L553 I suppose some logging statements could be added around the call to client.execute(), but I'm not sure how you'd get more info about what the execute call is doing; you would just know it started or completed.
10-16-2018
01:43 PM
When you create a topic there are two different concepts: partitions and replication. If you have 3 brokers and create a topic with 1 partition, then the entire topic exists only on one of those brokers. If you create a topic with 3 partitions, then 1/3 of the topic is on broker 1 as partition 1, 1/3 on broker 2 as partition 2, and 1/3 on broker 3 as partition 3. If you create a topic with 3 partitions AND a replication factor of 2, then it's the same as above except there is also a copy of each partition on another node. So partition 1 may be on broker 1 with a copy on broker 2, partition 2 may be on broker 2 with a copy on broker 3, and partition 3 may be on broker 3 with a copy on broker 1. In general, replication ensures that if a broker goes down then another broker still has the data, and partitioning allows for higher read/write throughput by dividing up the data across multiple nodes.
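To make that concrete, here is a sketch using the kafka-python client to create a topic like the one in the example (the broker address and topic name are placeholders):

```python
# Illustrative sketch with kafka-python: create a topic with 3 partitions
# and a replication factor of 2, as in the example above.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="broker1:9092")  # placeholder
admin.create_topics([
    NewTopic(name="example-topic", num_partitions=3, replication_factor=2)
])
```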
10-16-2018
01:37 PM
Once a processor is started it is then running according to its configured schedule. For example, a Run Schedule of 5 seconds means it is executed every 5 seconds. Depending on the processor, each execution may produce one or more flow files that are transferred to a relationship, or it may produce an error, which is reported by a bulletin that shows a red error icon in the corner of the processor. So generally you should either be seeing flow files being transferred to one of the relationships, or bulletins.
10-03-2018
06:34 PM
2 Kudos
When you start a processor it is considered "scheduled", which means it then executes according to the scheduling strategy. The two scheduling strategies are "Timer Driven", like "Run every 5 mins", and "CRON Driven", which executes according to a cron expression. You can also use Timer Driven with a very low Run Schedule like 1 second, and then use the REST API to start and stop the processor when you need to.
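For illustration, a couple of hypothetical Quartz-style cron expressions (NiFi's CRON Driven strategy takes six fields: second, minute, hour, day of month, month, day of week):

```
# Hypothetical cron expressions for the CRON Driven strategy
# (fields: second minute hour day-of-month month day-of-week)
0 0 2 * * ?          run every day at 2:00 AM
0 0/15 9-17 * * ?    run every 15 minutes between 9 AM and 5 PM
```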
09-26-2018
02:57 PM
It only supports expression language from system properties and environment variables, but not from incoming flow files. This is because when the processor is started it will initialize a pool of UDP clients, so the port and host must be known beforehand. It would probably be possible to make the UDP case more dynamic, but part of the issue is that PutUDP and PutTCP share a significant amount of underlying code.
09-06-2018
01:01 PM
I think in the controller service for Hortonworks Schema Registry the URL needs to be http://localhost:7788/api/v1, whereas in your screenshot it is missing the http://. Also, if you can upgrade to HDF 3.2, there is an option in the writers to inherit the schema from the reader, so the writer doesn't even need to be configured with the schema registry; only the reader would need it.
09-05-2018
07:44 PM
The configuration of the reader and writer is not totally correct... you have selected the strategy as "Schema Text", which means the schema will come from the value of the "Schema Text" property, which you have left at the default value of ${avro.schema}. In UpdateAttribute you then set avro.schema to "AllItems", so it is trying to parse the string "AllItems" into an Avro schema and failing because it is not a JSON schema. If you want to use the "Schema Text" strategy, then in UpdateAttribute the value of avro.schema needs to be the full text of the schema that you showed in your post. If you want to use the schema from the HWX schema registry, then the access strategy needs to be "Schema Name" and the "Schema Name" property needs to reference the name of the schema in the HWX schema registry (this part you already have set up correctly, so just changing the strategy should work).
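For reference, the avro.schema attribute would need to contain a full JSON schema definition along these lines (the fields here are made up, since the original schema isn't shown in this thread):

```json
{
  "type": "record",
  "name": "AllItems",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "price", "type": "double" }
  ]
}
```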
09-05-2018
07:31 PM
1 Kudo
I answered this on stackoverflow: https://stackoverflow.com/questions/52188619/mergecontent-processor-is-not-giving-expected-result
08-27-2018
05:29 PM
3 Kudos
This behavior is currently how it is designed to work. The presence of variables is captured with their initial value at the time, but then changes to the variable values do not trigger changes to be committed. The idea is that you create your flow in one environment with some number of variables, then promote it to another environment and change the variable values to be whatever is needed for the new environment, which should not require anything to be committed back to registry.
08-14-2018
02:33 PM
The parquet data itself has the schema, and your writer should be configured with the schema access strategy "Inherit Record Schema". This will produce a flow file with many records. If you need 1 record per flow file then you would use SplitRecord after this, however generally it is better to keep many records together.
08-14-2018
01:38 PM
1 Kudo
I've responded on stackoverflow... https://stackoverflow.com/questions/51835130/how-exactly-apache-nifi-consumekafka-1-0-processor-works
08-14-2018
01:37 PM
FetchParquet has a property for a record writer... when it fetches the parquet, it will read it record by record using Parquet's Avro reader, and then pass each record to the configured writer. So if you configured it with a JSON record writer, then the resulting flow file that is fetched will contain JSON. If you wanted to fetch raw parquet then you wouldn't use FetchParquet, but would instead just use FetchHDFS which fetches bytes unmodified.
08-07-2018
01:37 PM
1 Kudo
https://pierrevillard.com/2017/01/24/integration-of-nifi-with-ldap/
https://ijokarumawak.github.io/nifi/2016/11/15/nifi-auth/
https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#ldap_login_identity_provider
08-01-2018
01:24 PM
1 Kudo
How do you plan to determine the schema from your JSON? Are you saying you want to infer a schema based on the data? Typically this approach doesn't work that well because it is hard to guess the correct type for a given field. Imagine the first record has a field "id" with the value "1234", so it looks like a number, but the second record has id as "abcd"; if it guesses a number based on the first record then it will fail on the second record because it's not a number. There is a processor that attempts to do this though, InferAvroSchema... you could probably do something like InferAvroSchema -> ConvertJSONToAvro -> PutParquet with an Avro reader.
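A toy sketch of the failure mode described above:

```python
# Toy illustration: a type guessed from the first record breaks as soon
# as a later record disagrees.
records = [{"id": "1234"}, {"id": "abcd"}]

inferred_type = int  # "1234" parses as a number, so we guess int
for record in records:
    inferred_type(record["id"])  # fine for "1234", raises ValueError on "abcd"
```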
07-31-2018
04:43 PM
The response of the POST should be the process group entity with the id populated, and in addition there should be a header that has the URI of the created process group.
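A hedged sketch of that call with Python (the base URL and parent group id are placeholders; the created group's URI typically comes back in the Location header):

```python
# Sketch: create a process group via the NiFi REST API and inspect
# the response. Base URL and parent group id are placeholders.
import requests

base = "http://localhost:8080/nifi-api"
parent_id = "root"  # "root" resolves to the top-level process group

resp = requests.post(
    f"{base}/process-groups/{parent_id}/process-groups",
    json={"revision": {"version": 0},
          "component": {"name": "example-group",
                        "position": {"x": 0.0, "y": 0.0}}},
)
entity = resp.json()
print(entity["id"])                   # id populated by the server
print(resp.headers.get("Location"))   # URI of the created process group
```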
07-26-2018
01:26 PM
Just wanted to add some more info... The Parquet Java API only allows reading and writing to and from Hadoop's Filesystem API, which is why NiFi currently can't provide a standard record reader and writer; those require reading and writing to Java's InputStream and OutputStream, which Parquet doesn't provide. So PutParquet can be configured with a record reader to handle any incoming data, and then converts it to Parquet and writes to HDFS; basically it has a record writer encapsulated in it that can only write to HDFS. FetchParquet does the reverse: it can read Parquet files from HDFS and then can be configured with a record writer to write them out in any form, in your case CSV. You can always create a core-site.xml with a local filesystem to trick the Parquet processors into using local disk instead of HDFS.
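For the local-disk trick, a minimal core-site.xml along these lines will do it; point the processor's Hadoop Configuration Resources property at this file:

```xml
<!-- Minimal core-site.xml that maps the default filesystem to local disk -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>file:///</value>
    </property>
</configuration>
```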
07-03-2018
05:07 PM
If ListFTP is showing an error in the UI then that error message is in nifi-app.log somewhere, please provide the full stacktrace that goes with that error.
07-03-2018
04:50 PM
Custom components have access to an API, not to the implementations. So a processor has access to ProcessSession, which is the API, but not StandardProcessSession, which is the implementation. This is done on purpose so that the NiFi framework can evolve behind the scenes as long as it adheres to the API, which is the contract with the processors/components. The only way to get the information you are looking for would be to have an upstream processor add attributes to the flow file containing the relevant information, and then use flowFile.getAttribute(...) to get the attributes you are interested in.
07-02-2018
04:52 PM
When you say it is "not working", what exactly is happening? There can only really be three possible outcomes:

a) it fetched successfully
b) it did not fetch successfully, and there is a red error on the processor in the UI and an error in nifi-app.log
c) it is stuck connecting, and there is a little number icon in the top-right of the processor which shows that 1 thread is still running trying to execute

Which of these choices describes the result?
07-02-2018
02:24 PM
1 Kudo
The remote input host is the hostname that the current node will advertise when another NiFi instance asks for information about the cluster, so it can only be one value. As an example, take NiFi cluster #1 sending data to NiFi cluster #2... the remote input host is only relevant on cluster #2 here. There will be a Remote Process Group (RPG) on cluster #1 with a URL (or comma-separated list of URLs) for cluster #2; it will then ask cluster #2 for cluster information, and cluster #2 will respond with the hostnames of the nodes in cluster #2, which will be based off the value of remote input host on each node in cluster #2.
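Concretely, that advertised hostname comes from nifi.properties on each node of cluster #2; for example (hostname and port values are placeholders):

```
# Site-to-site entries in nifi.properties on one node of cluster #2
nifi.remote.input.host=node1.cluster2.example.com
nifi.remote.input.socket.port=10443
nifi.remote.input.secure=true
```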
06-26-2018
03:11 PM
ListSFTP is a source processor that does not accept incoming flow files. It is made to track the state of a specific directory on a specific host and find new files, so if those values can change at any time then the state needs to be reset. This is challenging if those values can change on any given execution while the processor is running, so for this reason the processor does not support incoming flow files, and the values can only be changed by stopping and starting the processor with new configuration.