Member since: 11-16-2015
Posts: 905
Kudos Received: 666
Solutions: 249
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 509 | 09-30-2025 05:23 AM |
| | 826 | 06-26-2025 01:21 PM |
| | 751 | 06-19-2025 02:48 PM |
| | 935 | 05-30-2025 01:53 PM |
| | 11682 | 02-22-2024 12:38 PM |
05-30-2018 03:39 AM
Are your files still in the source directory? GetFile removes the source files by default, so if something went wrong downstream you may no longer have those files; you'd need to set Keep Source File to true in that case (as Matt Clarke recommended). That's why the recommended option is ListFile -> FetchFile: ListFile keeps track of the files it has already seen and won't list (and thus fetch) them again.
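Conceptually, here is what the List/Fetch pattern buys you. This is just a toy Python sketch of state tracking, not NiFi code; the directory and state-file paths are hypothetical placeholders:

```python
import os

SOURCE_DIR = "/data/incoming"         # hypothetical source directory
STATE_FILE = "/tmp/listed_files.txt"  # hypothetical state, analogous to ListFile's stored state

# Load the set of filenames that have already been listed.
seen = set()
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        seen = set(line.strip() for line in f)

# "List" only files not seen before, then remember them; the originals stay in place.
new_files = [name for name in os.listdir(SOURCE_DIR) if name not in seen]
with open(STATE_FILE, "a") as f:
    for name in new_files:
        f.write(name + "\n")

print("Would fetch:", new_files)
```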
05-30-2018 03:06 AM
Adding to Bryan's answer: if you have the schema available to put in the registry, you can set the registry's Validate Field Names property to false, meaning you can have field names defined in the Avro schema that do not conform to the stricter Avro naming rules. We should consider adding this property to readers that generate their own schema, such as CSVReader...
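For reference, Avro requires names to start with a letter or underscore and to contain only letters, digits, and underscores. A small Python sketch of that check, with hypothetical CSV headers:

```python
import re

# Avro names must match [A-Za-z_][A-Za-z0-9_]* ; the CSV headers below are hypothetical examples.
AVRO_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')

for header in ["customer_id", "first-name", "2ndAddress"]:
    ok = bool(AVRO_NAME.match(header))
    print("{}: {}".format(header, "valid Avro name" if ok else "needs Validate Field Names = false"))
```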
05-29-2018 05:31 PM
InferAvroSchema should have Schema Output Destination set to "flowfile-attribute"; the outgoing flow file will then contain the CSV data and an attribute called "inferred.avro.schema" containing the schema to use. In ConvertCSVToAvro you can then set the Record Schema property to "${inferred.avro.schema}", which will cause it to use the inferred schema for the conversion. Since you are entering the CSV Header Definition manually, you may find it more helpful to create an Avro schema by hand and use ConvertRecord instead of InferAvroSchema -> ConvertCSVToAvro. If you don't know the datatypes of the columns and are relying on InferAvroSchema to work them out for you, you can still use ConvertRecord instead of ConvertCSVToAvro.
05-29-2018 05:03 PM
The HiveConnectionPool is a special type of DBCPConnectionPool, so instances of it get listed with all the others; it is not the connection pool that doesn't support those operations, but the driver itself. What format is the input data in? You should be able to use ConvertRecord with a FreeFormTextRecordSetWriter to generate SQL from your input data (don't forget the semicolon at the end of each statement), then send that to PutHiveQL.
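As a rough illustration of the kind of text such a writer template would emit per record, sketched in plain Python rather than NiFi configuration (the table and column names are hypothetical):

```python
# Sketch of the SQL text a per-record template could produce; table and columns are hypothetical.
record = {"id": 1, "name": "example"}

# One INSERT statement per record, terminated with a semicolon for PutHiveQL.
sql = "INSERT INTO my_table (id, name) VALUES ({id}, '{name}');".format(**record)
print(sql)  # INSERT INTO my_table (id, name) VALUES (1, 'example');
```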
05-29-2018 04:09 PM
1 Kudo
ListHDFS emits empty (0-byte) flow files with attributes set on them (such as filename and path; see the documentation for details). In this case FetchHDFS is running much more slowly than ListHDFS (it takes longer to retrieve a file than to list that it's there), which is why the queue backs up. Also, setting a maximum size as the backpressure trigger won't work here, since the flow files are 0 bytes; set a maximum number of objects for backpressure instead.
05-29-2018 03:43 PM
What version of the Hive driver are you using? I'm not sure there is a version of the Hive driver available that supports all the JDBC API calls made by PutDatabaseRecord, such as executeBatch(). Also, since the Hive JDBC driver auto-commits after each operation, PutDatabaseRecord + Hive would not be any more performant than using PutHiveQL. In an upcoming version of NiFi/HDF (for Hive 3.0), you should be able to use PutHive3Streaming to do what you want.
05-27-2018 07:35 PM
1 Kudo
The variables are in a ComponentVariableRegistry, which is pretty well hidden under the NiFi API. Usually you get at variables by evaluating Expression Language in the context of a processor property. In this case I set a Process Group variable called "myAttr" to "myValue", then configured ExecuteScript with a user-defined property "myProp" whose value is an Expression Language construct containing the PG variable name. When you call evaluateAttributeExpressions() on that property in the script, it resolves the value of "myAttr" and returns it; you can then verify that an outgoing flow file has "myFlowFileAttr" set to "myValue".
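A minimal sketch of that kind of script body, assuming ExecuteScript's Script Engine is set to python (Jython) and the user-defined property is named "myProp" with the value ${myAttr}; ExecuteScript binds dynamic properties into the script as PropertyValue variables:

```python
# ExecuteScript (Jython) sketch: resolve a Process Group variable through Expression Language.
# Assumes a user-defined property "myProp" on the processor whose value is ${myAttr}.
flowFile = session.get()
if flowFile is not None:
    # evaluateAttributeExpressions() resolves ${myAttr} (here to "myValue").
    resolved = myProp.evaluateAttributeExpressions(flowFile).getValue()
    flowFile = session.putAttribute(flowFile, 'myFlowFileAttr', resolved)
    session.transfer(flowFile, REL_SUCCESS)
```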
05-27-2018 03:36 AM
1 Kudo
What does your CREATE TABLE statement look like, and does it match the schema of the file(s) (Avro/ORC) you're sending to the external location?
05-25-2018 11:42 PM
1 Kudo
Your PutHDFS processor is placing the data into Hadoop (in ORC format, after ConvertAvroToORC) for use by Hive, so you don't also need to send an INSERT statement to PutHiveQL. Rather, with the pattern you're using, ReplaceText should set the flow file content to a Hive DDL statement that creates a table on top of the ORC files' location, or to a LOAD DATA INPATH statement that loads from the HDFS location into an existing Hive table.
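For illustration, the two flavors of statement ReplaceText could set as the flow file content, sketched in Python with a hypothetical table, columns, and HDFS path:

```python
# Hypothetical table name, columns, and HDFS directory, for illustration only.
hdfs_dir = "/data/orc/my_table"

# Option 1: create an external table on top of the ORC files' directory.
create_ddl = (
    "CREATE EXTERNAL TABLE IF NOT EXISTS my_table (id INT, name STRING) "
    "STORED AS ORC LOCATION '{}';".format(hdfs_dir)
)

# Option 2: load the ORC files into an existing Hive table.
load_stmt = "LOAD DATA INPATH '{}' INTO TABLE my_table;".format(hdfs_dir)

print(create_ddl)
print(load_stmt)
```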
05-23-2018 03:49 PM
I left a response on my other answer but will post it here too in case you hadn't seen it: looking at the parquet-avro code, I think your suggested workaround of changing decimal values to fixed is the right approach (for now). We could update the version of parquet-avro, but I didn't see anything in there that would improve your situation; it was Impala that needed to support more incoming types.