Member since: 11-16-2015
Posts: 905
Kudos Received: 666
Solutions: 249
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 509 | 09-30-2025 05:23 AM |
| | 826 | 06-26-2025 01:21 PM |
| | 751 | 06-19-2025 02:48 PM |
| | 935 | 05-30-2025 01:53 PM |
| | 11682 | 02-22-2024 12:38 PM |
05-30-2018 03:39 AM
Are your files still in the source directory? GetFile removes the source files by default, so if something went wrong downstream you may no longer have those files; you'd need to set Keep Source File to true in that case (as Matt Clarke recommended). That's why the recommended option is ListFile -> FetchFile: ListFile keeps track of the files it has already seen and won't list (and thus fetch) them again.
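Conceptually, here is what the List/Fetch pattern buys you. This is just a toy Python sketch of state tracking, not NiFi code; the directory and state-file paths are hypothetical placeholders:

```python
import os

SOURCE_DIR = "/data/incoming"         # hypothetical source directory
STATE_FILE = "/tmp/listed_files.txt"  # hypothetical state, analogous to ListFile's stored state

# Load the set of filenames that have already been listed.
seen = set()
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        seen = set(line.strip() for line in f)

# "List" only files not seen before, then remember them; the originals stay in place.
new_files = [name for name in os.listdir(SOURCE_DIR) if name not in seen]
with open(STATE_FILE, "a") as f:
    for name in new_files:
        f.write(name + "\n")

print("Would fetch:", new_files)
```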
05-30-2018 03:06 AM
Adding to Bryan's answer: if you have the schema available to put in the registry, you can set the registry's Validate Field Names property to false, meaning you can have field names defined in the Avro schema that do not conform to the stricter Avro naming rules. We should consider adding this property to readers that generate their own schema, such as CSVReader...
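For reference, Avro requires names to start with a letter or underscore and to contain only letters, digits, and underscores. A small Python sketch of that check, with hypothetical CSV headers:

```python
import re

# Avro names must match [A-Za-z_][A-Za-z0-9_]* ; the CSV headers below are hypothetical examples.
AVRO_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')

for header in ["customer_id", "first-name", "2ndAddress"]:
    ok = bool(AVRO_NAME.match(header))
    print("{}: {}".format(header, "valid Avro name" if ok else "needs Validate Field Names = false"))
```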
05-29-2018 05:31 PM
InferAvroSchema should have Schema Output Destination set to "flowfile-attribute"; the outgoing flow file will then contain the CSV data and an attribute called "inferred.avro.schema" containing the schema to use. In ConvertCSVToAvro you can then set the Record Schema property to "${inferred.avro.schema}", which will cause it to use the inferred schema for the conversion. Since you are entering the CSV Header Definition manually, you may find it more helpful to create an Avro schema by hand and use ConvertRecord instead of InferAvroSchema -> ConvertCSVToAvro. If you don't know the datatypes of the columns and are relying on InferAvroSchema to work them out for you, you can still use ConvertRecord instead of ConvertCSVToAvro.
05-29-2018 05:03 PM
The HiveConnectionPool is a special type of DBCPConnectionPool, so instances of it get listed with all the others; it is not the connection pool that doesn't support those operations, but the driver itself. What format is the input data in? You should be able to use ConvertRecord with a FreeFormTextRecordSetWriter to generate SQL from your input data (don't forget the semicolon at the end of each statement), then send that to PutHiveQL.
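As a rough illustration of the kind of text such a writer template would emit per record, sketched in plain Python rather than NiFi configuration (the table and column names are hypothetical):

```python
# Sketch of the SQL text a per-record template could produce; table and columns are hypothetical.
record = {"id": 1, "name": "example"}

# One INSERT statement per record, terminated with a semicolon for PutHiveQL.
sql = "INSERT INTO my_table (id, name) VALUES ({id}, '{name}');".format(**record)
print(sql)  # INSERT INTO my_table (id, name) VALUES (1, 'example');
```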
05-29-2018 04:09 PM
1 Kudo
ListHDFS emits empty (0-byte) flow files with attributes set on them (such as filename and path; see the documentation for details). In this case FetchHDFS is running much more slowly than ListHDFS (it takes longer to retrieve a file than to list that it's there), which is why the queue backs up. Also, setting a maximum size as the backpressure trigger won't work here, since the flow files are 0 bytes; set a maximum number of objects for backpressure instead.
05-29-2018 03:43 PM
What version of the Hive driver are you using? I'm not sure there is a version of the Hive driver available that supports all the JDBC API calls made by PutDatabaseRecord, such as executeBatch(). Also, since the Hive JDBC driver auto-commits after each operation, PutDatabaseRecord + Hive would not be any more performant than using PutHiveQL. In an upcoming version of NiFi/HDF (for Hive 3.0), you should be able to use PutHive3Streaming to do what you want.
05-27-2018 07:35 PM
1 Kudo
The variables are in a ComponentVariableRegistry, which is pretty well hidden under the NiFi API. Usually you get at variables by evaluating Expression Language in the context of a processor property. In this case I set a Process Group variable called "myAttr" to "myValue", then configured ExecuteScript with a user-defined property "myProp" whose value is an Expression Language construct containing the PG variable name. When you call evaluateAttributeExpressions() on that property in the script, it resolves the value of "myAttr" and returns it; you can then verify that an outgoing flow file has "myFlowFileAttr" set to "myValue".
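A minimal sketch of that kind of script body, assuming ExecuteScript's Script Engine is set to python (Jython) and the user-defined property is named "myProp" with the value ${myAttr}; ExecuteScript binds dynamic properties into the script as PropertyValue variables:

```python
# ExecuteScript (Jython) sketch: resolve a Process Group variable through Expression Language.
# Assumes a user-defined property "myProp" on the processor whose value is ${myAttr}.
flowFile = session.get()
if flowFile is not None:
    # evaluateAttributeExpressions() resolves ${myAttr} (here to "myValue").
    resolved = myProp.evaluateAttributeExpressions(flowFile).getValue()
    flowFile = session.putAttribute(flowFile, 'myFlowFileAttr', resolved)
    session.transfer(flowFile, REL_SUCCESS)
```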
05-27-2018 03:36 AM
1 Kudo
What does your CREATE TABLE statement look like, and does it match the schema of the file(s) (Avro/ORC) you're sending to the external location?
05-25-2018 11:42 PM
1 Kudo
Your PutHDFS processor is placing the data into Hadoop (in ORC format, after ConvertAvroToORC) for use by Hive, so you don't also need to send an INSERT statement to PutHiveQL. Rather, with the pattern you're using, ReplaceText should set the flow file content to a Hive DDL statement that creates a table on top of the ORC files' location, or to a LOAD DATA INPATH statement that loads from the HDFS location into an existing Hive table.
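For illustration, the two flavors of statement ReplaceText could set as the flow file content, sketched in Python with a hypothetical table, columns, and HDFS path:

```python
# Hypothetical table name, columns, and HDFS directory, for illustration only.
hdfs_dir = "/data/orc/my_table"

# Option 1: create an external table on top of the ORC files' directory.
create_ddl = (
    "CREATE EXTERNAL TABLE IF NOT EXISTS my_table (id INT, name STRING) "
    "STORED AS ORC LOCATION '{}';".format(hdfs_dir)
)

# Option 2: load the ORC files into an existing Hive table.
load_stmt = "LOAD DATA INPATH '{}' INTO TABLE my_table;".format(hdfs_dir)

print(create_ddl)
print(load_stmt)
```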
05-23-2018 03:49 PM
I left a response on my other answer but will post it here too in case you hadn't seen it: looking at the parquet-avro code, I think your suggested workaround of changing decimal values to fixed is the right approach (for now). We could update the version of parquet-avro, but I didn't see anything in there that would improve your situation; it was Impala that needed to support more incoming types.