Support Questions

[Solved] NiFi: create a Hive table of log errors


Hi,

I'm using NiFi 1.0 to read a file and move it into HDFS. While this flow is running, an error can occur (for example, a file with the same name already exists...). When an error happens, I'd like to write a row to a file with this simple structure: date; filename; error (if an error occurred) OR success (if the file was loaded into HDFS successfully).

Once I have this file, I'd put it on HDFS if possible. It would populate a Hive table from which I could build a dashboard showing the status of the file uploads.

My problem is: how can I create this file with NiFi?

Thanks

1 ACCEPTED SOLUTION

Master Guru

Wherever the error happens in the flow (sounds like PutHDFS in your example), there is likely a "failure" relationship (or something of the kind) for that processor. You can route failed flow files to a separate branch, where you can perform your error handling. For your example, you can have PutHDFS route "failure" to an UpdateAttribute that sets some attribute like "status" to "error", and PutHDFS could route "success" to an UpdateAttribute that sets "status" to "success".
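To make that concrete (the attribute name and values here are just illustrative, not anything NiFi requires): in the UpdateAttribute on the failure branch, add a dynamic property named status with a value of error, and in the UpdateAttribute on the success branch add the same property with a value of success. UpdateAttribute creates (or overwrites) a flow file attribute for each dynamic property you define, so every flow file arriving at the next step carries its own status.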

Assuming your Hive table is created atop CSV files, then at this point you could route both back to a ReplaceText that creates a comma-separated line with the values, using Expression Language to get the date, filename, and the value of the status attribute, so something like: ${now()},${filename},${status}
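For illustration only (the timestamp format and sample values are assumptions, and status is the attribute set above): a ReplaceText with a Replacement Strategy of something like "Always Replace" and a Replacement Value of ${now():format('yyyy-MM-dd HH:mm:ss')},${filename},${status} would turn each flow file's content into a single CSV row such as 2017-01-05 14:32:10,myapp.log,error. The filename attribute is a standard core attribute, so it should already be present on the flow file.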

You should avoid having small files in HDFS, so you wouldn't want to write each individual line as a file to HDFS. Instead consider the MergeContent processor to concatenate many rows together, then use a PutHDFS to stage the larger file in Hadoop for use by Hive. If MergeContent et al doesn't give you the file(s) you need, you can always use an ExecuteScript processor for any custom processing needed.
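If you do end up needing ExecuteScript, here is a minimal sketch of what a Jython script for it could look like, as an alternative way of building the CSV row (the status attribute name and the timestamp format are assumptions carried over from the flow above, not anything ExecuteScript requires):

# Minimal ExecuteScript (Jython) sketch: replace the flow file content
# with a single "date,filename,status" CSV row.
import datetime
from org.apache.nifi.processor.io import StreamCallback

class WriteStatusRow(StreamCallback):
    def __init__(self, row):
        self.row = row
    def process(self, inputStream, outputStream):
        # Ignore the incoming content and write just the CSV row
        outputStream.write(bytearray(self.row.encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    status = flowFile.getAttribute('status') or 'unknown'  # set by UpdateAttribute earlier (assumption)
    filename = flowFile.getAttribute('filename')           # standard core attribute
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    row = '%s,%s,%s\n' % (timestamp, filename, status)
    flowFile = session.write(flowFile, WriteStatusRow(row))
    session.transfer(flowFile, REL_SUCCESS)

In the ExecuteScript processor you'd set the Script Engine to python (Jython) and paste this into the Script Body; session, flow file handling, and the REL_SUCCESS relationship are the standard bindings ExecuteScript exposes to scripts.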

If your Hive table expects Avro or ORC format for the files, there are processors for these conversions as well (although you may have to convert to intermediate formats such as JSON first, see the documentation for more details).


3 REPLIES


Is it possible to do this by creating a Python script?

Could someone help me with the creation of this script?

Thanks

Master Guru

I've left a possible solution as a separate answer. Doing all the processing with a Python script is not ideal, as you'd need your own Hadoop/Hive client libraries, and all you'd be using NiFi for is executing the external Python script. However, if you just need some custom processing during the flow, you can use ExecuteScript (see my other answer) with Jython; I have some examples on my blog.
