Created on 09-05-2017 07:58 PM - edited 08-18-2019 02:20 AM
I'm taking raw pipe-delimited text files, converting them to Avro and then converting them to ORC files (because ORC files are awesome), and everything is working swimmingly, except the conversion from Avro to ORC is extremely slow, which is causing my processing to back up indefinitely.
Is there a better method to convert raw text into an ORC file in NiFi or some kind of efficiency that can be gained to allow the data to flow through much faster?
Created 09-06-2017 11:42 AM
Hi @Aaron Dunlap,
Depending on the HDF version you are using, you could leverage the record-oriented processors to perform the CSV-to-Avro conversion much more efficiently. I assume you're then converting to ORC format so you can query the data using Hive. If that's the case, a common pattern is to let Hive do the conversion. What I usually do is send the data into a landing folder in HDFS as Avro, then use a PutHiveQL processor to execute a few queries (see the sketch below):
- one to create a temporary external table on top of the Avro data using the corresponding Avro schema,
- one to insert-select the data from the temporary table into the final table, which is ORC, and
- one to drop the temporary table.
Finally, a DeleteHDFS processor deletes the data used to create the temporary table, because the DROP TABLE statement does not delete the underlying files when the table is external.
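For illustration, a minimal sketch of the three statements that PutHiveQL would run; the table names, HDFS paths, and schema URL are placeholders, not anything from your flow:

```sql
-- 1) Hypothetical external staging table over the Avro landing folder;
--    the schema is read from the .avsc file rather than declared inline
CREATE EXTERNAL TABLE staging_avro
STORED AS AVRO
LOCATION '/landing/mydata'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/mydata.avsc');

-- 2) Insert-select into the final ORC-backed table; Hive performs the
--    Avro-to-ORC conversion here
INSERT INTO TABLE final_orc SELECT * FROM staging_avro;

-- 3) Drop the staging table; because it is EXTERNAL, the Avro files stay
--    on HDFS, which is why the DeleteHDFS step is still needed afterwards
DROP TABLE staging_avro;
```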
There is an ORC reader/writer on the roadmap that will replace all of that (you'll be able to directly convert from CSV to ORC using record-oriented processors) but that's not ready yet.
Hope this helps.
Created 09-06-2017 11:45 AM
Another option, without modifying your current workflow, is to configure your ConvertAvroToORC processor to use parallel threads. To do that, you can change the "concurrent tasks" parameter in the "scheduling" tab of the configuration.
Created 09-06-2017 05:51 PM
This made a *huge* difference. I'll accept the top-level answer, but parallel processing was the big win in this case. The processor itself is still fairly slow, though that may just be inherent to the work it's doing. I'm wondering if moving the data into memory prior to processing would make any difference.
Thanks for the heads up though!