Created 07-25-2018 02:38 PM
Hi,
I am developing a NiFi web service to export data lake content (stored as .parquet) as .csv.
I managed to do it using a HiveQL processor, but I want to do it without Hive.
What I imagined was:
- get the .parquet file with WebHDFS (InvokeHTTP call from NiFi)
- use a NiFi processor to convert the .parquet file to .csv
Is there a NiFi processor that does this? The only option I have found so far is to use a Spark job, which sounds a bit complicated for this purpose.
Thanks.
Created 07-25-2018 02:49 PM
I believe FetchParquet does what you need:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-parquet-nar/1.5.0/org.apache....
Created 07-25-2018 03:05 PM
Thanks for your answer, but as I understand it, FetchParquet will get the .parquet file and put its content in the flow file, but it won't help to export it as .csv.
The flow file content will still be the binary Parquet version of the data.
I plan to do the equivalent of FetchParquet with a REST call to WebHDFS.
Created 07-25-2018 06:27 PM
Currently there is nothing OOTB that will parse Parquet files in NiFi, but I have written NIFI-5455 to cover the addition of a ParquetReader, so that incoming Parquet files could be operated on like other supported formats. As a workaround, there is a ScriptedReader where you could write your own in Groovy, JavaScript, Jython, etc.
Created 07-26-2018 08:15 AM
Thanks Matt for the clear answer!
Created 07-26-2018 01:26 PM
Just wanted to add some more info...
The Parquet Java API only allows reading and writing through Hadoop's Filesystem API. This is why NiFi currently can't provide a standard record reader and writer: those require reading from Java's InputStream and writing to OutputStream, which Parquet doesn't support.
So PutParquet can be configured with a record reader to handle any incoming data; it then converts that data to Parquet and writes it to HDFS. Essentially it encapsulates a record writer that can only write to HDFS.
FetchParquet does the reverse: it reads Parquet files from HDFS and can be configured with a record writer to write the records out in any format, in your case CSV.
You can always create a core-site.xml that points to the local filesystem to trick the Parquet processors into using local disk instead of HDFS, as sketched below.
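For reference, a minimal core-site.xml along those lines might look like the following (a sketch, assuming you point the processors' Hadoop Configuration Resources property at this file so that fs.defaultFS resolves to the local filesystem):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
</configuration>

With that in place, the directory/file paths given to PutParquet and FetchParquet are interpreted as local paths rather than HDFS paths.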
Created 08-14-2018 01:03 PM
I want to read a Parquet file and convert each record into a JSON flow file. However, FetchParquet will get the .parquet file and put its content into a single flow file; it doesn't read each record individually from the Parquet file into its own flow file. Is there any way I can read the Parquet file and convert each record into a JSON flow file using NiFi?
Created 08-14-2018 01:37 PM
FetchParquet has a property for a record writer... when it fetches the Parquet file, it reads it record by record using Parquet's Avro reader and then passes each record to the configured writer.
So if you configure it with a JSON record writer, the resulting flow file will contain JSON.
If you wanted to fetch raw Parquet, you wouldn't use FetchParquet; you would instead use FetchHDFS, which fetches the bytes unmodified.
Created 08-14-2018 02:29 PM
Thanks for the quick reply, @Bryan Bende. I tried to use the JSON record writer, but I don't have the Parquet schema information. How should I configure the JSON record writer so that it will emit each message as a flow file?
My current record writer configuration is:
- Schema Write Strategy: Set 'schema.name' Attribute
- Schema Access Strategy: Use 'Schema Name' Property
- Schema Registry: Incompatible Controller Service Configured
- Schema Name: ${schema.name}
- Schema Text: ${avro.schema}
- Date Format: No value set
- Time Format: No value set
- Timestamp Format: No value set
- Pretty Print JSON: false
Created 08-14-2018 02:33 PM
The Parquet data itself has the schema, and your writer should be configured with a Schema Access Strategy that inherits the schema from the reader:
Schema Access Strategy: inherit-record-schema
This will produce a flow file with many records.
If you need one record per flow file, you can use SplitRecord after this; however, it is generally better to keep many records together.