Created 07-25-2018 02:38 PM
Hi,
I am developing a NiFi web service to export data lake content (stored as .parquet) as .csv.
I managed to do it using a HiveQL processor, but I want to do it without Hive.
What I imagined was:
- get the .parquet file with WebHDFS (InvokeHTTP call from NiFi)
- use a NiFi processor to convert the .parquet file to .csv
Is there a NiFi processor that does this? The only option I have found so far is to use a Spark job, which sounds a bit complicated for this purpose.
Thanks.
Created 07-25-2018 02:49 PM
I believe FetchParquet does what you need:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-parquet-nar/1.5.0/org.apache....
Created 07-25-2018 03:05 PM
Thanks for your answer, but as I understand it, FetchParquet will get the .parquet file and put its content in the flow file, but it won't help to export it as .csv.
The flow file content will still be the binary Parquet version of the data.
I plan to do the equivalent of FetchParquet with a REST call to WebHDFS.
Created 07-25-2018 06:27 PM
Currently there is nothing OOTB that will parse Parquet files in NiFi, but I have written NIFI-5455 to cover the addition of a ParquetReader, so that incoming Parquet files could be operated on like other supported formats. As a workaround, there is a ScriptedReader where you could write your own in Groovy, JavaScript, Jython, etc.
Created 07-26-2018 08:15 AM
Thanks Matt for the clear answer!
Created 07-26-2018 01:26 PM
Just wanted to add some more info...
The Parquet Java API only allows reading and writing through Hadoop's Filesystem API. This is why NiFi currently can't provide a standard record reader and writer: those require reading from Java's InputStream and writing to OutputStream, which Parquet doesn't support.
So PutParquet can be configured with a record reader to handle any incoming data; it then converts that data to Parquet and writes it to HDFS. Essentially it encapsulates a record writer that can only write to HDFS.
FetchParquet does the reverse: it reads Parquet files from HDFS and can be configured with a record writer to write the records out in any format, in your case CSV.
You can always create a core-site.xml that points to the local filesystem to trick the Parquet processors into using local disk instead of HDFS, as sketched below.
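For reference, a minimal core-site.xml along those lines might look like the following (a sketch, assuming you point the processors' Hadoop Configuration Resources property at this file so that fs.defaultFS resolves to the local filesystem):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
  </property>
</configuration>

With that in place, the directory/file paths given to PutParquet and FetchParquet are interpreted as local paths rather than HDFS paths.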
Created 08-14-2018 01:03 PM
I want to read a Parquet file and convert each record into a JSON flow file. However, FetchParquet will get the .parquet file and put its content into a single flow file; it doesn't read each record individually from the Parquet file into its own flow file. Is there any way I can read the Parquet file and convert each record into a JSON flow file using NiFi?
Created 08-14-2018 01:37 PM
FetchParquet has a property for a record writer... when it fetches the Parquet file, it reads it record by record using Parquet's Avro reader and then passes each record to the configured writer.
So if you configure it with a JSON record writer, the resulting flow file will contain JSON.
If you wanted to fetch raw Parquet, you wouldn't use FetchParquet; you would instead use FetchHDFS, which fetches the bytes unmodified.
Created 08-14-2018 02:29 PM
Thanks for the quick reply, @Bryan Bende. I tried to use the JSON record writer, but I don't have the Parquet schema information. How should I configure the JSON record writer so that it will emit each message as a flow file?
My current record writer configuration is:
- Schema Write Strategy: Set 'schema.name' Attribute
- Schema Access Strategy: Use 'Schema Name' Property
- Schema Registry: Incompatible Controller Service Configured
- Schema Name: ${schema.name}
- Schema Text: ${avro.schema}
- Date Format: No value set
- Time Format: No value set
- Timestamp Format: No value set
- Pretty Print JSON: false
Created 08-14-2018 02:33 PM
The Parquet data itself has the schema, and your writer should be configured with a Schema Access Strategy that inherits the schema from the reader:
Schema Access Strategy: inherit-record-schema
This will produce a flow file with many records.
If you need one record per flow file, you can use SplitRecord after this; however, it is generally better to keep many records together.