How to export parquet file to csv (without Hive)

Explorer

Hi,

I am developing a NiFi web service to export data lake content (stored as .parquet) as .csv.

I managed to do it using the HiveQL processor, but I want to do it without Hive.

What I imagined was:

- get the .parquet file with WebHDFS (an InvokeHTTP call from NiFi)

- use a NiFi processor to convert the .parquet file to .csv

Is there a NiFi processor that does this? The only option I have found so far is to use a Spark job, which sounds a bit complicated for this purpose.
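
For the first step, the WebHDFS read I have in mind from InvokeHTTP would be something along these lines (host, port, path and user are placeholders; the NameNode then redirects to a DataNode that streams the file bytes):

    GET http://<namenode-host>:<namenode-http-port>/webhdfs/v1/<path/to/file.parquet>?op=OPEN&user.name=<user>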

Thanks.


10 REPLIES

Explorer

Thanks for your answer, but as I understand it, FetchParquet will get the .parquet file and put its content in the flow file, but it won't help to export it as .csv.

The flow file content will still be the binary Parquet version of the data.

I plan to do the equivalent of FetchParquet with a REST call to WebHDFS.

Master Guru (Accepted Solution)

Currently there is nothing OOTB that will parse Parquet files in NiFi, but I have written NIFI-5455 to cover the addition of a ParquetReader, so that incoming Parquet files can be handled like other supported formats. As a workaround, there is ScriptedReader, where you could write your own reader in Groovy, Javascript, Jython, etc.

Explorer

Thanks Matt for the clear answer!

Master Guru

Just wanted to add some more info...

The Parquet Java API only allows reading and writing through Hadoop's Filesystem API. This is why NiFi currently can't provide a standard record reader and writer: those require reading from and writing to Java's InputStream and OutputStream, which Parquet doesn't provide.
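
To illustrate the point, a rough sketch of reading Parquet with the Avro-based Parquet API in plain Java and printing naive CSV might look like this (assuming parquet-avro and the Hadoop client libraries are on the classpath; note the reader is built from a Hadoop Path, not a java.io.InputStream, and real CSV output would need proper quoting/escaping):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;

    public class ParquetToCsv {
        public static void main(String[] args) throws Exception {
            // The builder takes a Hadoop Path (hdfs://... or file:///...), not an InputStream.
            try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(new Path(args[0])).build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    // Naive CSV line: field values joined by commas, no quoting or escaping.
                    StringBuilder line = new StringBuilder();
                    for (Schema.Field f : record.getSchema().getFields()) {
                        if (line.length() > 0) line.append(',');
                        line.append(record.get(f.name()));
                    }
                    System.out.println(line);
                }
            }
        }
    }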

So PutParquet can be configured with a record reader to handle any incoming data, and it then converts that data to Parquet and writes it to HDFS; basically it has a record writer encapsulated in it that can only write to HDFS.

FetchParquet does the reverse: it reads Parquet files from HDFS and can be configured with a record writer to write them out in any format, in your case CSV.
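
So for the original CSV question, a flow along the lines of ListHDFS -> FetchParquet -> PutFile should do it, with FetchParquet configured roughly like this (property names from memory, so double-check them against your NiFi version):

    Filename: ${path}/${filename}
    Record Writer: a CSVRecordSetWriter controller service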

You can always create a core-site.xml that points at the local filesystem to trick the Parquet processors into using local disk instead of HDFS.
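
For example, a minimal core-site.xml for that trick would just point the default filesystem at local disk (fs.defaultFS is the standard Hadoop property for this), and you would reference the file from the processor's Hadoop Configuration Resources property:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>file:///</value>
      </property>
    </configuration>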

Rising Star

I want to read the Parquet file and convert each record into a JSON flow file. However, FetchParquet will get the .parquet file and put its content in a single flow file; it doesn't read each record individually from the Parquet file into flow files record by record. Is there any way I can read the Parquet file and convert each record into a JSON flow file using NiFi?

Master Guru

FetchParquet has a property for a record writer... when it fetches the Parquet file, it will read it record by record using Parquet's Avro reader, and then pass each record to the configured writer.

So if you configure it with a JSON record writer, then the resulting flow file that is fetched will contain JSON.

If you wanted to fetch raw Parquet, then you wouldn't use FetchParquet, but would instead just use FetchHDFS, which fetches the bytes unmodified.

Rising Star

Thanks for the quick reply, @Bryan Bende. I tried to use the JSON record writer, but I don't have the Parquet schema information. How should I configure the JSON record writer so that I get each message as a flow file?

My JSON record writer configuration:

Schema Write Strategy: Set 'schema.name' Attribute
Schema Access Strategy: Use 'Schema Name' Property
Schema Registry: Incompatible Controller Service Configured
Schema Name: ${schema.name}
Schema Text: ${avro.schema}
Date Format: No value set
Time Format: No value set
Timestamp Format: No value set
Pretty Print JSON: false

Master Guru

The Parquet data itself has the schema, and your writer should be configured with the 'Inherit Record Schema' schema access strategy, so it inherits the schema from the reader.

Schema Access Strategy: inherit-record-schema

  • Use 'Schema Name' Property: The name of the Schema to use is specified by the 'Schema Name' Property. The value of this property is used to look up the Schema in the configured Schema Registry service.
  • Inherit Record Schema: The schema used to write records will be the same schema that was given to the Record when the Record was created.
  • Use 'Schema Text' Property: The text of the Schema itself is specified by the 'Schema Text' Property. The value of this property must be a valid Avro Schema. If Expression Language is used, the value of the 'Schema Text' property must be valid after substituting the expressions.
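
So for the writer configuration you posted, the only change that should be needed is:

    Schema Access Strategy: Inherit Record Schema

The Schema Registry, Schema Name, and Schema Text properties should then be able to stay as they are, since they aren't used with that strategy.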

This will produce a flow file with many records.

If you need one record per flow file, then you would use SplitRecord after this; however, it is generally better to keep many records together.
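
If you do go the SplitRecord route, it takes a record reader and writer plus a 'Records Per Split' property (property name from memory, so double-check it in your NiFi version); something like this should give one record per flow file when the incoming content is the JSON that FetchParquet wrote:

    Record Reader: a JSON reader matching the incoming content
    Record Writer: your JSON writer
    Records Per Split: 1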