Apache NiFi image metadata extraction (created_date, etc.)

Expert Contributor

I am trying to implement a data flow in NiFi that picks up JPEG, PNG, and PDF files from a directory, extracts image and PDF metadata such as created_date, filepath, and date_modified, and stores that metadata in HDFS. I can use PutHDFS for the last step. There is an ExtractImageMetadata processor, but it doesn't have many properties to fulfill my requirement. Please let me know if I need a custom processor here. Thank you.

1 ACCEPTED SOLUTION

New Contributor

I tested JPG files from two sources: my Google Drive and Flickr. The Flickr file yields much more metadata, and it is good metadata that you can use further down your data flow.

I do not believe that this works on PDF: I ran a PDF through the processor and it did not extract any metadata.

Sample metadata:

Key: 'Exif IFD0.Date/Time' Value: '2015:11:06 19:06:53'

Key: 'Exif IFD0.Make' Value: 'samsung'

Key: 'Exif IFD0.Model' Value: 'SM-N910T'
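
As an illustration of using this metadata further down a flow, here is a minimal Python sketch (not NiFi code) that parses the date value from the sample above into a proper timestamp. The dictionary simply mirrors the sample keys and values, and the function name `parse_exif_datetime` is made up for this example. It assumes the value always follows the `YYYY:MM:DD HH:MM:SS` layout seen in the sample, with no time zone.

```python
from datetime import datetime

# Keys and values copied from the sample metadata above.
attributes = {
    "Exif IFD0.Date/Time": "2015:11:06 19:06:53",
    "Exif IFD0.Make": "samsung",
    "Exif IFD0.Model": "SM-N910T",
}


def parse_exif_datetime(value: str) -> datetime:
    """Parse a 'YYYY:MM:DD HH:MM:SS' string (hypothetical helper, no time zone)."""
    return datetime.strptime(value, "%Y:%m:%d %H:%M:%S")


captured = parse_exif_datetime(attributes["Exif IFD0.Date/Time"])
print(captured.isoformat())  # 2015-11-06T19:06:53
```

Converting the value to ISO 8601 this way makes it easier to sort or filter on once the metadata is stored.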


7 REPLIES 7

avatar

The ExtractImageMetadata processor uses this image metadata extraction library: https://github.com/drewnoakes/metadata-extractor

It is able to process many formats of metadata and types of files including JPEG and PNG. What format is the metadata on your JPEG and PNG files that isn't getting properly extracted?

Unfortunately, it does not process PDF files, but that could be a good new processor to have. What functionality and requirements would you like for an ExtractPdfMetadata processor?

Expert Contributor
@jpercivall

Thanks for the reply. Can you please share the NiFi template you created for ExtractImageMetadata? I cannot see any other properties on this processor. I am trying to use GetFile to fetch a file (PNG, JPEG, or PDF) from a directory, pass it to the ExtractImageMetadata processor, and store the metadata generated by this processor in HDFS. Maybe I am doing it incorrectly. Can you please share an example or a link showing how to use the ExtractImageMetadata processor? Thank you.

1285-screenshot-from-2016-01-11-15-32-29.png

avatar

You shouldn't need to configure the ExtractImageMetadata processor aside from the maximum number of attributes to add to the FlowFile (so you don't blow out your Java heap). It will automatically extract all the metadata it can from the supported formats. PNG and JPEG are supported file formats, so if it's not extracting certain metadata from those files, then the metadata format may not be supported. Do you know the metadata format you're using?

Extracting metadata from PDF files is not currently supported. If we were to add a new processor, what functionality and requirements would you like for it?
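
To illustrate why a cap on the number of attributes is useful, here is a minimal Python sketch of the idea. It is not the processor's actual implementation. The function name `flatten_with_cap` and the nested input layout are assumptions made for this example; the `directory.tag` key style follows the sample metadata earlier in the thread.

```python
from typing import Dict


def flatten_with_cap(
    directories: Dict[str, Dict[str, str]], max_attributes: int
) -> Dict[str, str]:
    """Flatten {directory: {tag: value}} into 'directory.tag' keys.

    Stops adding entries once max_attributes is reached, so a file with
    an unusually large amount of embedded metadata cannot grow the
    attribute map without bound.
    """
    attributes: Dict[str, str] = {}
    for directory, tags in directories.items():
        for tag, value in tags.items():
            if len(attributes) >= max_attributes:
                return attributes
            attributes[f"{directory}.{tag}"] = value
    return attributes


# Illustrative input built from the sample values in this thread.
sample = {
    "Exif IFD0": {
        "Date/Time": "2015:11:06 19:06:53",
        "Make": "samsung",
        "Model": "SM-N910T",
    }
}

capped = flatten_with_cap(sample, max_attributes=2)
print(capped)  # only the first two tags are kept
```

With the cap set to 2, the third tag is dropped, which is the trade-off the setting makes: bounded memory use in exchange for possibly incomplete metadata.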

Expert Contributor

Thank you, @jpercivall. I am also looking to get metadata for PDF files. In terms of requirements, I would like to get metadata for PDF files, Excel files, or any other file: Dublin Core-style properties such as created_date, modified_date, file_location, etc. As for extracting metadata, I have created a relationship from GetFile to ExtractImageMetadata. Is that right? Since we don't have a way to get the file metadata from a directory, the best way is to create a relationship from GetFile to ExtractImageMetadata, isn't it? Please let me know.

avatar

The GetFile processor gets most of the file attributes when it ingests the file (see image). Are there specific attributes that you need but don't see in the FlowFile produced by GetFile?

The ExtractImageMetadata processor is solely for extracting metadata that is embedded in the image file itself, and it won't help you get system-level file metadata.

1351-upload-thumb.png
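
To make the distinction concrete, here is a minimal Python sketch (not NiFi code) of what "system-level" file metadata means: information the file system keeps about a file, independent of its contents, so it is available for a PDF just as it is for a JPEG. The function name `file_system_metadata` and the dictionary keys are illustrative only and are not NiFi attribute names.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_system_metadata(path: str) -> dict:
    """Collect metadata the file system holds about a file (hypothetical helper).

    Nothing here reads the file's contents, so it works for any file type.
    Creation time is omitted because it is not portable across platforms.
    """
    info = os.stat(path)
    absolute = os.path.abspath(path)
    return {
        "filename": os.path.basename(absolute),
        "directory": os.path.dirname(absolute),
        "size_bytes": info.st_size,
        "last_modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


# Usage: a throwaway file with a .pdf suffix stands in for a real document.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
    handle.write(b"%PDF-1.4\n")
    sample_path = handle.name

try:
    print(file_system_metadata(sample_path))
finally:
    os.remove(sample_path)
```

Embedded metadata such as the Exif tags shown earlier has to be parsed out of the file's bytes, which is why it needs a format-aware step on top of this.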

Expert Contributor

@jpercivall, thanks for the reply. So I believe the ExtractImageMetadata processor gets the metadata embedded in PNG and JPEG image files rather than system-level metadata. If GetFile already gets most of the attributes, then why do we need the ExtractImageMetadata processor? It would be good if you could share the GetFile and ExtractImageMetadata NiFi template that you said you have implemented. Thank you.