Created 01-09-2016 11:26 PM
I am trying to implement a data flow in NiFi that picks up JPEG, PNG, and PDF files from a directory, extracts image and PDF metadata such as created_date, filepath, and date_modified, and stores the metadata in HDFS. I can use PutHDFS for the last step. There is an ExtractImageMetadata processor, but it does not have many properties to fulfill my requirement. Please let me know if I need a custom processor here. Thank you.
Created 01-11-2016 02:55 PM
I tested JPG files from two sources, my Google Drive and Flickr. I get much more metadata from the Flickr file, and it is good metadata that you can use further down your data flow.
I do not believe this works on PDF: I ran a PDF through the processor and it did not extract any metadata.
Sample metadata:
Key: 'Exif IFD0.Date/Time' Value: '2015:11:06 19:06:53'
Key: 'Exif IFD0.Make' Value: 'samsung'
Key: 'Exif IFD0.Model' Value: 'SM-N910T'
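For anyone who wants to use these values further down the flow, here is a rough sketch in plain Python (not NiFi code) of turning the sample output above into a dictionary and parsing the EXIF date. The helper names `parse_pairs` and `parse_exif_datetime` are made up for illustration, and the sketch assumes the date always uses the `YYYY:MM:DD HH:MM:SS` layout seen in the sample.

```python
import re
from datetime import datetime

# Sample text in the shape shown in the post above.
sample = (
    "Key: 'Exif IFD0.Date/Time'Value: '2015:11:06 19:06:53' "
    "Key: 'Exif IFD0.Make'Value: 'samsung' "
    "Key: 'Exif IFD0.Model'Value: 'SM-N910T'"
)


def parse_pairs(text):
    """Hypothetical helper: turn Key: '...' Value: '...' text into a dict."""
    # \s* tolerates zero or more spaces between the key and "Value:".
    return dict(re.findall(r"Key: '([^']*)'\s*Value: '([^']*)'", text))


def parse_exif_datetime(value):
    """Hypothetical helper: parse 'YYYY:MM:DD HH:MM:SS' into a naive datetime."""
    return datetime.strptime(value, "%Y:%m:%d %H:%M:%S")


attributes = parse_pairs(sample)
taken = parse_exif_datetime(attributes["Exif IFD0.Date/Time"])
print(taken.isoformat())  # 2015-11-06T19:06:53
```

The result is a naive datetime with no time zone, since the sample value does not carry one.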
Created 01-11-2016 03:02 PM
The ExtractImageMetadata processor uses this image metadata extraction library: https://github.com/drewnoakes/metadata-extractor
It is able to process many formats of metadata and types of files including JPEG and PNG. What format is the metadata on your JPEG and PNG files that isn't getting properly extracted?
Unfortunately, it does not process PDF files, but that could be a good new processor to have. What functionality and requirements would you like for an ExtractPdfMetadata processor?
Created on 01-11-2016 03:52 PM - edited 08-19-2019 05:18 AM
Thanks for the reply. Can you please share the NiFi template you created for ExtractImageMetadata? I cannot see any other properties on this processor. I am trying to use GetFile to fetch a file (PNG, JPEG, or PDF) from a directory, pass it to the ExtractImageMetadata processor, and store the metadata generated by this processor in HDFS. Maybe I am doing it incorrectly. Can you please share an example or a link showing how to use the ExtractImageMetadata processor? Thank you.
Created 01-11-2016 07:26 PM
You shouldn't need to configure the ExtractImageMetadata processor aside from the maximum number of attributes to add to the FlowFile (so you don't blow away your Java heap). It will automatically extract all the metadata it can from the supported formats. PNG and JPEG are supported file formats, so if it's not extracting certain metadata from those, then the metadata format may not be supported. Do you know the metadata format you're using?
Extracting metadata from PDF files is not currently supported. If we were to add a new processor, what functionality and requirements would you like for it?
Created 01-11-2016 10:39 PM
Thank you @jpercivall. I am also looking to get metadata for PDF files. In terms of requirements, I would like metadata for PDF files, Excel files, or any other file, such as Dublin Core-style properties like created_date, modified_date, file_location, etc. As for extracting metadata, I have created a relationship from GetFile to ExtractImageMetadata. Is that right? Since we don't have a way to get file metadata from a directory, the best way is to create a relationship from GetFile to ExtractImageMetadata, isn't it? Please let me know.
Created on 01-13-2016 10:06 PM - edited 08-19-2019 05:18 AM
The GetFile processor gets most of the file attributes when it ingests the file (see image). Are there specific attributes you need that you don't see in the FlowFile produced by GetFile?
The ExtractImageMetadata processor is solely for extracting metadata that is embedded in the image file itself and won't help to get system-level file metadata.
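To illustrate what "system-level" metadata means here, this is a minimal sketch in plain Python (not NiFi code and not how GetFile is implemented) that reads the kind of values the file system keeps about any file, whatever its format. The function name `file_system_metadata` and the dictionary keys are hypothetical, chosen to echo the property names asked about earlier in the thread.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_system_metadata(path):
    """Hypothetical helper: collect metadata the file system tracks for a file."""
    st = os.stat(path)
    return {
        "file_location": os.path.abspath(path),
        "size_bytes": st.st_size,
        "date_modified": datetime.fromtimestamp(
            st.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Caveat: st_ctime is the creation time on Windows, but on Linux it is
        # the time of the last metadata change, not a true created date.
        "ctime": datetime.fromtimestamp(st.st_ctime, tz=timezone.utc).isoformat(),
    }


# Demo with a throwaway file; the content is only a PDF header line.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
    handle.write(b"%PDF-1.4\n")
    demo_path = handle.name

info = file_system_metadata(demo_path)
print(info)
os.remove(demo_path)
```

None of these values come from inside the file, which is why they are available for PDFs and images alike, while EXIF-style fields are only available when the file format embeds them.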
Created 01-13-2016 10:18 PM
@jpercivall, thanks for the reply. So I believe the ExtractImageMetadata processor gets the metadata embedded in PNG and JPEG image files rather than system-level metadata. If GetFile already gets most of the attributes, why do we need the ExtractImageMetadata processor? It would be good if you could share the GetFile and ExtractImageMetadata NiFi template that you said you implemented. Thank you.