
Using the GetHTTP processor, we grab random images from DigitalOcean's free image site. I give each image a random file name so we can save it uniquely in HDFS.
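The sample output below shows names like random1469112881039.jpg, i.e. a fixed prefix plus the current epoch time in milliseconds. A minimal sketch of that naming scheme outside NiFi (my own helper for illustration, not the flow's exact Expression Language configuration):

```python
import time


def random_image_name(prefix="random", ext="jpg"):
    """Build a unique filename from the current epoch time in
    milliseconds, yielding names like random1469112881039.jpg."""
    return f"{prefix}{int(time.time() * 1000)}.{ext}"


print(random_image_name())
```

Inside NiFi itself, the same idea is expressed with Expression Language in an UpdateAttribute or GetHTTP Filename property.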


The entire data flow, from GetHTTP to final HDFS storage of the image and its metadata as JSON.


ExtractMediaMetadata Processor
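ExtractMediaMetadata uses Apache Tika under the hood (you can see its apache-tika-*.tmp working file in the File Name field below) to pull fields like image width and height into flow file attributes. As a rough illustration of where those numbers live in the file itself, here is a stdlib-only sketch that reads a JPEG's dimensions from its Start-of-Frame segment (my own toy parser, not Tika's implementation):

```python
import struct


def jpeg_dimensions(data: bytes):
    """Scan JPEG segment markers for a Start-of-Frame (SOF) segment
    and return (width, height) in pixels, or None if none is found."""
    assert data[:2] == b"\xff\xd8", "not a JPEG stream"
    i = 2
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            i += 1
            continue
        marker = data[i + 1]
        # SOF0..SOF15 carry the frame dimensions, except DHT/JPG/DAC
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            height, width = struct.unpack(">HH", data[i + 5:i + 9])
            return width, height
        # Otherwise skip the segment using its declared length
        seg_len = struct.unpack(">H", data[i + 2:i + 4])[0]
        i += 2 + seg_len
    return None
```

In the real flow, of course, the processor does all of this for you and writes the values into attributes such as "Image Width" and "Image Height".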


The final results:

hdfs dfs -cat /mediametadata/random1469112881039.json

{
  "Number of Components": "3",
  "Resolution Units": "none",
  "Image Height": "200 pixels",
  "File Name": "apache-tika-3181704319795384377.tmp",
  "Data Precision": "8 bits",
  "File Modified Date": "Thu Jul 21 14:54:43 UTC 2016",
  "tiff:BitsPerSample": "8",
  "Compression Type": "Progressive, Huffman",
  "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
  "Component 1": "Y component: Quantization table 0, Sampling factors 2 horiz/2 vert",
  "Component 2": "Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert",
  "Component 3": "Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert",
  "X Resolution": "1 dot",
  "File Size": "4701 bytes",
  "tiff:ImageWidth": "200",
  "path": "./",
  "filename": "random1469112881039.jpg",
  "Image Width": "200 pixels",
  "Y Resolution": "1 dot"
}
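The last step of the flow turns that flat map of flow file attributes into a single JSON document before writing it to HDFS. A minimal sketch of that attribute-map-to-JSON step (the keys are copied from the sample above; the serialization is my own illustration, standing in for NiFi's AttributesToJSON-style conversion):

```python
import json

# A small subset of the flat attribute map shown above
attributes = {
    "filename": "random1469112881039.jpg",
    "Image Width": "200 pixels",
    "Image Height": "200 pixels",
    "File Size": "4701 bytes",
    "Compression Type": "Progressive, Huffman",
}

# Serialize the flat map to one JSON document, as stored in HDFS
doc = json.dumps(attributes, indent=2)
print(doc)
```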

We can download as many images as we want. Using the request parameters, I fixed the image width at 200 pixels; you can customize that.

Below is the image downloaded with the above metadata.


Last update: 08-17-2019 11:13 AM