Image Classification with TensorFlow & Spark

anatva · ‎09-09-2017

This article walks through a process for classifying a set of images into a fixed number of categories, using labeled training images, the deep learning image recognition model "InceptionV3" and the RandomForest classification algorithm. The technologies used are TensorFlow and Spark on the Hadoop platform. It involves the following modules:

Technology Stack:

Spark 2.1.0, Python 3.5, HDP 2.6.1

Prepare the Training datasets:

1. Install the ImageMagick package using "yum install ImageMagick"

2. Obtain sample images from the customer, along with the label to be associated with each image

3. Extract the label (metadata) from each image using the "ImageMagick" tool and place the images into separate folders, one per label

prepareTrainingData.sh

This bash script extracts the rating metadata from each image and copies the image into the matching bin, category 1 through 5. Images without a usable rating are listed in UnlabeledImages.txt. The images were classified manually by visual inspection, and they are used to train the model.

#!/bin/bash

IMAGES="/home/arun/image-classification/Images"
UNLABELED="./UnlabeledImages.txt"
TRAININGDATA="/home/arun/image-classification/TrainingData"

# Start with a fresh list of unlabeled images; -f avoids an error if the file does not exist yet
rm -f "$UNLABELED"

# Skip the loop entirely if no images are found
shopt -s nullglob

# Walk every image in every sub-folder; globbing copes with spaces in file names
for img in "$IMAGES"/*/*
do
  # Read the xmp:Rating value from the image metadata and trim surrounding whitespace
  rating=$(identify -verbose "$img" | grep xmp:Rating | cut -d':' -f3 | awk '{$1=$1};1')
  case "$rating" in
      [1-5]) echo "this is cat $rating"
             # Create the category folder if needed, then copy the image into it
             mkdir -p "$TRAININGDATA/cat$rating"
             cp "$img" "$TRAININGDATA/cat$rating/"
             ;;
      *)     echo "this is something else"
             # Record the file name of any image without a rating from 1 to 5
             basename "$img" >> "$UNLABELED"
             ;;
  esac
done
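
The rating extraction in the script can also be written in Python, which may be easier to test in isolation. The sketch below assumes the "identify -verbose" output contains a line of the form "xmp:Rating: 3", which is what the cut -d':' -f3 call in the script implies. The helper names parse_rating and target_folder are hypothetical and are not part of ImageMagick.

```python
import re
from typing import Optional

# Matches a metadata line such as "    xmp:Rating: 3". The line format is an
# assumption derived from the cut -d':' -f3 call in the bash script.
RATING_RE = re.compile(r"^\s*xmp:Rating:\s*(\S+)\s*$", re.MULTILINE)


def parse_rating(identify_output: str) -> Optional[int]:
    """Return the rating (1 to 5) found in the metadata text, or None."""
    match = RATING_RE.search(identify_output)
    if match is None:
        return None
    value = match.group(1)
    if value in {"1", "2", "3", "4", "5"}:
        return int(value)
    return None


def target_folder(rating: Optional[int]) -> Optional[str]:
    """Map a rating to its training folder (cat1..cat5); None means unlabeled."""
    return None if rating is None else "cat{}".format(rating)


# Illustrative metadata text, not real ImageMagick output.
sample = "Image: photo.jpg\n  Properties:\n    xmp:Rating: 3\n"
print(target_folder(parse_rating(sample)))
```

In practice the metadata text could be obtained with something like subprocess.check_output(["identify", "-verbose", path]) and decoded to a string before it is passed to parse_rating.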

Classify Images:

1. Install the Python packages numpy, keras, tensorflow, nose, pillow, h5py and py4j on all the gateway and worker nodes of the cluster. You can use either pip or Anaconda for this.

2. Start a pyspark session and download the Spark deep learning library from Databricks, which runs on top of TensorFlow and uses the Python packages installed in step 1. This library provides an interface for reading images into a Spark dataframe, applying the InceptionV3 model, extracting features from the images, and so on.

3. In the pyspark session, read the images into a dataframe and split the images into training and test dataframes.

4. Create a Spark ML pipeline with two stages: 1) DeepImageFeaturizer 2) RandomForestClassifier

5. Execute the fit function to obtain a model

6. Predict using the model and calculate the prediction accuracy
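
Before starting the session, it can help to confirm that the packages from step 1 are importable on a given node. The following is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical, and the import names (for example "PIL" for the pillow package) are the usual ones but are worth verifying against your environment.

```python
import importlib.util

# Import names for the packages listed in step 1. The pillow package is
# normally imported as "PIL"; the others use their package names.
REQUIRED = ["numpy", "keras", "tensorflow", "nose", "PIL", "h5py", "py4j"]


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are importable")
```

Running this with the same interpreter that PYSPARK_PYTHON points to, on each gateway and worker node, should show whether any package still needs to be installed there.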

### Fire up a pyspark session
export PYSPARK_PYTHON=/opt/anaconda3/bin/python3
export SPARK_HOME=/usr/hdp/current/spark2-client
$SPARK_HOME/bin/pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 --master yarn --executor-memory 3g --driver-memory 5g --conf spark.yarn.executor.memoryOverhead=5120

### Add the spark deep-learning jars into the classpath
import sys,glob,os
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"),".ivy2/jars/*.jar")))

### PySpark code to read images, create spark ml pipeline, train the model & predict
from sparkdl import readImages
from pyspark.sql.functions import lit
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

img_dir = "/user/arun/TrainingData"

cat1_df = readImages(img_dir + "/cat1").withColumn("label", lit(1))
cat2_df = readImages(img_dir + "/cat2").withColumn("label", lit(2))
cat3_df = readImages(img_dir + "/cat3").withColumn("label", lit(3))
cat4_df = readImages(img_dir + "/cat4").withColumn("label", lit(4))
cat5_df = readImages(img_dir + "/cat5").withColumn("label", lit(5))

# Split the images so that 90% of them go to training data and 10% go to test data

cat1_train, cat1_test = cat1_df.randomSplit([0.9, 0.1])
cat2_train, cat2_test = cat2_df.randomSplit([0.9, 0.1])
cat3_train, cat3_test = cat3_df.randomSplit([0.9, 0.1])
cat4_train, cat4_test = cat4_df.randomSplit([0.9, 0.1])
cat5_train, cat5_test = cat5_df.randomSplit([0.9, 0.1])

# union replaces unionAll, which is deprecated since Spark 2.0
train_df = cat1_train.union(cat2_train).union(cat3_train).union(cat4_train).union(cat5_train)
test_df = cat1_test.union(cat2_test).union(cat3_test).union(cat4_test).union(cat5_test)

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

p = Pipeline(stages=[featurizer, rf])
p_model = p.fit(train_df)

predictions = p_model.transform(test_df)
predictions.select("filePath", "label", "prediction").show(200,truncate=False)
preds_vs_labels = predictions.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("accuracy of predictions by model = " + str(evaluator.evaluate(preds_vs_labels)))

# TRY TO CLASSIFY CAT 5 IMAGES, AND SEE HOW CLOSE THEY GET IN PREDICTING
cat5_imgs = readImages(img_dir + "/cat5").withColumn("label", lit(5))
pred5 = p_model.transform(cat5_imgs)
pred5.select("filePath","label","prediction").show(200,truncate=False)
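
To make the accuracy figure easier to interpret, the sketch below shows in plain Python what an accuracy metric computes: the fraction of rows whose prediction equals the label, together with a breakdown per label. It illustrates the idea and is not Spark's implementation. The function names and the sample pairs are made up for illustration.

```python
from collections import defaultdict


def accuracy(pairs):
    """Fraction of (prediction, label) pairs in which the two values agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)


def per_label_accuracy(pairs):
    """Accuracy computed separately for each true label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for prediction, label in pairs:
        totals[label] += 1
        if prediction == label:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical (prediction, label) pairs, not results from the model above.
sample = [(1.0, 1), (2.0, 2), (2.0, 3), (5.0, 5), (4.0, 5)]
print(accuracy(sample))
print(per_label_accuracy(sample))
```

For a small test set, the pairs could be taken from the session with predictions.select("prediction", "label").collect(). The per-label view can show whether one category is predicted much less reliably than the others, which a single overall accuracy value hides.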

Reference: https://medium.com/linagora-engineering/making-image-classification-simple-with-spark-deep-learning-...
