Image Clustering Using NimbusML Pipeline

NimbusML implements TensorFlowScorer, which allows pretrained deep neural network models to be used as featurizers. Any intermediate output of the model can be used as a transform of the image pixel values.

In this example, we develop a clustering model using a NimbusML pipeline to group images into 10 groups (clusters). The images are downloaded from Wikimedia Commons and English Wikipedia; for more details about the images, please refer to the readme. The image files are loaded with the NimbusML image loader and processed with a pretrained TensorFlow (https://www.tensorflow.org/) deep neural network (DNN) model (here, AlexNet) for feature extraction.

Note:

The user needs to download the AlexNet TensorFlow model from here, extract "alexnet_frozen.pb", and put it in the same directory as the notebook.

Preparing Data

The NimbusML image loader takes as input a column of a pandas DataFrame that contains the full path to each image. Therefore, the user needs to prepare a csv/tsv file that includes the path information. For classification, the label should be in the same file.
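As a minimal sketch of such a file, the snippet below builds a small `images/<label>/` tree in a temporary directory (placeholder files stand in for real .jpg images; the folder names and file names are hypothetical) and writes a tsv with one row per image:

```python
import os
import tempfile
import pandas as pd

# Hypothetical layout: images/<label>/<file>.jpg under a temp directory.
root = tempfile.mkdtemp()
for label in ["dog", "fruit"]:
    folder = os.path.join(root, "images", label)
    os.makedirs(folder)
    open(os.path.join(folder, label + "_0.jpg"), "w").close()

# Collect one row per image: the label and the full path to the file.
rows = []
for label in ["dog", "fruit"]:
    folder = os.path.join(root, "images", label)
    for name in sorted(os.listdir(folder)):
        rows.append({"Label": label, "ImagePath": os.path.join(folder, name)})

df = pd.DataFrame(rows)
df.to_csv(os.path.join(root, "image_paths.tsv"), sep="\t", index=False)
```

The resulting tsv has a "Label" column for classification and an "ImagePath" column that the loader can consume.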

import os
import pandas as pd
import numpy as np
import math
import requests
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image
from io import BytesIO
from nimbusml import Pipeline
from nimbusml.feature_extraction.image import Loader, Resizer, PixelExtractor
from nimbusml.preprocessing import TensorFlowScorer
from nimbusml.decomposition import PcaTransformer
from nimbusml.preprocessing.schema import ColumnDropper
from nimbusml.cluster import KMeansPlusPlus
# Load image summary data from github
url = "https://express-tlcresources.azureedge.net/datasets/DogBreedsVsFruits/DogFruitWiki.SHUF.117KB.735-rows.tsv"
df_train = pd.read_csv(url, sep = "\t", nrows = 100)
df_train['ImagePath_full'] = "https://express-tlcresources.azureedge.net/datasets/DogBreedsVsFruits/" + \
                         df_train['ImagePath']
df_train.head()
Label Title Url ImagePath ImagePath_full
0 dog Bearded Collie https://upload.wikimedia.org/wikipedia/commons... images\dog\Bearded_Collie_600.jpg https://express-tlcresources.azureedge.net/dat...
1 fruit Muntries https://upload.wikimedia.org/wikipedia/commons... images\fruit\1200px-Kunzea_pomifera_flowers.jpg https://express-tlcresources.azureedge.net/dat...
2 dog Griffon Nivernais https://upload.wikimedia.org/wikipedia/commons... images\dog\Griffon_nivernais.jpg https://express-tlcresources.azureedge.net/dat...
3 fruit Ziziphus https://upload.wikimedia.org/wikipedia/commons... images\fruit\1200px-Zizyphus_zizyphus_Ypey54.jpg https://express-tlcresources.azureedge.net/dat...
4 fruit Papaya https://upload.wikimedia.org/wikipedia/commons... images\fruit\Carica_papaya_-_Köhler–s_Medizina... https://express-tlcresources.azureedge.net/dat...
# Download images from url, save to local image_temp folder, update the full path to local directory
base_path = os.path.join(os.getcwd(),"image_temp") 
base_path_dog = os.path.join(base_path,"images","dog")
base_path_fruit = os.path.join(base_path,"images","fruit")

for path in [base_path,base_path_dog, base_path_fruit]:
    if not os.path.exists(path):
        os.makedirs(path)

for idx, row in df_train.iterrows():
    try:
        response = requests.get(row["ImagePath_full"])
        Image.open(BytesIO(response.content)).save(os.path.join(base_path, row["ImagePath"]))
        df_train.loc[idx, 'ImagePath'] = os.path.join(base_path, row["ImagePath"])
        if idx % 20 == 0:
            print("Downloading " + str(idx) + "/" + str(len(df_train)) + " images...")
    except Exception:
        # Drop rows whose image could not be downloaded or decoded
        df_train.drop(idx, inplace=True)
df_train.head()
print("Done")
Downloading 0/100 images...
Downloading 20/100 images...
Downloading 40/100 images...
Downloading 60/100 images...
Downloading 80/100 images...
Done

The "ImagePath" column that includes the full image path can be passed to the NimbusML image loader.

Training Model

In order to extract image features using the deep learning model, four transformations are needed:

1. Loader: load the image files from the "ImagePath" column of the input data
2. Resizer: since the pretrained AlexNet model expects images with width and height 227, we need to resize the images
3. PixelExtractor: extract the pixel values of each image into numeric features
4. TensorFlowScorer: apply the DNN model to the extracted features

loader = Loader(columns = {'Placeholder':'ImagePath'}) # columns = {output_col_name:input_col_name}
resizer = Resizer(image_width=227, 
                  image_height=227, 
                  columns = ['Placeholder'])  # equivalent to columns = {'Placeholder':'Placeholder'}
pix_extractor = PixelExtractor(columns = ['Placeholder'],
                               interleave = True)
dnn_featurizer = TensorFlowScorer(
                                  model_location=r'alexnet_frozen.pb',
                                  columns={'Relu_1': 'Placeholder'}
                                  )
drop_input = ColumnDropper(columns = ['Placeholder'])

We first create a pipeline containing only these transformations to inspect the constructed image features.

ppl1 = Pipeline([loader, resizer, pix_extractor, dnn_featurizer, drop_input])
transformed = ppl1.fit_transform(df_train) 
transformed.head()
Label Title Url ImagePath ImagePath_full Relu_1.0 Relu_1.1 Relu_1.2 Relu_1.3 Relu_1.4 ... Relu_1.4086 Relu_1.4087 Relu_1.4088 Relu_1.4089 Relu_1.4090 Relu_1.4091 Relu_1.4092 Relu_1.4093 Relu_1.4094 Relu_1.4095
0 dog Bearded Collie https://upload.wikimedia.org/wikipedia/commons... C:\Users\v-tshuan\Programs\NimbusML-Samples\sa... https://express-tlcresources.azureedge.net/dat... 0.0 0.000000 0.000000 1.420249 0.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.0
1 fruit Muntries https://upload.wikimedia.org/wikipedia/commons... C:\Users\v-tshuan\Programs\NimbusML-Samples\sa... https://express-tlcresources.azureedge.net/dat... 0.0 0.000000 0.000000 0.000000 0.0 ... 0.000000 3.416956 0.0 0.0 0.464706 0.000000 0.000000 0.0 0.0 0.0
2 dog Griffon Nivernais https://upload.wikimedia.org/wikipedia/commons... C:\Users\v-tshuan\Programs\NimbusML-Samples\sa... https://express-tlcresources.azureedge.net/dat... 0.0 0.088755 0.811384 0.000000 0.0 ... 2.170286 0.000000 0.0 0.0 0.000000 0.000000 1.110376 0.0 0.0 0.0
3 fruit Ziziphus https://upload.wikimedia.org/wikipedia/commons... C:\Users\v-tshuan\Programs\NimbusML-Samples\sa... https://express-tlcresources.azureedge.net/dat... 0.0 1.692611 0.000000 0.000000 0.0 ... 0.000000 1.000454 0.0 0.0 0.000000 9.258146 0.000000 0.0 0.0 0.0
4 fruit Papaya https://upload.wikimedia.org/wikipedia/commons... C:\Users\v-tshuan\Programs\NimbusML-Samples\sa... https://express-tlcresources.azureedge.net/dat... 0.0 0.000000 0.000000 0.000000 0.0 ... 0.000000 0.000000 0.0 0.0 0.523427 1.542979 0.000000 0.0 0.0 0.0

5 rows × 4101 columns

We can see that, for each image, 4096 features (the output of the 'Relu_1' node) were extracted. We then create a full pipeline with the KMeansPlusPlus clustering learner at the end.

# Creating full pipeline
pca = PcaTransformer(rank = 600, columns = ['Relu_1']) # Add PCA to reduce dimensionality 
kmeansplusplus = KMeansPlusPlus(n_clusters = 10, feature = ['Relu_1'])

ppl = Pipeline([loader, resizer, pix_extractor, dnn_featurizer, pca, kmeansplusplus]) 

Notice that for clustering methods, no label is needed. However, NimbusML currently requires a label column as input, so we pass the 'Label' column from the input data; it is not used by the algorithm.

# Training pipeline
ppl.fit(df_train) # no y label should be required

# Generating clustering result
result = ppl.predict(df_train);
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Initializing centroids
Centroids initialized, starting main trainer
Model trained successfully on 100 instances
Not training a calibrator because it is not needed.
Elapsed time: 00:01:15.6327614

The predicted cluster and the scores for each cluster are generated with the .predict() function.

result.head()
PredictedLabel Score.0 Score.1 Score.2 Score.3 Score.4 Score.5 Score.6 Score.7 Score.8 Score.9
0 8 47.443436 162.166412 233.463028 123.399414 206.769928 225.416779 222.125076 131.254929 25.495193 89.177101
1 1 102.845108 41.596313 105.146790 144.646729 121.247070 64.507538 162.006592 53.935581 73.810242 146.395020
2 0 35.323914 170.414337 248.145691 79.945480 197.487396 238.013458 262.897583 146.969086 62.751156 52.821465
3 1 260.027527 84.931091 101.239594 256.678467 149.556900 121.958710 266.289154 159.874390 248.470612 301.049408
4 1 121.527527 33.169621 55.411705 131.094177 101.599472 71.441795 134.034302 54.106560 118.840202 159.305237
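As a sanity check on the table above: the predicted cluster should be the one with the smallest score, since the KMeansPlusPlus scores measure the distance from each example to each cluster centroid. A minimal sketch using the scores of row 0:

```python
import numpy as np

# Scores for row 0 from the table above (distance to each of the 10 centroids).
scores = np.array([47.443436, 162.166412, 233.463028, 123.399414, 206.769928,
                   225.416779, 222.125076, 131.254929, 25.495193, 89.177101])

# The predicted label is the index of the nearest centroid.
predicted = int(np.argmin(scores))
```

For row 0, the smallest score is at index 8, matching the PredictedLabel column.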

Evaluation

We evaluate the clustering performance using the Dunn index (DI). A high DI indicates a set of compact clusters, with small variance within clusters and large distances between clusters. It is calculated as the ratio of the minimum inter-cluster distance to the maximum cluster diameter, i.e. the maximum distance between the two farthest points inside the same cluster.
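A toy computation of this definition, on two hypothetical well-separated 2D clusters (not the image features from this notebook), shows how a large separation and small diameters yield a high DI:

```python
import numpy as np

# Two well-separated toy clusters in 2D.
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
cluster_b = np.array([[10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

def diameter(points):
    # Maximum pairwise distance within one cluster.
    return max(np.linalg.norm(p - q) for p in points for q in points)

# Minimum distance between points of different clusters.
min_inter = min(np.linalg.norm(p - q) for p in cluster_a for q in cluster_b)
max_diam = max(diameter(cluster_a), diameter(cluster_b))

dunn = min_inter / max_diam  # well above 1 for these clusters
```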

# plot clustering results
def plotClusters(results, df_train, label_name, plot_cluster_nums):
    figure_count = 0
    for plot_cluster_num in plot_cluster_nums:
        print("Cluster " + str(plot_cluster_num))
        image_files = list(df_train.loc[results[label_name] == plot_cluster_num]['ImagePath'])
        n_row = math.floor(math.sqrt(len(image_files)))
        n_col = math.ceil(len(image_files)/n_row)
        fig = plt.figure(figure_count)
        fig.canvas.set_window_title(str(plot_cluster_num))
        for i in range(len(image_files)):
            plt.subplot(n_row, n_col, i+1)
            plt.axis('off')
            plt.imshow(mpimg.imread(image_files[i]))
        figure_count += 1
        plt.show()

# computes the maximum pairwise distance (diameter) within a cluster
def intraclusterDist(cluster_values):
    max_dist = 0.0
    for i in range(len(cluster_values)):
        for j in range(i + 1, len(cluster_values)):
            dist = np.linalg.norm(cluster_values[i] - cluster_values[j])
            if dist > max_dist:
                max_dist = dist
    return max_dist

# compute Dunn Index for the clustering results
def computeDunnIndex(features, labels, n_clusters):  
    cluster_centers = [np.mean(features.loc[labels == i]) for i in range(n_clusters)]
    index = float('inf')  
    max_intra_dist = 0.0
    # find maximum intracluster distance across all clusters
    for i in range(len(cluster_centers)):
        cluster_values = np.array(features.loc[labels == i])
        intracluster_d = float(intraclusterDist(cluster_values))
        if intracluster_d > max_intra_dist:
            max_intra_dist = intracluster_d

    # perform minimization of ratio
    for i in range(len(cluster_centers)):
        inner_min = float('inf')
        for j in range(len(cluster_centers)):
            if i != j:
                intercluster_d = float(np.linalg.norm(cluster_centers[i]-cluster_centers[j]))
                ratio = intercluster_d/max_intra_dist
                if ratio < inner_min:
                    inner_min = ratio
        if inner_min < index:
            index = inner_min
    return index, pd.DataFrame(cluster_centers)
# Compute the Dunn Index on the clustering results (omitting the first
# five columns, which contain text labels and file paths).
DI2, cluster_centers = computeDunnIndex(transformed.iloc[:,5:4098], result['PredictedLabel'], 10)
print('Dunn Index Value: ' + str(round(DI2,2)))
Dunn Index Value: 0.27
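The doubly nested Python loop in intraclusterDist above is O(n^2) per cluster; the same diameter can be computed in one vectorized call with scipy's pdist (assuming scipy is installed; it is not otherwise required by this notebook):

```python
import numpy as np
from scipy.spatial.distance import pdist

# pdist returns the condensed vector of all pairwise Euclidean distances,
# so its maximum is the cluster diameter.
def intracluster_dist_fast(cluster_values):
    if len(cluster_values) < 2:
        return 0.0
    return float(pdist(cluster_values).max())

# Toy example: the farthest pair is (0,0) and (3,4), at distance 5.0.
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
```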
# We plot three clusters with ids 1, 2 and 3 from the result.
plotClusters(result, df_train, 'PredictedLabel', [1,2,3])
Cluster 1

[image: grid of photos assigned to cluster 1]

Cluster 2

[image: grid of photos assigned to cluster 2]

Cluster 3

[image: grid of photos assigned to cluster 3]

Cluster Center

In this section, we visualize the cluster centers in a 2D plane.

# Dimensional reduction to 2D array using PCA
cluster_centers_2 = PcaTransformer(rank = 2, center = False, \
                                   columns = {'pca':list(cluster_centers.columns)}).fit_transform(cluster_centers)
cluster_centers_2
Relu_1.0 Relu_1.1 Relu_1.2 Relu_1.3 Relu_1.4 Relu_1.5 Relu_1.6 Relu_1.7 Relu_1.8 Relu_1.9 ... Relu_1.4085 Relu_1.4086 Relu_1.4087 Relu_1.4088 Relu_1.4089 Relu_1.4090 Relu_1.4091 Relu_1.4092 pca.0 pca.1
0 0.234133 1.046287 0.502215 0.339462 0.000000 0.003935 0.289966 0.136332 0.508433 0.344398 ... 0.688653 0.271286 0.000000 1.310596 1.447892 0.148210 0.139664 0.302114 38.958282 -35.167919
1 0.269257 0.472427 2.402555 0.099881 0.093357 1.066316 0.382131 0.372242 0.000000 0.000000 ... 0.000767 0.526597 1.998919 0.213225 0.000000 1.000880 0.832421 0.828464 44.349316 1.083801
2 0.000000 0.000000 1.955348 0.000000 0.000000 0.321844 0.000000 0.000000 0.000000 0.000000 ... 1.095373 1.540617 0.000000 0.000000 0.000000 0.000000 3.631396 0.000000 66.773453 33.720440
3 0.304907 0.000000 1.796412 1.330259 0.997015 1.088628 0.023826 0.154687 0.287960 0.474332 ... 0.303217 0.652466 0.054582 0.470122 0.268686 0.000000 0.076635 0.174167 35.454399 -31.305775
4 0.000000 5.685608 3.738891 0.000000 0.000000 0.000000 0.000000 1.349082 0.000000 1.009326 ... 0.000000 0.000000 7.092070 0.141207 0.000000 5.558882 0.560674 0.000000 40.822681 -26.462284
5 0.510085 0.000000 1.072526 0.000000 0.000000 0.571546 0.000000 0.000000 0.000000 0.000000 ... 0.000000 3.328192 0.886580 0.000000 0.000000 2.056593 0.234584 0.075302 54.611805 9.981750
6 0.000000 0.000000 1.090464 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.407407 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 64.245621 45.666008
7 0.187290 0.295880 0.803294 0.648739 0.292867 0.357802 0.000000 0.236788 0.224239 0.011484 ... 0.124612 1.912379 0.000000 0.303533 0.501163 0.000000 0.094828 0.040593 43.860363 8.995897
8 0.000000 0.000000 0.024775 0.210110 0.000000 0.000000 0.000000 0.350310 0.000000 0.118182 ... 0.913040 0.153423 0.000000 0.146180 0.822859 0.000000 0.023593 0.000000 36.963867 -25.331329
9 0.000000 0.000000 3.164072 0.914441 0.000000 0.098782 0.300779 0.000000 0.231882 0.000000 ... 0.451272 0.000000 0.000000 0.415952 0.000000 0.000000 0.000000 0.376113 34.661339 -45.246220

10 rows × 4095 columns

# Visualize
fig = plt.figure(1, figsize=(10,10))
plt.xlabel("X1")
plt.ylabel("X2")
for i in range(10):
    # index of the first image assigned to cluster i
    first_idx = result.index[result['PredictedLabel'] == i][0]
    file_name = df_train['ImagePath'][first_idx]
    image = np.array(Image.open(file_name).resize((30,30)))
    figs = fig.figimage(image, (cluster_centers_2['pca.0'][i] - 30) * 20, (cluster_centers_2['pca.1'][i] + 60) * 3)
    figs.set_zorder(20)
plt.show();

[image: cluster centers plotted in the 2D PCA plane, with a sample image at each center]