
How to interpret the clustering result of "mahout kmeans"

 

I am clustering data with "mahout kmeans".
My first test data set (sample.arff) looks like this:

 

@RELATION rfm
@ATTRIBUTE recency NUMERIC
@ATTRIBUTE frequency NUMERIC
@ATTRIBUTE money NUMERIC
@ATTRIBUTE location NUMERIC
@ATTRIBUTE position NUMERIC
@DATA
0.472,0.275,0.099,0.952,0.047,
0.000,0.824,0.936,0.214,0.000,
0.000,0.537,0.656,0.591,0.000,
....
0.908,0.000,0.000,0.078,0.136,
0.134,0.000,0.000,0.781,0.160,
0.302,0.000,0.000,0.513,0.715,
0.472,0.000,0.000,0.749,0.047,

The file format is ARFF.

Each row is a 5-dimensional vector, and most of these vectors contain zero values.

 

I converted the ARFF file to the vector format that "mahout kmeans" expects:

 

mahout arff.vector \
--input sample.arff \
--output sample.vector \
--dictOut sample.dict

 

The resultant file (sample.vector) is as follows:

 

Key: 0: Value: {0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}
Key: 1: Value: {1:0.824,2:0.936,3:0.214}
Key: 2: Value: {1:0.537,2:0.656,3:0.591}
Key: 3: Value: {1:0.954,2:0.253,3:0.721}
Key: 4: Value: {1:0.187,2:0.735,3:0.782}
Key: 5: Value: {1:0.517,2:0.276,3:0.096}
Key: 6: Value: {1:0.189,2:0.127,3:0.517}
...
Key: 993: Value: {0:0.662,3:0.218,4:0.69}
Key: 994: Value: {0:0.56,3:0.682,4:0.153}
Key: 995: Value: {0:0.788,3:0.929,4:0.967}
Key: 996: Value: {0:0.908,3:0.078,4:0.136}
Key: 997: Value: {0:0.134,3:0.781,4:0.16}
Key: 998: Value: {0:0.302,3:0.513,4:0.715}
Key: 999: Value: {0:0.472,3:0.749,4:0.047}


Each vector is stored in a dictionary-like format (index:value), and the zero entries of the original row are omitted.

Using this file, I ran "mahout kmeans":
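To make the mapping concrete, here is a minimal Python sketch of how a dense ARFF row relates to the index:value form shown above. It is only an illustration of the idea, not Mahout's actual code; the helper names `to_sparse` and `to_dense` are made up for this example.

```python
def to_sparse(row):
    """Return {index: value} for the non-zero entries of a dense row."""
    return {i: v for i, v in enumerate(row) if v != 0.0}


def to_dense(sparse, size):
    """Rebuild the dense row, filling missing indices with 0.0."""
    return [sparse.get(i, 0.0) for i in range(size)]


# Second @DATA row of sample.arff (Key: 1 in sample.vector)
row = [0.000, 0.824, 0.936, 0.214, 0.000]

sparse = to_sparse(row)
print(sparse)               # {1: 0.824, 2: 0.936, 3: 0.214}
print(to_dense(sparse, 5))  # [0.0, 0.824, 0.936, 0.214, 0.0]
```

So a vector that prints with only three entries is still 5-dimensional; the missing indices are simply zero.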

 

mahout kmeans \
--input sample.vector \
--output kmeans-output \
--maxIter 5 \
--numClusters 10 \
--clusters null-cluster \
--clustering \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure

After running the following command,

mahout clusterdump \
        --input kmeans-output/clusters-1-final \
        --output kmeans-dump.text

I got the result (kmeans-dump.text) shown below:

 

VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}
VL-677{n=57 c=[0.445, 0.145, 0.839] r=[0.271, 0.099, 0.097]}
VL-429{n=40 c=[0.117, 0.768, 0.674] r=[0.078, 0.156, 0.159]}
VL-801{n=92 c=[0.318, 0.016, 0.007, 0.810, 0.191] r=[0.238, 0.060, 0.023, 0.137, 0.155]}
VL-322{n=55 c=[0.605, 0.872, 0.380] r=[0.217, 0.083, 0.204]}
VL-725{n=88 c=[0.351, 0.559, 0.760] r=[0.197, 0.206, 0.153]}
VL-197{n=176 c=[0.500, 0.482, 0.774] r=[0.264, 0.260, 0.141]}
VL-438{n=159 c=[0.618, 0.351, 0.288] r=[0.215, 0.203, 0.163]}
VL-58{n=54 c=[0.157, 0.515, 0.211] r=[0.102, 0.229, 0.143]}
VL-971{n=117 c=[0.339, 0.014, 0.007, 0.195, 0.282] r=[0.252, 0.052, 0.025, 0.133, 0.192]}

Each row shows:

  1. c : the coordinates of the cluster centroid
  2. r : the radius of the cluster
  3. n : the number of points in the cluster
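For reference, this is how I understand the three quantities, written as a small Python sketch. It assumes that r is a per-dimension standard deviation around the centroid; that is my assumption about what clusterdump reports, not something I have confirmed. The function name `summarize` and the two sample points are hypothetical.

```python
import math


def summarize(points):
    """Return (n, c, r) for a list of equal-length dense points.

    n: number of points
    c: per-dimension mean (the centroid)
    r: per-dimension population standard deviation (assumed meaning of r)
    """
    n = len(points)
    dims = len(points[0])
    c = [sum(p[d] for p in points) / n for d in range(dims)]
    r = [
        math.sqrt(sum((p[d] - c[d]) ** 2 for p in points) / n)
        for d in range(dims)
    ]
    return n, c, r


# Two made-up 2-dimensional points, just to show the arithmetic
n, c, r = summarize([[0.0, 1.0], [2.0, 3.0]])
print(n, c, r)  # 2 [1.0, 2.0] [1.0, 1.0]
```

Under this reading, c and r should always have as many entries as the input vectors have dimensions.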

I expected all centroids to be 5-dimensional vectors, but they are not:
some centroids have 5 elements, while the others have only 3.

Could you tell me how to interpret this result?
