How to interpret the clustering result of "mahout kmeans"


 

I am clustering data with "mahout kmeans."
My first test data set (sample.arff) is as follows:

 

@RELATION rfm
@ATTRIBUTE recency NUMERIC
@ATTRIBUTE frequency NUMERIC
@ATTRIBUTE money NUMERIC
@ATTRIBUTE location NUMERIC
@ATTRIBUTE position NUMERIC
@DATA
0.472,0.275,0.099,0.952,0.047,
0.000,0.824,0.936,0.214,0.000,
0.000,0.537,0.656,0.591,0.000,
....
0.908,0.000,0.000,0.078,0.136,
0.134,0.000,0.000,0.781,0.160,
0.302,0.000,0.000,0.513,0.715,
0.472,0.000,0.000,0.749,0.047,

The file format is ARFF.

Each row is a 5-dimensional vector, and most of these vectors contain zero values.

 

I converted the ARFF file to the vector format that "mahout kmeans" takes as input:

 

mahout arff.vector \
--input sample.arff \
--output sample.vector \
--dictOut sample.dict

 

The resultant file (sample.vector) is as follows:

 

Key: 0: Value: {0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}
Key: 1: Value: {1:0.824,2:0.936,3:0.214}
Key: 2: Value: {1:0.537,2:0.656,3:0.591}
Key: 3: Value: {1:0.954,2:0.253,3:0.721}
Key: 4: Value: {1:0.187,2:0.735,3:0.782}
Key: 5: Value: {1:0.517,2:0.276,3:0.096}
Key: 6: Value: {1:0.189,2:0.127,3:0.517}
...
Key: 993: Value: {0:0.662,3:0.218,4:0.69}
Key: 994: Value: {0:0.56,3:0.682,4:0.153}
Key: 995: Value: {0:0.788,3:0.929,4:0.967}
Key: 996: Value: {0:0.908,3:0.078,4:0.136}
Key: 997: Value: {0:0.134,3:0.781,4:0.16}
Key: 998: Value: {0:0.302,3:0.513,4:0.715}
Key: 999: Value: {0:0.472,3:0.749,4:0.047}


Each vector is written as index:value pairs, and the zero-valued components are omitted.

Using this file, I ran "mahout kmeans":
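
To make the correspondence with the ARFF rows explicit, here is a minimal Python sketch that expands one line of the listing above back into a 5-dimensional vector. The helper name `parse_sparse_line` and the default `dim=5` are my own assumptions for illustration (5 being the number of ARFF attributes); this is not part of Mahout.

```python
import re


def parse_sparse_line(line, dim=5):
    """Expand a line like 'Key: 1: Value: {1:0.824,2:0.936,3:0.214}'
    into (key, dense_list). Components that are not listed are taken to be 0.0.
    `dim` is assumed to be 5, the number of attributes in sample.arff."""
    match = re.match(r"Key:\s*(\d+):\s*Value:\s*\{(.*)\}", line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    key = int(match.group(1))
    dense = [0.0] * dim
    body = match.group(2).strip()
    if body:
        for pair in body.split(","):
            index, value = pair.split(":")
            dense[int(index)] = float(value)
    return key, dense


key, dense = parse_sparse_line("Key: 1: Value: {1:0.824,2:0.936,3:0.214}")
print(key, dense)  # 1 [0.0, 0.824, 0.936, 0.214, 0.0]
```

The result matches the second ARFF data row (0.000,0.824,0.936,0.214,0.000), so the omitted components appear to be exactly the zero values.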

 

mahout kmeans \
--input sample.vector \
--output kmeans-output \
--maxIter 5 \
--numClusters 10 \
--clusters null-cluster \
--clustering \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure

After typing the following command,

mahout clusterdump \
        --input kmeans-output/clusters-1-final \
        --output kmeans-dump.text

I got the result (kmeans-dump.text) shown below:

 

VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}
VL-677{n=57 c=[0.445, 0.145, 0.839] r=[0.271, 0.099, 0.097]}
VL-429{n=40 c=[0.117, 0.768, 0.674] r=[0.078, 0.156, 0.159]}
VL-801{n=92 c=[0.318, 0.016, 0.007, 0.810, 0.191] r=[0.238, 0.060, 0.023, 0.137, 0.155]}
VL-322{n=55 c=[0.605, 0.872, 0.380] r=[0.217, 0.083, 0.204]}
VL-725{n=88 c=[0.351, 0.559, 0.760] r=[0.197, 0.206, 0.153]}
VL-197{n=176 c=[0.500, 0.482, 0.774] r=[0.264, 0.260, 0.141]}
VL-438{n=159 c=[0.618, 0.351, 0.288] r=[0.215, 0.203, 0.163]}
VL-58{n=54 c=[0.157, 0.515, 0.211] r=[0.102, 0.229, 0.143]}
VL-971{n=117 c=[0.339, 0.014, 0.007, 0.195, 0.282] r=[0.252, 0.052, 0.025, 0.133, 0.192]}

Each row shows:

  1. c : the coordinates of the cluster centroid
  2. r : the radius of the cluster
  3. n : the number of points assigned to the cluster
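
To show the mismatch concretely, the following Python sketch splits a dump line into n, c and r and reports how many values each centroid carries. The regular expression and the function name `parse_cluster_line` are assumptions based only on the lines shown above, not on any documented clusterdump format.

```python
import re

# Pattern inferred from the dump lines above, e.g.
# VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}
CLUSTER_LINE = re.compile(
    r"^(?P<id>\S+?)\{n=(?P<n>\d+) c=\[(?P<c>[^\]]*)\] r=\[(?P<r>[^\]]*)\]\}$"
)


def to_floats(text):
    """Convert '0.733, 0.608, 0.563' into a list of floats."""
    return [float(item) for item in text.split(",") if item.strip()]


def parse_cluster_line(line):
    """Return the cluster id, point count n, centroid c and radius r of one line."""
    match = CLUSTER_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "id": match.group("id"),
        "n": int(match.group("n")),
        "c": to_floats(match.group("c")),
        "r": to_floats(match.group("r")),
    }


lines = [
    "VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}",
    "VL-801{n=92 c=[0.318, 0.016, 0.007, 0.810, 0.191] r=[0.238, 0.060, 0.023, 0.137, 0.155]}",
]
for line in lines:
    cluster = parse_cluster_line(line)
    print(cluster["id"], cluster["n"], len(cluster["c"]))
# VL-648 172 3
# VL-801 92 5
```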

I expected every centroid to be a 5-dimensional vector, but that is not the case.
Some centroids have 5 elements, while others have only 3.

Could you tell me how to interpret this result?