How to print data after canopy clustering

Hi Experts,

Here is a simple piece of code that I wrote:


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;


public class Clustering {

    private final static String root = "C:\\root\\BI\\";
    private final static String dataDir = root + "synthetic_control.data";
    private final static String seqDir = root + "synthetic_control.seq";
    private final static String outputDir = root + "output";
    private final static String partMDir = outputDir + "\\"
            + Cluster.CLUSTERED_POINTS_DIR + "\\part-m-0";

    private final static String SEPARATOR = " ";

    private final static int NUMBER_OF_ELEMENTS = 2;

    private Configuration conf;
    private FileSystem fs;

    public Clustering() throws IOException {
        conf = new Configuration();
        fs = FileSystem.get(conf);
    }

    public void convertToVectorFile() throws IOException {

        BufferedReader reader = new BufferedReader(new FileReader(dataDir));
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
                new Path(seqDir), LongWritable.class, VectorWritable.class);

        String line;
        long counter = 0;
        while ((line = reader.readLine()) != null) {
            String[] c;
            c = line.split(SEPARATOR);
            double[] d = new double[c.length];
            for (int i = 0; i < NUMBER_OF_ELEMENTS; i++) {
                try {
                    d[i] = Double.parseDouble(c[i]);

                } catch (Exception ex) {
                    d[i] = 0;
                }
            }

            Vector vec = new RandomAccessSparseVector(c.length);
            vec.assign(d);

            VectorWritable writable = new VectorWritable();
            writable.set(vec);
            writer.append(new LongWritable(counter++), writable);
        }
        writer.close();
    }

    public void createClusters(double t1, double t2,
            double clusterClassificationThreshold, boolean runSequential)
            throws ClassNotFoundException, IOException, InterruptedException {

        EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
        Path inputPath = new Path(seqDir);
        Path outputPath = new Path(outputDir);

        CanopyDriver.run(inputPath, outputPath, measure, t1, t2, runSequential,
                clusterClassificationThreshold, runSequential);
    }

    public void printClusters() throws IOException {
        SequenceFile.Reader readerSequence = new SequenceFile.Reader(fs,
                new Path(partMDir), conf);

        IntWritable key = new IntWritable();
        WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
        while (readerSequence.next(key, value)) {
            System.out.println(value.toString() + " belongs to cluster "
                    + key.toString());
        }
        readerSequence.close();
    }
}

There are three methods here.

A. convertToVectorFile()

This method takes the file C:\root\BI\synthetic_control.data and converts it into a sequence file of vectors (I was following the book Mahout in Action).

For file:

0.01 1.0
0.1 0.9
0.1 0.95
12.0 13.0
12.5 12.8

it generated the following structure for me:

>tree /F
C:.
.synthetic_control.seq.crc
synthetic_control.data
synthetic_control.seq

with this log in Eclipse:

DEBUG Groups - Creating new Groups object
DEBUG Groups - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
DEBUG UserGroupInformation - hadoop login
DEBUG UserGroupInformation - hadoop login commit
DEBUG UserGroupInformation - using local user:NTUserPrincipal : xxxxxxxx
DEBUG UserGroupInformation - UGI loginUserxxxxxxx
DEBUG FileSystem - Creating filesystem for file:///
DEBUG NativeCodeLoader - Trying to load the custom-built native-hadoop library...
DEBUG NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
DEBUG NativeCodeLoader - java.library.path=C:\Program Files\Java\jre7\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\MATLAB\R2009b\runtime\win64;C:\Program Files\MATLAB\R2009b\bin;C:\Program Files\TortoiseSVN\bin;C:\Users\xxxxxxxx\Documents\apache-maven-3.1.1\bin;.
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
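
A side note on the parsing loop in convertToVectorFile(): it sizes its arrays from c.length but always iterates NUMBER_OF_ELEMENTS times, so a line with fewer than two tokens would fail again inside the catch block, and any extra tokens would silently stay 0. A minimal sketch of a per-line parser that follows the actual token count is below. It is plain Java with no Mahout dependency, and the names LineParser and parseLine are made up for illustration.

```java
import java.util.Arrays;

final class LineParser {

    /**
     * Splits a line on runs of whitespace and parses every token.
     * Tokens that are not valid numbers become 0.0, mirroring the
     * fallback in the original catch block.
     */
    static double[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            try {
                values[i] = Double.parseDouble(tokens[i]);
            } catch (NumberFormatException ex) {
                values[i] = 0.0;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // One of the sample lines from synthetic_control.data.
        System.out.println(Arrays.toString(parseLine("0.01 1.0")));
    }
}
```

The returned array could then be handed to vec.assign(d) in the same way as in the original method.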


B. createClusters()

The next method generates the clusters. When I run it, it produces this log:

INFO CanopyDriver - Build Clusters Input: C:/Users/xxxxxxxx/Desktop/BI/synthetic_control.seq Out: C:/Users/xxxxxxxx/Desktop/BI/output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@2224ece4 t1: 2.0 t2: 3.0
DEBUG CanopyClusterer - Created new Canopy:0 at center:[0.010, 1.000]
DEBUG CanopyClusterer - Added point: [0.100, 0.900] to canopy: C-0
DEBUG CanopyClusterer - Added point: [0.100, 0.950] to canopy: C-0
DEBUG CanopyClusterer - Created new Canopy:1 at center:[12.000, 13.000]
DEBUG CanopyClusterer - Added point: [12.500, 12.800] to canopy: C-1
DEBUG CanopyDriver - Writing Canopy:C-0 center:[0.070, 0.950] numPoints:3 radius:[0.042, 0.041]
DEBUG CanopyDriver - Writing Canopy:C-1 center:[12.250, 12.900] numPoints:2 radius:[0.250, 0.100]
DEBUG FileSystem - Starting clear of FileSystem cache with 1 elements.
DEBUG FileSystem - Removing filesystem for file:///
DEBUG FileSystem - Removing filesystem for file:///
DEBUG FileSystem - Done clearing cache

and I can see more files in my directory:

>tree /F
C:.
│ .synthetic_control.seq.crc
│ synthetic_control.data
│ synthetic_control.seq

└───output
├───clusteredPoints
│ .part-m-0.crc
│ part-m-0

└───clusters-0-final
.part-r-00000.crc
._policy.crc
part-r-00000
_policy

The log shows that everything worked: there are 2 clusters containing the expected points.
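
The log can also be cross-checked by hand. The following is a simplified, single-pass canopy assignment in plain Java, not Mahout's implementation: a point joins every canopy whose seed is closer than t1, and it seeds a new canopy unless some seed is closer than t2. The class and method names are hypothetical; the thresholds (t1 = 2.0, t2 = 3.0) and the five points are taken from the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CanopySketch {

    /** A canopy: the point that seeded it plus every point assigned to it. */
    static final class Canopy {
        final double[] seed;
        final List<double[]> points = new ArrayList<>();

        Canopy(double[] seed) {
            this.seed = seed;
            points.add(seed);
        }

        /** Component-wise mean of the assigned points. */
        double[] centroid() {
            double[] mean = new double[seed.length];
            for (double[] p : points) {
                for (int i = 0; i < mean.length; i++) {
                    mean[i] += p[i] / points.size();
                }
            }
            return mean;
        }
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Single pass over the data, comparing each point with existing seeds. */
    static List<Canopy> cluster(double[][] data, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] point : data) {
            boolean stronglyBound = false;
            for (Canopy canopy : canopies) {
                double dist = euclidean(canopy.seed, point);
                if (dist < t1) {
                    canopy.points.add(point);
                }
                if (dist < t2) {
                    stronglyBound = true;
                }
            }
            if (!stronglyBound) {
                canopies.add(new Canopy(point));
            }
        }
        return canopies;
    }

    public static void main(String[] args) {
        double[][] data = {
            {0.01, 1.0}, {0.1, 0.9}, {0.1, 0.95}, {12.0, 13.0}, {12.5, 12.8}
        };
        for (Canopy c : cluster(data, 2.0, 3.0)) {
            System.out.println(c.points.size() + " points, centroid "
                    + Arrays.toString(c.centroid()));
        }
    }
}
```

With the five sample points this yields two canopies of 3 and 2 points, whose means are approximately [0.07, 0.95] and [12.25, 12.9], in line with the centers in the log.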

C. printClusters()

Here is my problem.

I get no errors, but nothing is printed to the console. My code never enters the while loop.

Thank you for any help

Re: How to print data after canopy clustering

Master Collaborator

Do the files have data in them? I would double-check that they are not 0-length, but I doubt it. What directory do you find the files in? I suspect its name is something like "part-m-00000", but your code appears to be looking for "part-m-0".
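
One way to avoid guessing the file name is to list whatever part files the job actually wrote. A minimal sketch using java.nio on the local filesystem follows (the log in the question shows file:/// in use). PartFileLister and listPartFiles are hypothetical names. On a Hadoop FileSystem, FileSystem.globStatus with a pattern such as part-* plays a similar role.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PartFileLister {

    /**
     * Returns the sorted names of all entries in dir whose name starts
     * with "part-". Checksum files such as ".part-m-0.crc" start with a
     * dot and therefore do not match.
     */
    static List<String> listPartFiles(Path dir) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // The directory is passed on the command line, e.g. the clusteredPoints folder.
        if (args.length != 1) {
            System.out.println("usage: PartFileLister <directory>");
            return;
        }
        System.out.println(listPartFiles(Paths.get(args[0])));
    }
}
```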

Re: How to print data after canopy clustering

Hi,

 

In fact, I was expecting a file named part-m-00000.

 

Before I run my program, only the file C:\root\BI\synthetic_control.data exists, with this data:

 

0.01 1.0
0.1 0.9
0.1 0.95
12.0 13.0
12.5 12.8

When I run the convertToVectorFile() method, I can see 2 new files:

 

│   .synthetic_control.seq.crc
│   synthetic_control.data
│   synthetic_control.seq

 


When I run the createClusters() method, I can see a few new files:

 


│   .synthetic_control.seq.crc
│   synthetic_control.data
│   synthetic_control.seq

└───output
    ├───clusteredPoints
    │       .part-m-0.crc
    │       part-m-0
    │
    └───clusters-0-final
            .part-r-00000.crc
            ._policy.crc
            part-r-00000
            _policy

 


Because they contain a lot of strange characters, I uploaded these files here: All files


File part-m-00000 does not exist...

 

Thank you for your help

Re: How to print data after canopy clustering

Master Collaborator

Yeah, that's why I'm confused here. Is it not the part-r-00000 that likely has the data?

The format is a binary serialization and you can't open it as if it is a text file.
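
Although the payload is binary, the start of a SequenceFile names the key and value classes, which tells you what types to pass to the reader. The sketch below peeks at that header in plain Java. It assumes the common layout (the bytes SEQ, one version byte, then the two class names, each prefixed by a single length byte) and only handles names shorter than 128 bytes. SeqHeaderPeek is a hypothetical helper, not part of Hadoop. With Hadoop on the classpath, SequenceFile.Reader's getKeyClassName() and getValueClassName() report the same information.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class SeqHeaderPeek {

    /** Returns { keyClassName, valueClassName } read from the file header. */
    static String[] readKeyValueClassNames(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] magic = new byte[3];
        data.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("not a SequenceFile: missing SEQ magic bytes");
        }
        data.readByte(); // format version, not needed here
        return new String[] { readShortString(data), readShortString(data) };
    }

    /** Reads a string stored as one length byte followed by UTF-8 bytes. */
    private static String readShortString(DataInputStream data) throws IOException {
        int length = data.readByte();
        if (length < 0) {
            // Assumption: longer names use a multi-byte length, not handled in this sketch.
            throw new IOException("multi-byte length not supported in this sketch");
        }
        byte[] bytes = new byte[length];
        data.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: SeqHeaderPeek <sequence-file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            String[] names = readKeyValueClassNames(in);
            System.out.println("key class:   " + names[0]);
            System.out.println("value class: " + names[1]);
        }
    }
}
```

If the names printed for a file differ from the key and value objects passed to next(key, value), that mismatch is worth fixing before anything else.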

Re: How to print data after canopy clustering

Hi,

 

I changed this line:

 

private final static String partMDir = outputDir + "\\"
+ Cluster.CLUSTERED_POINTS_DIR + "\\part-m-0"; 

 

and now I have got:

 

private final static String partMDir = outputDir + "\\" + "clusters-0-final" + "\\part-r-00000";

 

When I run my code, I get an exception:

 

java.io.IOException: wrong value class: wt: 0.0  vec: null is not class org.apache.mahout.clustering.iterator.ClusterWritable
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1936)
	at com.my.packagee.bi.canopy.CanopyClustering.printClusters(CanopyClustering.java:129)
	at com.my.package.bi.BIManager.printClusters(BIManager.java:20)
	at com.my.package.bi.Main.main(Main.java:15)

 

which comes from this line:

 

while (readerSequence.next(key, value)) {

 

I also changed the pom.xml file a little; maybe there is a problem there, e.g. with the version.

 

                <mahout.version>0.9</mahout.version>
                
                <mahout.groupid>org.apache.mahout</mahout.groupid>

                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-core</artifactId>
                        <version>${mahout.version}</version>
                </dependency>
                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-core</artifactId>
                        <type>test-jar</type>
                        <scope>test</scope>
                        <version>${mahout.version}</version>
                </dependency>
                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-math</artifactId>
                        <version>${mahout.version}</version>
                </dependency>
                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-math</artifactId>
                        <type>test-jar</type>
                        <scope>test</scope>
                        <version>${mahout.version}</version>
                </dependency>
                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-examples</artifactId>
                        <version>${mahout.version}</version>
                </dependency>

 

Thank you in advance

 

Re: How to print data after canopy clustering

Master Collaborator

I think you have found the right file then, but it is saying that it did not generate cluster centers. Maybe the data is too small. This might be better as a question on the Mahout mailing list as to what that means.

Re: How to print data after canopy clustering

Thank you for your message.

 

I am not sure if you are right...

 

Here you can see the full log:

 

DEBUG CanopyClusterer - Created new Canopy:0 at center:[0.010, 1.000]
DEBUG CanopyClusterer - Added point: [0.100, 0.900] to canopy: C-0
DEBUG CanopyClusterer - Added point: [0.100, 0.950] to canopy: C-0
DEBUG CanopyClusterer - Created new Canopy:1 at center:[12.000, 13.000]
DEBUG CanopyClusterer - Added point: [12.500, 12.800] to canopy: C-1
DEBUG CanopyDriver - Writing Canopy:C-0 center:[0.070, 0.950] numPoints:3 radius:[0.042, 0.041]
DEBUG CanopyDriver - Writing Canopy:C-1 center:[12.250, 12.900] numPoints:2 radius:[0.250, 0.100]

 So it seems to be ok...

Re: How to print data after canopy clustering

Master Collaborator

OK, is the file nonempty? I think the data is not in the format you expect then. From skimming the code, it looks like the output is Text + ClusterWritable, not IntWritable + WeightedPropertyVectorWritable.  You are trying to print the cluster centroids, right?

Re: How to print data after canopy clustering

Thank you for your effort.

 

No, this file is not empty; you can check it here: part-r-00000

 

I would like to see all the vectors, each with the cluster it belongs to.

 

It would also be nice to see the centers of the clusters.

 

I changed

 

 

IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();

 to this

 

Text key = new Text();
ClusterWritable value = new ClusterWritable();

I no longer get an exception, but the output is:

 

org.apache.mahout.clustering.iterator.ClusterWritable@572c4a12 belongs to cluster C-0
org.apache.mahout.clustering.iterator.ClusterWritable@572c4a12 belongs to cluster C-1

 ---

EDIT:

 

I changed

 

value.toString()

 to

 

value.getValue()

and now I get this output:

 

C-0: {0:0.07,1:0.9499999999999998} belongs to cluster C-0
C-1: {0:12.25,1:12.9} belongs to cluster C-1
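
The first output appears to be the default Object.toString() (class name, @, hash code), which is what Java prints when a class does not override toString(). Calling getValue() returns the wrapped cluster, whose own toString() is readable. A small stand-alone illustration with hypothetical stand-in classes (Wrapper and Centroid are not Mahout types):

```java
final class ToStringDemo {

    /** Stand-in for a wrapper type that does not override toString(). */
    static final class Wrapper {
        private final Object value;

        Wrapper(Object value) {
            this.value = value;
        }

        Object getValue() {
            return value;
        }
    }

    /** Stand-in for the wrapped object, which does override toString(). */
    static final class Centroid {
        private final String id;
        private final double x;
        private final double y;

        Centroid(String id, double x, double y) {
            this.id = id;
            this.x = x;
            this.y = y;
        }

        @Override
        public String toString() {
            return id + ": {0:" + x + ",1:" + y + "}";
        }
    }

    public static void main(String[] args) {
        Wrapper wrapper = new Wrapper(new Centroid("C-1", 12.25, 12.9));
        // Default Object.toString(): class name, '@', hex hash code.
        System.out.println(wrapper);
        // The wrapped object's own toString().
        System.out.println(wrapper.getValue());
    }
}
```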

 

Thank you very much !!!!