
How to print data after canopy clustering


Hi Experts,

Here is a simple piece of code that I wrote:


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;


public class Clustering {

    private final static String root = "C:\\root\\BI\\";
    private final static String dataDir = root + "synthetic_control.data";
    private final static String seqDir = root + "synthetic_control.seq";
    private final static String outputDir = root + "output";
    private final static String partMDir = outputDir + "\\"
            + Cluster.CLUSTERED_POINTS_DIR + "\\part-m-0";

    private final static String SEPARATOR = " ";

    private final static int NUMBER_OF_ELEMENTS = 2;

    private Configuration conf;
    private FileSystem fs;

    public Clustering() throws IOException {
        conf = new Configuration();
        fs = FileSystem.get(conf);
    }

    public void convertToVectorFile() throws IOException {

        BufferedReader reader = new BufferedReader(new FileReader(dataDir));
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
                new Path(seqDir), LongWritable.class, VectorWritable.class);

        String line;
        long counter = 0;
        while ((line = reader.readLine()) != null) {
            String[] c;
            c = line.split(SEPARATOR);
            double[] d = new double[c.length];
            for (int i = 0; i < NUMBER_OF_ELEMENTS; i++) {
                try {
                    d[i] = Double.parseDouble(c[i]);

                } catch (Exception ex) {
                    d[i] = 0;
                }
            }

            Vector vec = new RandomAccessSparseVector(c.length);
            vec.assign(d);

            VectorWritable writable = new VectorWritable();
            writable.set(vec);
            writer.append(new LongWritable(counter++), writable);
        }
        writer.close();
    }

    public void createClusters(double t1, double t2,
            double clusterClassificationThreshold, boolean runSequential)
            throws ClassNotFoundException, IOException, InterruptedException {

        EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
        Path inputPath = new Path(seqDir);
        Path outputPath = new Path(outputDir);

        CanopyDriver.run(inputPath, outputPath, measure, t1, t2, runSequential,
                clusterClassificationThreshold, runSequential);
    }

    public void printClusters() throws IOException {
        SequenceFile.Reader readerSequence = new SequenceFile.Reader(fs,
                new Path(partMDir), conf);

        IntWritable key = new IntWritable();
        WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
        while (readerSequence.next(key, value)) {
            System.out.println(value.toString() + " belongs to cluster "
                    + key.toString());
        }
        readerSequence.close();
    }
}

The class has 3 methods.

A. convertToVectorFile()

This method takes the file C:\root\BI\synthetic_control.data and converts it into a sequence file of vectors (I was following the book Mahout in Action).

For file:

0.01 1.0
0.1 0.9
0.1 0.95
12.0 13.0
12.5 12.8

it generated the following structure:

>tree /F
C:.
.synthetic_control.seq.crc
synthetic_control.data
synthetic_control.seq

with this log in Eclipse:

DEBUG Groups - Creating new Groups object
DEBUG Groups - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
DEBUG UserGroupInformation - hadoop login
DEBUG UserGroupInformation - hadoop login commit
DEBUG UserGroupInformation - using local user:NTUserPrincipal : xxxxxxxx
DEBUG UserGroupInformation - UGI loginUserxxxxxxx
DEBUG FileSystem - Creating filesystem for file:///
DEBUG NativeCodeLoader - Trying to load the custom-built native-hadoop library...
DEBUG NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
DEBUG NativeCodeLoader - java.library.path=C:\Program Files\Java\jre7\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\MATLAB\R2009b\runtime\win64;C:\Program Files\MATLAB\R2009b\bin;C:\Program Files\TortoiseSVN\bin;C:\Users\xxxxxxxx\Documents\apache-maven-3.1.1\bin;.
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


B. createClusters()

The next method generates the clusters. When I run it, it gives me this log:

INFO CanopyDriver - Build Clusters Input: C:/Users/xxxxxxxx/Desktop/BI/synthetic_control.seq Out: C:/Users/xxxxxxxx/Desktop/BI/output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@2224ece4 t1: 2.0 t2: 3.0
DEBUG CanopyClusterer - Created new Canopy:0 at center:[0.010, 1.000]
DEBUG CanopyClusterer - Added point: [0.100, 0.900] to canopy: C-0
DEBUG CanopyClusterer - Added point: [0.100, 0.950] to canopy: C-0
DEBUG CanopyClusterer - Created new Canopy:1 at center:[12.000, 13.000]
DEBUG CanopyClusterer - Added point: [12.500, 12.800] to canopy: C-1
DEBUG CanopyDriver - Writing Canopy:C-0 center:[0.070, 0.950] numPoints:3 radius:[0.042, 0.041]
DEBUG CanopyDriver - Writing Canopy:C-1 center:[12.250, 12.900] numPoints:2 radius:[0.250, 0.100]
DEBUG FileSystem - Starting clear of FileSystem cache with 1 elements.
DEBUG FileSystem - Removing filesystem for file:///
DEBUG FileSystem - Removing filesystem for file:///
DEBUG FileSystem - Done clearing cache

and I can see more files in my directory:

>tree /F
C:.
│   .synthetic_control.seq.crc
│   synthetic_control.data
│   synthetic_control.seq
│
└───output
    ├───clusteredPoints
    │       .part-m-0.crc
    │       part-m-0
    │
    └───clusters-0-final
            .part-r-00000.crc
            ._policy.crc
            part-r-00000
            _policy

The log shows that everything worked well: there are 2 clusters, each containing the expected points.
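
For readers who want to check the log by hand, here is a minimal sketch of a single canopy pass in plain Java. It is a simplified illustration and not Mahout's implementation: CanopySketch and its members are hypothetical names, and the sketch assumes that distances are measured to each canopy's first point and that the reported center is the mean of the points added to the canopy. The thresholds 2.0 and 3.0 are the t1 and t2 values shown in the log above.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    /** A canopy keeps its seed point for distance checks and a running sum for the final center. */
    static final class Canopy {
        final double[] seed;
        final double[] sum;
        int count;

        Canopy(double[] point) {
            seed = point.clone();
            sum = point.clone();
            count = 1;
        }

        void add(double[] point) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }

        /** Mean of all points added to this canopy. */
        double[] center() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    /** One pass over the points: join every canopy closer than t1, start a new one if none is closer than t2. */
    static List<Canopy> cluster(double[][] points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (Canopy c : canopies) {
                double dist = euclidean(c.seed, p);
                if (dist < t1) {
                    c.add(p);
                }
                if (dist < t2) {
                    stronglyBound = true;
                }
            }
            if (!stronglyBound) {
                canopies.add(new Canopy(p));
            }
        }
        return canopies;
    }

    public static void main(String[] args) {
        // The five points from synthetic_control.data in the question.
        double[][] points = {{0.01, 1.0}, {0.1, 0.9}, {0.1, 0.95}, {12.0, 13.0}, {12.5, 12.8}};
        List<Canopy> canopies = cluster(points, 2.0, 3.0);
        for (int i = 0; i < canopies.size(); i++) {
            double[] c = canopies.get(i).center();
            System.out.println("C-" + i + " center:[" + c[0] + ", " + c[1] + "] numPoints:" + canopies.get(i).count);
        }
    }
}
```

With the five input points, this gives two canopies holding 3 and 2 points, with centers close to (0.07, 0.95) and (12.25, 12.9), which agrees with the "Writing Canopy" lines in the log.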

C. printClusters()

Here is my problem.

I get no errors, but I cannot see any results in the console. My code never enters the while loop.

Thank you for any help.



Re: How to print data after canopy clustering

Master Collaborator

Do the files have data in them? I would double-check that they are not 0-length, but I doubt it. What directory do you find the files in? I suspect the file's name is something like "part-m-00000", but your code appears to be reading "part-m-0".
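
One way to avoid guessing the zero padding in the name is to list the part files instead of hardcoding one. The following is a minimal sketch using only java.nio.file. PartFileFinder and findPartFiles are hypothetical names, and the sketch assumes the output directory is on the local file system, as in the original post.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartFileFinder {

    /**
     * Returns the regular files in dir whose names start with "part-", sorted by name.
     * Checksum files such as ".part-m-0.crc" and files such as "_policy" do not match the glob.
     */
    static List<Path> findPartFiles(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PartFileFinder <output-subdirectory>");
            return;
        }
        // Print each data file with its size, so that empty files are easy to spot.
        for (Path p : findPartFiles(Paths.get(args[0]))) {
            System.out.println(p + " (" + Files.size(p) + " bytes)");
        }
    }
}
```

Printing the size alongside the name also answers the other question here, namely whether a part file is 0-length.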


Re: How to print data after canopy clustering

Hi,

In fact, I was expecting a file named part-m-00000.

Before I run my program, only the file C:\root\BI\synthetic_control.data exists, with this data:


0.01 1.0
0.1 0.9
0.1 0.95
12.0 13.0
12.5 12.8

When I run the convertToVectorFile() method, I can see 2 new files:

│   .synthetic_control.seq.crc
│   synthetic_control.data
│   synthetic_control.seq

When I run the createClusters() method, I can see a few new files:

│   .synthetic_control.seq.crc
│   synthetic_control.data
│   synthetic_control.seq

└───output
    ├───clusteredPoints
    │       .part-m-0.crc
    │       part-m-0
    │
    └───clusters-0-final
            .part-r-00000.crc
            ._policy.crc
            part-r-00000
            _policy

 


Because the files contain a lot of strange characters, I uploaded them here: All files

The file part-m-00000 does not exist...

Thank you for your help


Re: How to print data after canopy clustering

Master Collaborator

Yeah, that's why I'm confused here. Isn't it part-r-00000 that likely has the data?

The format is a binary serialization, so you can't open it as if it were a text file.
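
Although the body of the file is binary, the start of a SequenceFile names the key and value classes it was written with, which is exactly what a reader needs to know. The following is a rough sketch in plain Java that peeks at that header. SeqHeaderPeek and its methods are hypothetical names. The sketch assumes the header layout used by recent SequenceFile versions (the bytes "SEQ", one version byte, then the two class names as length-prefixed UTF-8 strings) and only handles names shorter than 128 bytes, where the length fits in a single byte.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class SeqHeaderPeek {

    /** Reads the key and value class names from the start of a SequenceFile stream. */
    static String[] readKeyValueClassNames(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("Not a SequenceFile: missing SEQ magic bytes");
        }
        in.readByte(); // format version byte, not interpreted in this sketch
        String keyClass = readShortString(in);
        String valueClass = readShortString(in);
        return new String[] {keyClass, valueClass};
    }

    /** Assumption: the length prefix is a single non-negative byte (names under 128 bytes). */
    private static String readShortString(DataInputStream in) throws IOException {
        int length = in.readByte();
        if (length < 0) {
            throw new IOException("Multi-byte length prefix is not supported by this sketch");
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SeqHeaderPeek <sequence-file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            String[] names = readKeyValueClassNames(in);
            System.out.println("key class:   " + names[0]);
            System.out.println("value class: " + names[1]);
        }
    }
}
```

Inside a Hadoop program the same information is available from the reader itself: SequenceFile.Reader has getKeyClassName() and getValueClassName(), so printing those before calling next() shows which Writable types to instantiate.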


Re: How to print data after canopy clustering

Hi,

I changed this line:

private final static String partMDir = outputDir + "\\"
+ Cluster.CLUSTERED_POINTS_DIR + "\\part-m-0";

to this:

private final static String partMDir = outputDir + "\\" + "clusters-0-final" + "\\part-r-00000";

When I run my code, I now get an exception:

java.io.IOException: wrong value class: wt: 0.0  vec: null is not class org.apache.mahout.clustering.iterator.ClusterWritable
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1936)
	at com.my.packagee.bi.canopy.CanopyClustering.printClusters(CanopyClustering.java:129)
	at com.my.package.bi.BIManager.printClusters(BIManager.java:20)
	at com.my.package.bi.Main.main(Main.java:15)

 

which comes from this line:

while (readerSequence.next(key, value)) {

I also changed the pom.xml file a little, so maybe there is a problem there, e.g. with the version:

                <mahout.version>0.9</mahout.version>
                
                <mahout.groupid>org.apache.mahout</mahout.groupid>

                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-core</artifactId>
                        <version>${mahout.version}</version>
                </dependency>
                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-core</artifactId>
                        <type>test-jar</type>
                        <scope>test</scope>
                        <version>${mahout.version}</version>
                </dependency>
                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-math</artifactId>
                        <version>${mahout.version}</version>
                </dependency>
                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-math</artifactId>
                        <type>test-jar</type>
                        <scope>test</scope>
                        <version>${mahout.version}</version>
                </dependency>
                <dependency>
                        <groupId>${mahout.groupid}</groupId>
                        <artifactId>mahout-examples</artifactId>
                        <version>${mahout.version}</version>
                </dependency>

 

Thank you in advance

 


Re: How to print data after canopy clustering

Master Collaborator

I think you have found the right file then, but it is saying that it did not generate cluster centers. Maybe the data is too small. This might be better as a question on the Mahout mailing list as to what that means.


Re: How to print data after canopy clustering

Thank you for your message.

I am not sure you are right...

Here is the full log:


DEBUG CanopyClusterer - Created new Canopy:0 at center:[0.010, 1.000]
DEBUG CanopyClusterer - Added point: [0.100, 0.900] to canopy: C-0
DEBUG CanopyClusterer - Added point: [0.100, 0.950] to canopy: C-0
DEBUG CanopyClusterer - Created new Canopy:1 at center:[12.000, 13.000]
DEBUG CanopyClusterer - Added point: [12.500, 12.800] to canopy: C-1
DEBUG CanopyDriver - Writing Canopy:C-0 center:[0.070, 0.950] numPoints:3 radius:[0.042, 0.041]
DEBUG CanopyDriver - Writing Canopy:C-1 center:[12.250, 12.900] numPoints:2 radius:[0.250, 0.100]

So it seems to be OK...


Re: How to print data after canopy clustering

Master Collaborator

OK, is the file nonempty? I think the data is not in the format you expect then. From skimming the code, it looks like the output is Text + ClusterWritable, not IntWritable + WeightedPropertyVectorWritable.  You are trying to print the cluster centroids, right?

(Accepted solution)


Re: How to print data after canopy clustering

Thank you for your effort.

No, this file is not empty; you can check it here: part-r-00000

I would like to see all the vectors, each with the cluster it belongs to.

It would also be nice to see the centers of the clusters.

I changed

IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();

 to this

 

Text key = new Text();
ClusterWritable value = new ClusterWritable();

I no longer get an exception, but the output is:

 

org.apache.mahout.clustering.iterator.ClusterWritable@572c4a12 belongs to cluster C-0
org.apache.mahout.clustering.iterator.ClusterWritable@572c4a12 belongs to cluster C-1

 ---

EDIT:

 

I changed

 

value.toString()

 to

 

value.getValue()

and now I get this output:

 

C-0: {0:0.07,1:0.9499999999999998} belongs to cluster C-0
C-1: {0:12.25,1:12.9} belongs to cluster C-1

 

Thank you very much !!!!
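
For later readers, the difference between the two outputs above is ordinary Java behaviour: a class that does not override toString() prints as its class name, an '@', and a hash code in hex, while the object returned by getValue() evidently has a readable toString(). A minimal stand-alone illustration, where Wrapper and Payload are hypothetical stand-ins for the writable and the cluster it holds and not Mahout classes:

```java
public class ToStringDemo {

    /** Hypothetical payload with a readable toString(), standing in for the cluster. */
    static final class Payload {
        private final String description;

        Payload(String description) {
            this.description = description;
        }

        @Override
        public String toString() {
            return description;
        }
    }

    /** Hypothetical wrapper that does not override toString(), standing in for the writable. */
    static final class Wrapper {
        private final Payload value;

        Wrapper(Payload value) {
            this.value = value;
        }

        Payload getValue() {
            return value;
        }
    }

    public static void main(String[] args) {
        Wrapper w = new Wrapper(new Payload("C-0: {0:0.07,1:0.95}"));
        System.out.println(w);            // inherited Object.toString(): class name, '@', hex hash
        System.out.println(w.getValue()); // the payload's own description
    }
}
```

The same reasoning explains why both lines of the earlier output showed the same hash: the loop reuses a single value object and only refills its contents on each call to next().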
