Member since: 05-21-2014
Posts: 9
Kudos Received: 0
Solutions: 0
05-21-2014 07:53 AM
Thank you for your effort. No, this file is not empty; here you can check it: part-r-00000. I would like to see all the vectors, with information about the cluster each one belongs to. It would also be nice to see the centers of the clusters.

I changed

    IntWritable key = new IntWritable();
    WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();

to this:

    Text key = new Text();
    ClusterWritable value = new ClusterWritable();

I no longer get any exception, but the output is:

    org.apache.mahout.clustering.iterator.ClusterWritable@572c4a12 belongs to cluster C-0
    org.apache.mahout.clustering.iterator.ClusterWritable@572c4a12 belongs to cluster C-1

EDIT: I changed value.toString() to value.getValue() and now I get this output:

    C-0: {0:0.07,1:0.9499999999999998} belongs to cluster C-0
    C-1: {0:12.25,1:12.9} belongs to cluster C-1

Thank you very much!
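For reference, here is a minimal sketch of the whole reader after both changes. The method name printClusterCenters and the partRDir constant (pointing at output\clusters-0-final\part-r-00000, as in my earlier post) are made-up names; fs and conf are the fields from the Clustering class in my first post.

    // Extra imports needed: org.apache.hadoop.io.Text,
    //                       org.apache.mahout.clustering.iterator.ClusterWritable
    public void printClusterCenters() throws IOException {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(partRDir), conf);
        Text key = new Text();                          // cluster identifier, e.g. "C-0"
        ClusterWritable value = new ClusterWritable();  // wraps the Cluster itself
        while (reader.next(key, value)) {
            // value.getValue() returns the Cluster; its toString() prints the center,
            // e.g. "C-0: {0:0.07,1:0.9499999999999998}"
            System.out.println(value.getValue() + " belongs to cluster " + key);
        }
        reader.close();
    }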
05-21-2014 07:15 AM
Thank you for your message. I am not sure if you are right... Here is the full log:

    DEBUG CanopyClusterer - Created new Canopy:0 at center:[0.010, 1.000]
    DEBUG CanopyClusterer - Added point: [0.100, 0.900] to canopy: C-0
    DEBUG CanopyClusterer - Added point: [0.100, 0.950] to canopy: C-0
    DEBUG CanopyClusterer - Created new Canopy:1 at center:[12.000, 13.000]
    DEBUG CanopyClusterer - Added point: [12.500, 12.800] to canopy: C-1
    DEBUG CanopyDriver - Writing Canopy:C-0 center:[0.070, 0.950] numPoints:3 radius:[0.042, 0.041]
    DEBUG CanopyDriver - Writing Canopy:C-1 center:[12.250, 12.900] numPoints:2 radius:[0.250, 0.100]

So the clustering itself seems to be fine: both canopies get the expected points and centers.
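A quick arithmetic check that the logged centers are just the means of the points assigned to each canopy (plain Java, nothing Mahout-specific):

    // C-0 got (0.01, 1.0), (0.1, 0.9), (0.1, 0.95); C-1 got (12.0, 13.0), (12.5, 12.8).
    double c0x = (0.01 + 0.1 + 0.1) / 3;   // 0.07
    double c0y = (1.0 + 0.9 + 0.95) / 3;   // 0.95
    double c1x = (12.0 + 12.5) / 2;        // 12.25
    double c1y = (13.0 + 12.8) / 2;        // 12.9
    // Matches "center:[0.070, 0.950]" and "center:[12.250, 12.900]" in the log above.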
05-21-2014 06:33 AM
Hi, I changed this line:

    private final static String partMDir = outputDir + "\\"
            + Cluster.CLUSTERED_POINTS_DIR + "\\part-m-0";

to this:

    private final static String partMDir = outputDir + "\\" + "clusters-0-final" + "\\part-r-00000";

Now when I run my code I get an exception:

    java.io.IOException: wrong value class: wt: 0.0 vec: null is not class org.apache.mahout.clustering.iterator.ClusterWritable
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1936)
        at com.my.packagee.bi.canopy.CanopyClustering.printClusters(CanopyClustering.java:129)
        at com.my.packagee.bi.BIManager.printClusters(BIManager.java:20)
        at com.my.packagee.bi.Main.main(Main.java:15)

It is thrown from this line:

    while (readerSequence.next(key, value)) {

I also changed the pom.xml file a little, so maybe there is a problem there, e.g. with the version:

    <mahout.version>0.9</mahout.version>
    <mahout.groupid>org.apache.mahout</mahout.groupid>

    <dependency>
        <groupId>${mahout.groupid}</groupId>
        <artifactId>mahout-core</artifactId>
        <version>${mahout.version}</version>
    </dependency>
    <dependency>
        <groupId>${mahout.groupid}</groupId>
        <artifactId>mahout-core</artifactId>
        <type>test-jar</type>
        <scope>test</scope>
        <version>${mahout.version}</version>
    </dependency>
    <dependency>
        <groupId>${mahout.groupid}</groupId>
        <artifactId>mahout-math</artifactId>
        <version>${mahout.version}</version>
    </dependency>
    <dependency>
        <groupId>${mahout.groupid}</groupId>
        <artifactId>mahout-math</artifactId>
        <type>test-jar</type>
        <scope>test</scope>
        <version>${mahout.version}</version>
    </dependency>
    <dependency>
        <groupId>${mahout.groupid}</groupId>
        <artifactId>mahout-examples</artifactId>
        <version>${mahout.version}</version>
    </dependency>

Thank you in advance
05-21-2014 05:10 AM
Hi, In fact, I was expecting a file named part-m-00000. Before I run my program, only the file C:\root\BI\synthetic_control.data exists, with this data:

    0.01 1.0
    0.1 0.9
    0.1 0.95
    12.0 13.0
    12.5 12.8

When I run the method convertToVectorFile() I can see 2 new files:

    .synthetic_control.seq.crc
    synthetic_control.data
    synthetic_control.seq

When I run the method createClusters() I can see a few new files:

    │   .synthetic_control.seq.crc
    │   synthetic_control.data
    │   synthetic_control.seq
    │
    └───output
        ├───clusteredPoints
        │       .part-m-0.crc
        │       part-m-0
        │
        └───clusters-0-final
                .part-r-00000.crc
                ._policy.crc
                part-r-00000
                _policy

Because there are a lot of strange characters, I uploaded these files here: All files. The file part-m-00000 does not exist... Thank you for your help
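In case it is useful, the part files can also be listed programmatically instead of via tree (sketch, using the fs and outputDir fields from my code):

    // Lists what was actually written under output\clusteredPoints, so the reader
    // path can match the real part-file name (part-m-0 here, not part-m-00000).
    // Extra import needed: org.apache.hadoop.fs.FileStatus
    for (FileStatus status : fs.listStatus(new Path(outputDir + "\\clusteredPoints"))) {
        System.out.println(status.getPath().getName());
    }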
05-21-2014 12:51 AM
Hi Experts,

Here is a simple piece of code I wrote:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.Cluster;
    import org.apache.mahout.clustering.canopy.CanopyDriver;
    import org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class Clustering {

        private final static String root = "C:\\root\\BI\\";
        private final static String dataDir = root + "synthetic_control.data";
        private final static String seqDir = root + "synthetic_control.seq";
        private final static String outputDir = root + "output";
        private final static String partMDir = outputDir + "\\"
                + Cluster.CLUSTERED_POINTS_DIR + "\\part-m-0";
        private final static String SEPARATOR = " ";
        private final static int NUMBER_OF_ELEMENTS = 2;

        private Configuration conf;
        private FileSystem fs;

        public Clustering() throws IOException {
            conf = new Configuration();
            fs = FileSystem.get(conf);
        }

        // Converts the space-separated text input into a SequenceFile of
        // LongWritable -> VectorWritable pairs.
        public void convertToVectorFile() throws IOException {
            BufferedReader reader = new BufferedReader(new FileReader(dataDir));
            SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
                    new Path(seqDir), LongWritable.class, VectorWritable.class);
            String line;
            long counter = 0;
            while ((line = reader.readLine()) != null) {
                String[] c = line.split(SEPARATOR);
                double[] d = new double[c.length];
                for (int i = 0; i < NUMBER_OF_ELEMENTS; i++) {
                    try {
                        d[i] = Double.parseDouble(c[i]);
                    } catch (Exception ex) {
                        d[i] = 0;
                    }
                }
                Vector vec = new RandomAccessSparseVector(c.length);
                vec.assign(d);
                VectorWritable writable = new VectorWritable();
                writable.set(vec);
                writer.append(new LongWritable(counter++), writable);
            }
            writer.close();
            reader.close();
        }

        // Runs canopy clustering over the sequence file written above.
        public void createClusters(double t1, double t2,
                double clusterClassificationThreshold, boolean runSequential)
                throws ClassNotFoundException, IOException, InterruptedException {
            EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
            Path inputPath = new Path(seqDir);
            Path outputPath = new Path(outputDir);
            CanopyDriver.run(inputPath, outputPath, measure, t1, t2, runSequential,
                    clusterClassificationThreshold, runSequential);
        }

        // Reads the clustered points back and prints each vector with its cluster.
        public void printClusters() throws IOException {
            SequenceFile.Reader readerSequence = new SequenceFile.Reader(fs,
                    new Path(partMDir), conf);
            IntWritable key = new IntWritable();
            WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
            while (readerSequence.next(key, value)) {
                System.out.println(value.toString() + " belongs to cluster " + key.toString());
            }
            readerSequence.close();
        }
    }

There are 3 different methods.

A. convertToVectorFile()

This method takes the file C:\root\BI\synthetic_control.data and converts it into a sequence file (I was following the book Mahout in Action). For the file:

    0.01 1.0
    0.1 0.9
    0.1 0.95
    12.0 13.0
    12.5 12.8

it generated the following structure:

    >tree /F
    C:.
        .synthetic_control.seq.crc
        synthetic_control.data
        synthetic_control.seq

with this log in Eclipse:

    DEBUG Groups - Creating new Groups object
    DEBUG Groups - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
    DEBUG UserGroupInformation - hadoop login
    DEBUG UserGroupInformation - hadoop login commit
    DEBUG UserGroupInformation - using local user:NTUserPrincipal : xxxxxxxx
    DEBUG UserGroupInformation - UGI loginUserxxxxxxx
    DEBUG FileSystem - Creating filesystem for file:///
    DEBUG NativeCodeLoader - Trying to load the custom-built native-hadoop library...
    DEBUG NativeCodeLoader - Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
    DEBUG NativeCodeLoader - java.library.path=C:\Program Files\Java\jre7\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\MATLAB\R2009b\runtime\win64;C:\Program Files\MATLAB\R2009b\bin;C:\Program Files\TortoiseSVN\bin;C:\Users\xxxxxxxx\Documents\apache-maven-3.1.1\bin;.
    WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

B. createClusters()

The next method generates the clusters. When I run it, it gives me this log:

    INFO CanopyDriver - Build Clusters Input: C:/Users/xxxxxxxx/Desktop/BI/synthetic_control.seq Out: C:/Users/xxxxxxxx/Desktop/BI/output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@2224ece4 t1: 2.0 t2: 3.0
    DEBUG CanopyClusterer - Created new Canopy:0 at center:[0.010, 1.000]
    DEBUG CanopyClusterer - Added point: [0.100, 0.900] to canopy: C-0
    DEBUG CanopyClusterer - Added point: [0.100, 0.950] to canopy: C-0
    DEBUG CanopyClusterer - Created new Canopy:1 at center:[12.000, 13.000]
    DEBUG CanopyClusterer - Added point: [12.500, 12.800] to canopy: C-1
    DEBUG CanopyDriver - Writing Canopy:C-0 center:[0.070, 0.950] numPoints:3 radius:[0.042, 0.041]
    DEBUG CanopyDriver - Writing Canopy:C-1 center:[12.250, 12.900] numPoints:2 radius:[0.250, 0.100]
    DEBUG FileSystem - Starting clear of FileSystem cache with 1 elements.
    DEBUG FileSystem - Removing filesystem for file:///
    DEBUG FileSystem - Removing filesystem for file:///
    DEBUG FileSystem - Done clearing cache

and I can see more files in my directory:

    >tree /F
    C:.
    │   .synthetic_control.seq.crc
    │   synthetic_control.data
    │   synthetic_control.seq
    │
    └───output
        ├───clusteredPoints
        │       .part-m-0.crc
        │       part-m-0
        │
        └───clusters-0-final
                .part-r-00000.crc
                ._policy.crc
                part-r-00000
                _policy

Reading the log, we can see that everything worked well: we got 2 clusters with the proper points.

C. printClusters()

Here is my problem. I get no errors, but I cannot see any results in the console; my code never enters the while loop.

Thank you for any help
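One thing I will try: since the loop runs zero times without any exception, maybe the part file simply contains no records. A quick check (sketch, reusing the fs field and the partMDir constant above):

    // An empty SequenceFile still has a small header, so a near-minimal file
    // length suggests there are no records for next() to iterate over.
    long len = fs.getFileStatus(new Path(partMDir)).getLen();
    System.out.println(partMDir + " is " + len + " bytes");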
Labels:
- Apache Hadoop
- Security