Support Questions

emad_m_refai · ‎07-19-2016

I have a VectorWritable (org.apache.mahout.math.VectorWritable) that comes from a sequence file generated by Mahout, written roughly as follows.

public void write(List<Vector> points, int clustersNumber, HdfsConnector connector) throws IOException {
    this.writePointsToFile(new Path(connector.getPointsInput(), "pointsInput"), connector.getFs(), connector.getConf(), points);
    Path clusterCentroids = new Path(connector.getClustersInput(), "part-0");
    SequenceFile.Writer writer = SequenceFile.createWriter(
            connector.getConf(), Writer.file(clusterCentroids), Writer.keyClass(Text.class), Writer.valueClass(Kluster.class));
    List<Vector> centroids = getCentroids;
    for (int i = 0; i < centroids.size(); i++) {
        Vector vect = centroids.get(i);
        Kluster centroidCluster = new Kluster(vect, i, new SquaredEuclideanDistanceMeasure());
        writer.append(new Text((centroidCluster).getIdentifier()),
                centroidCluster);
    }
    writer.close();
}

I would like to convert it into Spark's Vector type (org.apache.spark.mllib.linalg.Vector) as a JavaRDD<Vector>. How can I do that in Java?

I've read something about sequenceFile in Spark but I couldn't figure out how to do it.

mgaido · ‎07-20-2016

You can convert an org.apache.mahout.math.Vector into an org.apache.spark.mllib.linalg.Vector by using the iterateNonZero() or iterateAll() methods of org.apache.mahout.math.Vector.

If your Vector is sparse, the first option is the best. In this case you can build two arrays via iterateNonZero(): one containing all the non-zero indexes and the other containing the corresponding values, i.e.

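import org.apache.mahout.math.Vector.Element;
import org.apache.spark.mllib.linalg.Vectors;
ArrayList<Double> values = new ArrayList<Double>();
ArrayList<Integer> indexes = new ArrayList<Integer>();
org.apache.mahout.math.Vector v = ...
Iterator<Element> it = v.iterateNonZero();
while (it.hasNext()) {
	Element e = it.next();
	values.add(e.get());
	indexes.add(e.index());
}
// Vectors.sparse takes primitive int[] and double[] arrays, so unbox the lists first.
int[] idx = new int[indexes.size()];
double[] vals = new double[values.size()];
for (int i = 0; i < idx.length; i++) {
	idx[i] = indexes.get(i);
	vals[i] = values.get(i);
}
org.apache.spark.mllib.linalg.Vector sparkVector = Vectors.sparse(v.size(), idx, vals);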

You can do the same thing if you have a dense Vector using the iterateAll() method and Vectors.dense.
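
One detail worth checking when building the arrays: Spark documents that the index array passed to Vectors.sparse must be strictly increasing, and a hash-backed Mahout vector such as RandomAccessSparseVector may not iterate its non-zero elements in index order. The sketch below uses only the JDK and shows one way to sort and unbox the collected entries. The class VectorArrays, its methods, and the sample entries are hypothetical names made up for illustration; the Map<Integer, Double> stands in for the (index, value) pairs you would collect from the Mahout iterator. Newer Mahout releases appear to expose the same iteration as nonZeroes() and all(), so check the Vector Javadoc for your version.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class VectorArrays {

    /** Parallel primitive arrays in the shape Vectors.sparse(size, indices, values) takes. */
    static final class SparseParts {
        final int[] indices;
        final double[] values;

        SparseParts(int[] indices, double[] values) {
            this.indices = indices;
            this.values = values;
        }
    }

    /** Sorts the non-zero entries by index and unboxes them into primitive arrays. */
    static SparseParts toSparseParts(Map<Integer, Double> nonZeros) {
        TreeMap<Integer, Double> sorted = new TreeMap<>(nonZeros); // keys ascending
        int[] indices = new int[sorted.size()];
        double[] values = new double[sorted.size()];
        int i = 0;
        for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
            indices[i] = e.getKey();
            values[i] = e.getValue();
            i++;
        }
        return new SparseParts(indices, values);
    }

    /** Expands the same entries into a full-length array, the shape Vectors.dense(values) takes. */
    static double[] toDenseArray(int size, Map<Integer, Double> nonZeros) {
        double[] dense = new double[size]; // Java zero-initialises the array
        for (Map.Entry<Integer, Double> e : nonZeros.entrySet()) {
            dense[e.getKey()] = e.getValue();
        }
        return dense;
    }

    public static void main(String[] args) {
        // Hypothetical entries, inserted out of index order on purpose.
        Map<Integer, Double> nonZeros = new HashMap<>();
        nonZeros.put(7, 2.5);
        nonZeros.put(1, 4.0);
        nonZeros.put(3, -1.0);

        SparseParts parts = toSparseParts(nonZeros);
        System.out.println(Arrays.toString(parts.indices));
        System.out.println(Arrays.toString(parts.values));
        System.out.println(Arrays.toString(toDenseArray(9, nonZeros)));
    }
}
```

With Spark on the classpath, the results would then be passed on as Vectors.sparse(size, parts.indices, parts.values) or Vectors.dense(denseArray). The TreeMap costs an extra sort, which is usually negligible next to reading the sequence file.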

emad_m_refai · ‎07-20-2016

@Marco Gaido thank you for your answer, it's really helpful.

Can you please tell me how to store the vectors in HDFS after converting them,

and then read them back from HDFS to use them in Spark KMeans for clustering,

as in KMeansModel clusters = KMeans.train

mgaido · ‎07-20-2016

The easiest way is to save the RDD with the saveAsObjectFile method and read it back with the objectFile method. Both are described in the Spark documentation.
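
A minimal sketch of that round trip, assuming Spark core and MLlib are on the classpath and a local master is acceptable for a trial run. The class VectorRoundTrip, the method saveLoadAndCluster, the sample vectors, and the default output path are all hypothetical and only there for illustration; to write to HDFS you would pass an hdfs:// URI as the path.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

class VectorRoundTrip {

    /** Saves the vectors, reads them back, and trains KMeans on the reloaded RDD. */
    static KMeansModel saveLoadAndCluster(JavaSparkContext sc, List<Vector> vectors,
                                          String path, int k, int maxIterations) {
        JavaRDD<Vector> points = sc.parallelize(vectors);
        points.saveAsObjectFile(path); // the output path must not exist yet

        JavaRDD<Vector> reloaded = sc.objectFile(path);
        reloaded.cache(); // KMeans makes several passes over the data
        return KMeans.train(reloaded.rdd(), k, maxIterations);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("vector-round-trip").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // Hypothetical data: two small, well-separated groups of points.
            List<Vector> vectors = Arrays.asList(
                    Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
                    Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1));
            String path = args.length > 0 ? args[0] : "/tmp/spark-vectors-" + System.nanoTime();
            KMeansModel model = saveLoadAndCluster(sc, vectors, path, 2, 20);
            for (Vector center : model.clusterCenters()) {
                System.out.println(center);
            }
        } finally {
            sc.stop();
        }
    }
}
```

Object files rely on Java serialization, so they are convenient between Spark jobs but not meant to be read by other tools.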
