
Hashing trick

Hello,

I am dealing with genetic data and I would like to perform the hashing trick to reduce the number of features. I have written the following code:

val unparseddata = sc.textFile("hdfs:///tmp/new_cromosoma12v2.csv")

val data = unparseddata.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.last % 2, Vectors.dense(parts.slice(0, parts.length - 1)))
}

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(data)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
val idfIgnore = new IDF(minDocFreq = 2).fit(tf)
val tfidfIgnore: RDD[Vector] = idfIgnore.transform(tf)

val splits = tfidfIgnore.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

But I have this error:

<console>:90: error: inferred type arguments [org.apache.spark.mllib.regression.LabeledPoint] do not conform to method transform's type parameter bounds [D <: Iterable[_]]
       val tf: RDD[Vector] = hashingTF.transform(data)

Do you know how I can solve this problem?

Thank you,

Laia

Re: Hashing trick

@Laia Subparts There's a variant of HashingTF in the org.apache.spark.ml.feature package that operates on DataFrames (or Datasets in Spark 2.0) and will be a better fit for what you're trying to do. Rather than using a LabeledPoint, you'd just keep the label as one of your columns. The example at https://spark.apache.org/docs/latest/ml-features.html#tf-idf includes a use of this HashingTF.

The error you're seeing above is because the HashingTF in mllib expects each record to be something Iterable, such as a Seq of terms, and a LabeledPoint isn't that.
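For reference, here is roughly the Scala example from that docs page (it assumes a SparkSession named spark; the data and column names are just illustrative):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// A toy DataFrame with a label column and a raw-text column.
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words, then hash the words into a
// fixed-size feature vector and rescale with IDF.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()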

Re: Hashing trick

Hello,

Now I can perform the hashing trick properly. I have tried:

val documents: RDD[Seq[String]] = sc.textFile("hdfs:///tmp/new_cromosoma12v2.csv").map(_.split(",").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)

val splits = tfidfIgnore.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 8
val maxBins = 32

val trainingData2 = LabeledPoint(0.0, trainingData.collect())

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

But now I have the following error:

found   : Array[org.apache.spark.mllib.linalg.Vector]
required: org.apache.spark.mllib.linalg.Vector
       val trainingData2 = LabeledPoint(0.0, trainingData.collect())

Do you know what I can do?
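My reading of the error is that LabeledPoint expects a single Vector while trainingData.collect() returns an Array[Vector], so perhaps I need to build one LabeledPoint per record instead of collecting. Here is a rough sketch of what I mean, although I am not sure it is correct (labels and labeledData are names I made up, and it assumes the class label sits in the last CSV column and that both reads of the file keep the same record order):

// Re-read the labels from the last column of the same file.
val labels: RDD[Double] =
  sc.textFile("hdfs:///tmp/new_cromosoma12v2.csv")
    .map(_.split(',').last.toDouble)

// Pair each label with its hashed feature vector: one LabeledPoint per record.
// zip assumes both RDDs have the same partitioning and ordering.
val labeledData: RDD[LabeledPoint] =
  labels.zip(tf).map { case (label, features) => LabeledPoint(label, features) }

// Split and train on RDD[LabeledPoint], which is what trainClassifier expects.
val Array(trainingData2, testData2) = labeledData.randomSplit(Array(0.7, 0.3))
val model = RandomForest.trainClassifier(trainingData2, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)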

Thanks

Laia
