Created on 08-05-2015 12:31 PM - edited 09-16-2022 02:36 AM
I am trying to verify cogroup join and groupByKey for PairRDDs. I could check that in Spark Java API. But, cannot do it with scala project
Below is the simple code that i tried, let me know where i made mistake.
object PairsCheck {
def main(args: Array[String]) = {
val conf = new SparkConf;
val sc = new SparkContext(conf)
val lines = sc.textFile("/home/test1.txt")
val lines2 = sc.textFile("/home/test2.txt")
val words = lines.flatMap { x => x.split("\\W+") }
val words2 = lines2.flatMap { x => x.split("\\W+") }
val pairs: RDD[(Int, String)] = words.map {case(x) => (x.length(), x) }
val pairs2: RDD[(Int, String)] = words2.map {case(x) => (x.length(), x) }
import org.apache.spark.SparkContext._
// --> Here i tried to call join/co group functions that applies for pairsRDD, but could not do that. If i call join, it is throwing error.
Thank you in advance.
Created 08-14-2015 07:42 AM
Hi,
To create a pair-RDD from a RDD, I used the "keyBy" transformation to extract the key from each value:
val fileC = sc.textFile("hdfs://.../user/.../myfile.txt") .keyBy(line => line.substring(5,13).trim()) .mapValues(line => ( line.substring(87,92).trim() , line.substring(99,112).trim() , line.substring(120,126).trim() , line.substring(127,131).trim() ) )
The "keyBy" provides me a new pair-RDD for which the key is a substring of my text value.
Then the "mapValues" transformations opers like a "map" one on each value of my pair-RDD, not on keys...
Created 08-14-2015 07:42 AM
Hi,
To create a pair-RDD from a RDD, I used the "keyBy" transformation to extract the key from each value:
val fileC = sc.textFile("hdfs://.../user/.../myfile.txt") .keyBy(line => line.substring(5,13).trim()) .mapValues(line => ( line.substring(87,92).trim() , line.substring(99,112).trim() , line.substring(120,126).trim() , line.substring(127,131).trim() ) )
The "keyBy" provides me a new pair-RDD for which the key is a substring of my text value.
Then the "mapValues" transformations opers like a "map" one on each value of my pair-RDD, not on keys...