Created 11-08-2016 02:58 PM
This statement for bigram generation always gives the right result, no matter how much paralleliziation provided. Why does it always give the correct result? Due to sliding itself? Parallel processing for adding +1 to a number is easy to understand, but this here has to do with data over partitions.
Reading a big file or mapPartitions approach will result in minor loss of accuracy (verified practically although obvious), but why not here? It must be simple, but I cannot see it.
val rdd = sc.parallelize(Array("A", "B", "C", "D", "E", "F"),5) rdd.sliding(2).collect()
Created 11-08-2016 06:31 PM
@Gerard Alexander sliding() keeps track of the partition index, which in this case corresponds to the ordering of the unigrams. Compare rdd.mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() and rdd.sliding(2).mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() to help with the intuition.
Created 11-08-2016 06:31 PM
@Gerard Alexander sliding() keeps track of the partition index, which in this case corresponds to the ordering of the unigrams. Compare rdd.mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() and rdd.sliding(2).mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() to help with the intuition.
Created 11-08-2016 06:47 PM
Fantastic. OK, I sort of stated "due to sliding itself" but could not in fact make the link. Great stuff.
Created 11-09-2016 11:41 AM
So, just to finish the dicussion I have an inferior sliding algorithm like this using file parallelization:
val sentences = sc.textFile("/FileStore/tables/flcpmtie1478689806159/small_file.txt",5) val bigrams = sentences.map(sentence => sentence.trim.split(' ')).flatMap( wordList => for (i <- List.range(0, (wordList.length - 1))) yield ((wordList(i), wordList(i + 1)), 1) )
I also always get the correct bigrams, except on line boundaries, but leaving that aside. The code is as suggested by someone else.
This leads me to the idea that flatMap will also present the data in sequence of the partitions as you previously stated. Or is this not so.
Then, do these things all work in parallel as per map? Or is it sequentially looking for the boundaries? Or a combination of both?
I actually thought I might get an error with the above when using partitions.