Created 12-11-2015 11:12 AM
rdd = sc.parallelize(r1) Traceback (most recent call last): File "<stdin>", line 1, in <module> c = list(c) # Make it a list so we can compute its length TypeError: 'PipelinedRDD' object is not iterable
~~~~~~~~~~~~~~~~~My commands are ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> R = sc.textFile(filename); >>> R.collect() >>> r1 = R.map(lambda s: s.split(",")) >>> r1.collect() >>> rdd = sc.parallelize(r1)
Created 12-11-2015 09:55 PM
R is an RDD. So r1 is also an RDD.
So you are trying to call "parallelize()" on an RDD, where you should not do that. Usually, use parallelize() on a local python object, like a list.
Created 12-11-2015 09:55 PM
R is an RDD. So r1 is also an RDD.
So you are trying to call "parallelize()" on an RDD, where you should not do that. Usually, use parallelize() on a local python object, like a list.
Created 12-11-2015 10:36 PM
Additionally, if you want to change number of partitions (and then parallelism) of an existing RDD, you can use
rdd.repartition(8)
See the comments and tests from here: https://community.hortonworks.com/questions/5825/best-way-to-select-distinct-values-from-multiple-c....