Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Getting Error while executing this command

rdd = sc.parallelize(r1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  c = list(c)  # Make it a list so we can compute its length
TypeError: 'PipelinedRDD' object is not iterable

~~~~~~~~~~~~~~~~~My commands are ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> R = sc.textFile(filename);
>>> R.collect()
>>> r1 = R.map(lambda s: s.split(","))
>>> r1.collect()
>>> rdd = sc.parallelize(r1)
1 ACCEPTED SOLUTION

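R is an RDD, so r1, which is the result of R.map(), is also an RDD.

You are calling parallelize() on an RDD, which does not work. parallelize() expects a local Python collection, such as a list, and distributes it to create an RDD. r1 is already distributed, so you can use it directly.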
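
To make the distinction concrete, here is a minimal sketch, assuming `sc` is the SparkContext that the pyspark shell provides. The helper names (`split_fields`, `rdd_from_file`, `rdd_from_local`) are made up for illustration and are not part of Spark.

```python
def split_fields(line):
    """Split one comma-separated line into a list of fields."""
    return line.split(",")


def rdd_from_file(sc, filename):
    """Build an RDD of field lists from a text file.

    textFile() already returns an RDD and map() returns another RDD,
    so no parallelize() call is needed here.
    """
    return sc.textFile(filename).map(split_fields)


def rdd_from_local(sc, rows):
    """Build an RDD from data that lives in the driver, such as a list.

    This is the case parallelize() is meant for. Passing an RDD here
    would fail the same way as in the question.
    """
    return sc.parallelize(rows)
```

In the shell from the question, `r1 = rdd_from_file(sc, filename)` is already an RDD and can be used as-is. If you really need to go through parallelize(), pass it a list, for example `sc.parallelize(r1.collect())`, keeping in mind that collect() pulls all the data to the driver first.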

2 REPLIES

Additionally, if you want to change the number of partitions (and therefore the parallelism) of an existing RDD, you can use

rdd.repartition(8)

See the comments and tests here: https://community.hortonworks.com/questions/5825/best-way-to-select-distinct-values-from-multiple-c....
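
One detail worth spelling out about repartition(): it does not change the RDD it is called on. It returns a new RDD, so the result has to be assigned to a name. A small sketch, where `repartition_and_report` is a hypothetical helper name and `rdd` is any existing RDD:

```python
def repartition_and_report(rdd, num_partitions=8):
    """Return (new_rdd, partitions_before, partitions_after).

    RDDs are immutable, so repartition() hands back a new RDD and
    leaves the original with its old partition count.
    """
    before = rdd.getNumPartitions()
    new_rdd = rdd.repartition(num_partitions)
    after = new_rdd.getNumPartitions()
    return new_rdd, before, after
```

With the RDD from the question this could be called as `r1, before, after = repartition_and_report(r1, 8)`. Calling `r1.repartition(8)` on its own line and discarding the result leaves `r1` unchanged.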