Archives of Support Questions (Read Only)

lakumivnarayana · ‎12-11-2015

rdd = sc.parallelize(r1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  c = list(c)  # Make it a list so we can compute its length
TypeError: 'PipelinedRDD' object is not iterable

~~~~~~~~~~~~~~~~~My commands are ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>>> R = sc.textFile(filename);
>>> R.collect()
>>> r1 = R.map(lambda s: s.split(","))
>>> r1.collect()
>>> rdd = sc.parallelize(r1)

ofermend · ‎12-11-2015

R is an RDD. So r1 is also an RDD.

So you are trying to call "parallelize()" on an RDD, where you should not do that. Usually, use parallelize() on a local python object, like a list.

View solution in original post

ofermend · ‎12-11-2015

R is an RDD. So r1 is also an RDD.

So you are trying to call "parallelize()" on an RDD, where you should not do that. Usually, use parallelize() on a local python object, like a list.

gbraccialli3 · ‎12-11-2015

Additionally, if you want to change number of partitions (and then parallelism) of an existing RDD, you can use

rdd.repartition(8)

See the comments and tests from here: https://community.hortonworks.com/questions/5825/best-way-to-select-distinct-values-from-multiple-c....

Cloudera Community

Archives of Support Questions (Read Only)

Getting Error while executing this command