Support Questions

Getting Error while executing this command

lakumivnarayana · 12-11-2015

rdd = sc.parallelize(r1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  c = list(c)  # Make it a list so we can compute its length
TypeError: 'PipelinedRDD' object is not iterable

My commands are:

>>> R = sc.textFile(filename);
>>> R.collect()
>>> r1 = R.map(lambda s: s.split(","))
>>> r1.collect()
>>> rdd = sc.parallelize(r1)

ofermend · 12-11-2015

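R is an RDD, and map() returns a new RDD, so r1 is also an RDD.

parallelize() expects a local Python collection, such as a list, and distributes it to create an RDD. You are passing it an RDD instead, which cannot be iterated on the driver, hence the TypeError. Since r1 is already distributed, drop the parallelize() call and use r1 directly.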
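
A minimal sketch of the corrected flow, assuming a live SparkContext is passed in as sc (as in the pyspark shell). The helper name build_rdds is hypothetical and only there for illustration.

```python
def build_rdds(sc, filename):
    """Split each line of a text file on commas, without parallelizing an RDD."""
    lines = sc.textFile(filename)                 # RDD of lines
    fields = lines.map(lambda s: s.split(","))    # still an RDD; no parallelize() needed

    # parallelize() is only for local Python collections on the driver.
    local_rows = fields.collect()                 # plain Python list
    from_list = sc.parallelize(local_rows)        # valid: the input is a list

    return fields, from_list
```

In the shell this would be called as fields, from_list = build_rdds(sc, filename). The collect() followed by parallelize() round trip is shown only to illustrate what parallelize() accepts; in practice you would keep working with fields.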

gbraccialli3 · 12-11-2015

Additionally, if you want to change the number of partitions (and therefore the parallelism) of an existing RDD, you can use repartition(), which returns a new RDD rather than modifying the original:

rdd.repartition(8)
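
A short sketch of how the result might be captured and checked, assuming rdd is an existing PySpark RDD. The helper name repartition_rdd is hypothetical.

```python
def repartition_rdd(rdd, num_partitions):
    """Return a new RDD spread over num_partitions partitions."""
    before = rdd.getNumPartitions()
    # RDDs are immutable: repartition() returns a new RDD, so keep the result.
    resized = rdd.repartition(num_partitions)
    print(f"partitions: {before} -> {resized.getNumPartitions()}")
    return resized
```

In the shell this would be called as rdd = repartition_rdd(r1, 8). Note that repartition() always shuffles the data; when you only need fewer partitions, coalesce() can avoid a full shuffle.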

See the comments and tests from here: https://community.hortonworks.com/questions/5825/best-way-to-select-distinct-values-from-multiple-c....
