Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
Title | Views | Posted
---|---|---
| 2024 | 01-26-2018 04:02 AM
| 3946 | 12-22-2017 09:18 AM
| 1997 | 12-05-2017 06:13 AM
| 2257 | 10-16-2017 07:55 AM
| 5905 | 10-04-2017 08:08 PM
01-26-2018
05:41 AM
It looks like you didn't install some package that your notebook requires.
01-26-2018
04:02 AM
1 Kudo
These are well-known feature requests, and ones I share. I don't know that they're planned for any particular release, but I'm sure they're already tracked as possible features.
12-29-2017
05:29 AM
1 Kudo
You can make a DataFrame over all the files and then filter out the lines you don't want, or make a DataFrame for just the files you want and then union them together. Both are viable. If you're saying different data types are mixed into sections of each file, that's harder, as you'd need to use something like mapPartitions to carefully process each file 3 times.
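Roughly like this in PySpark (paths and the filter condition are just placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Option 1: read all the files at once, then filter out unwanted lines
all_lines = spark.read.text("data/*.txt")
wanted = all_lines.filter(~F.col("value").startswith("#"))

# Option 2: read only the files you care about, then union them
part_a = spark.read.text("data/part-a.txt")
part_b = spark.read.text("data/part-b.txt")
combined = part_a.union(part_b)
```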
12-27-2017
05:58 AM
Not sure what you're trying to do there, but it looks like you have a simple syntax error. bucketBy is a method. Please start with the API docs first.
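For reference, bucketBy is called on the DataFrame writer; in PySpark (recent 2.x versions) a call looks roughly like this, with made-up table and column names:

```python
# bucketBy is a DataFrameWriter method and only works with saveAsTable
(df.write
   .bucketBy(4, "user_id")
   .sortBy("user_id")
   .saveAsTable("bucketed_users"))
```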
12-27-2017
05:57 AM
I think you mean something like df.write.mode(SaveMode.Overwrite).saveAsTable(...) ? Depends on what language this is.
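In PySpark, for instance, the equivalent would be something like this (table name made up):

```python
# PySpark equivalent of .mode(SaveMode.Overwrite) in Scala
df.write.mode("overwrite").saveAsTable("my_table")
```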
12-26-2017
06:31 PM
I'm not sure what you're asking here. I have verified the project build works and all tests pass. Follow the tutorial at http://oryx.io/docs/endusers.html to get a working instance and take it from there.
12-26-2017
06:51 AM
You have to have an initial model before anything works. After that, of course, model scoring happens in real time and updates happen in near-real-time. I'm not sure what you mean in your second point. The word count example is correct. It's counting unique co-occurrences of words. If there is just one word on a line, there are no co-occurrences to count.
12-26-2017
06:40 AM
No need to ping. As far as I know, nobody certifies pandas-Spark integration. We support PySpark, which has a minimal integration with pandas (e.g. the toPandas method). If there were a PySpark-side issue we'd try to fix it, but we don't support pandas itself.
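For example, the usual hand-off looks like this (a minimal sketch; the DataFrame is assumed small enough to collect to the driver):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.range(10)  # any small Spark DataFrame

# toPandas() collects the whole DataFrame to the driver as a pandas DataFrame,
# so the driver needs pandas installed and enough memory for the result
pdf = spark_df.toPandas()
print(pdf.head())
```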
12-23-2017
06:19 AM
No, because the speed layer also can't produce model updates unless it first sees a model. What do you mean that you can't see the data -- in what place?
12-22-2017
09:14 PM
It could be a lot of things, but keep in mind that you will not see any model updates unless you have a batch layer running, and it has had time to compute a model and send it to the serving layer. If the batch layer is running, check if it is able to produce a model. It would not with only 2 data points.
12-22-2017
09:13 PM
I'll answer in the other duplicated post.
12-22-2017
09:18 AM
1 Kudo
This looks like a mismatch between the version of pandas that Spark uses on the driver and whatever is installed on the executors.
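One way to confirm (a diagnostic sketch, not specific to any particular setup) is to print the pandas version seen on the driver and on an executor:

```python
from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.getOrCreate()
print("driver pandas:", pandas.__version__)

def executor_pandas_version(_):
    # Imported inside the function so it resolves on the executor, not the driver
    import pandas as pd
    return pd.__version__

# Run one task on an executor and report the pandas version it sees
print("executor pandas:",
      spark.sparkContext.parallelize([0], 1).map(executor_pandas_version).collect())
```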
12-15-2017
11:45 AM
If you're asking about EMR, this is the wrong place -- that's an Amazon product.
12-15-2017
07:55 AM
If you observe that no jobs start for an extended period of time, then I'd figure the driver is stuck listing S3 files. If you find the stage that reads the data from S3 is taking a long time, then I'd figure it's the read from S3 itself. It's not certain that's the issue, but it's certainly the place I'd look first. Because the S3 support basically comes from Hadoop, you might look at https://wiki.apache.org/hadoop/AmazonS3 Are you using an s3a:// URI?
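For example, a read through the s3a connector looks like this (bucket and path are placeholders, and credentials/Hadoop S3 config are assumed to be set up already):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Note the s3a:// scheme, not s3:// or s3n://
df = spark.read.parquet("s3a://my-bucket/path/to/data/")
```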
12-14-2017
04:00 PM
You're almost certainly bottlenecked on reading from S3, or listing the S3 directories, or both. You'd have to examine the stats in the Spark UI to really know. Given your simple job, it's almost certainly nothing to do with memory, CPU, or the shuffle. Try parallelizing more, with more, smaller input files, or by splitting across more workers to make more bandwidth available. I'm also not clear whether you're using a Cloudera cluster here.
12-12-2017
05:00 AM
Have a look at https://github.com/sryza/spark-timeseries for time series on Spark.
12-05-2017
06:13 AM
It now installs using Cloudera Manager, so yes you want the host to be part of the CM cluster to assign it to the workbench.
11-24-2017
04:17 AM
CDSW just uses the Jupyter kernel, so much of what you do in a Jupyter notebook works here too. Magics will work. There's also an option to suppress this output in what is shared via the "Share" link, but that's not quite what you mean.
11-22-2017
05:56 AM
1 Kudo
I'm not sure how you would do that. We support spark-submit and the Workbench, not Jupyter. It's clear how to configure spark-submit, and you configure the Workbench with spark-defaults.conf. You can see your Spark job's config in its UI, in the Environment tab.
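For example, from within a session you can dump the effective configuration, which is the same information the UI's Environment tab shows (a minimal sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Effective configuration for this job, same values shown in the UI's Environment tab
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)
```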
11-22-2017
05:42 AM
This has nothing to do with CM. It has to do with your app's memory configuration. The relevant settings are right there in the error.
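As an illustrative sketch only (the values below are placeholders; the right numbers depend on your app), the relevant properties are usually set when the job is launched:

```python
from pyspark.sql import SparkSession

# These normally need to be set at launch time (spark-submit or spark-defaults.conf),
# before the driver and executors start; the values here are only examples
spark = (SparkSession.builder
         .config("spark.driver.memory", "4g")
         .config("spark.executor.memory", "4g")
         .config("spark.yarn.executor.memoryOverhead", "512")  # on YARN, in MB
         .getOrCreate())
```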
11-17-2017
01:49 AM
If you have 4 topics with 3 partitions each, then you need 12 executor slots to process fully in parallel; you have only 3. If you are using receiver-based streaming you may need 1 more, too. Also, 1 core per executor is generally very low. Your result is therefore not surprising, and your second config is much more reasonable.
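For example (illustrative numbers only), you could get 12 concurrent task slots with 4 executors of 3 cores each:

```python
from pyspark.sql import SparkSession

# 4 topics x 3 partitions = 12 Kafka partitions, so 12 tasks can run in parallel.
# 4 executors x 3 cores = 12 slots. These are launch-time settings, shown here
# only to illustrate the arithmetic; the values are placeholders.
spark = (SparkSession.builder
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "3")
         .getOrCreate())
```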
11-14-2017
06:09 PM
You don't, in general. The cluster is where you run software rather than build it. It's more like a JRE environment than a JDK; lots of compiler tools aren't available, and generally on purpose.
11-13-2017
11:14 AM
1 Kudo
No, it requires Spark 2.
11-13-2017
08:38 AM
Caching is OK in that Spark won't use more than it's allowed to for caching, and you can turn that fraction down if your app is heavily using memory for other things. Have a look at spark.memory.fraction and spark.memory.storageFraction. However, that's only the issue if you're running out of memory on executors. If you're running out of driver memory, try retaining far fewer job history details: turn spark.ui.retained{Jobs,Stages,Tasks} way down to reduce that memory consumption. But the answer may simply be that you need more memory; I don't see evidence that 7G is necessarily enough, depending on what you are doing.
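As an illustrative sketch (the values are arbitrary; tune them for your workload), those properties look like this:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Shrink the share of heap Spark may use for execution/storage (default 0.6)
         .config("spark.memory.fraction", "0.5")
         # Shrink the portion of that reserved for cached data (default 0.5)
         .config("spark.memory.storageFraction", "0.3")
         # Retain far less job/stage/task history in the driver
         .config("spark.ui.retainedJobs", "100")
         .config("spark.ui.retainedStages", "100")
         .config("spark.ui.retainedTasks", "10000")
         .getOrCreate())
```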
11-13-2017
04:21 AM
You're simply running out of memory. A large portion of memory is dedicated to caching data in Spark, of course, and that explains why a lot of memory holds cached data; that's not necessarily the issue here. You may be retaining state on lots of jobs in the driver, and that's eating memory in the driver (it wasn't clear whether that's the heap you're showing). You can just increase memory, or look for ways to reduce memory usage.
11-12-2017
09:26 AM
Nothing about a cluster would prevent it from making external connections, but your firewall rules might. The variables you export here are not related to Spark. It's an error from the library you're using.
10-25-2017
10:27 AM
Oops, I meant to write S3 paths. Really, it's Hadoop and its APIs that support / don't support S3. It should be built into Hadoop distributions, however. I believe you might need an s3a:// protocol instead of s3://
10-25-2017
02:52 AM
Even I have forgotten exactly how it works off the top of my head, but yes, you are correct that you should be able to use HDFS paths. Yes it runs on Java 7 -- or 8, I believe, though I don't recall if that was tested. It doesn't require Java 8.
10-16-2017
07:55 AM
1 Kudo
Really, this is just saying you can upload data, at project creation time or later, from your local computer to the local file system that the Python/R/Scala sessions see. Those jobs then see the uploads as simple local files and can do what they like with them. But within the same program you can also access whatever data you want, anywhere you want; you just need to write code that does so, via Spark or whatever library you like. There is no either/or here.
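For example (the paths are made up), the same session can read an uploaded local file and also reach out to other data sources:

```python
import pandas as pd
from pyspark.sql import SparkSession

# A file uploaded to the project is just a local file from the session's point of view
local_df = pd.read_csv("data/uploaded_sample.csv")

# The same code can also fetch data from anywhere it can reach,
# e.g. via Spark (or any other library you choose)
spark = SparkSession.builder.getOrCreate()
remote_df = spark.read.csv("hdfs:///warehouse/big_table.csv", header=True)
```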
10-06-2017
12:05 AM
Are you looking for the .jar files that were produced as part of the release? Those are still in the repo and will stay there indefinitely as far as I know, just because they could be part of people's builds: https://repository.cloudera.com/artifactory/cloudera-repos/com/cloudera/oryx/