Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions

Title | Views | Posted
---|---|---
 | 2984 | 01-26-2018 04:02 AM
 | 6276 | 12-22-2017 09:18 AM
 | 3016 | 12-05-2017 06:13 AM
 | 3279 | 10-16-2017 07:55 AM
 | 9297 | 10-04-2017 08:08 PM
07-21-2017 01:35 PM
(Spark 2.2 at this point?) The services you mention are really just the history server services. Yes, you can run Spark 1 and Spark 2 as parallel installations. You probably want both gateway roles on all nodes from which you intend to submit both types of Spark jobs, but that's up to you. There isn't much else to manage: the only running service is the history server, and it's a simple creature that CM manages. All the scripts are named differently (spark2-submit vs spark-submit), so there should be no conflict.
07-21-2017 01:33 PM
It's as integrated and supported as anything else. If you mean replacing Spark 1: no, that's not possible, as it would break any apps using Spark 1. If you mean you just don't want a separate add-on, I get it, but it's only a delivery mechanism. C6 would not have both.
07-21-2017 05:50 AM
You can execute SQL statements in Pyspark, against the same metastore and the same data you access from Hive or Impala (see the sketch below).

I think one of the premises of the workbench is: edit code, not notebooks, because that makes it much more realistic to create code that's then used in production; the translation step is an obstacle. I personally think you should use your IDE for non-interactive software development and use the workbench for the interactive parts, all within one project. This was my take on it, for Scala: https://github.com/srowen/cdsw-simple-serving

I think Jupyter is harder to fit into this vision because it operates in terms of notebooks, not code, at heart; Zeppelin less so. So I think we're actually aligned, and the workbench is trying to do what you want. But that's the answer we provide: you can use Zeppelin, but you're on your own, and if that's a little tricky, well, that's part of the point.
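For example, a minimal Pyspark sketch of running SQL against the shared metastore (the database and table names here are placeholders for whatever exists in your environment):

```python
from pyspark.sql import SparkSession

# Hive support makes the shared metastore (the same one Hive and
# Impala query) visible to Spark SQL.
spark = (SparkSession.builder
         .appName("workbench-sql-example")
         .enableHiveSupport()
         .getOrCreate())

# Placeholder database/table; any SQL you'd run in Hive works here.
df = spark.sql("SELECT some_col, COUNT(*) AS n "
               "FROM my_db.my_table GROUP BY some_col")
df.show()
```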
07-21-2017 02:34 AM
HUE is the supported and recommended tool for SQL (Impala, Hive); the HUE notebook is not supported. The workbench is the supported and recommended tool for Spark, Python, R, and Scala. Kerberos and security work. Zeppelin and Jupyter are not supported, and it's safe to say there are no plans to support them. What features are you looking for? HUE + workbench should cover everything you mention, and I don't know of a difference with Zeppelin in this respect. What's a blue elephant guy?
06-27-2017 05:04 AM
(Please start a new thread.) Yes, all scores are cumulative, added across all input. I'm not sure what your use case is, but what you're suggesting is how it works: submitting (user, item, 1) adds 1 to the total strength of that user-item interaction.
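As an illustration, a sketch of feeding input to Oryx with kafka-python; the broker address and topic name are assumptions, and I'm assuming the usual CSV input form of user,item,strength. The point is that the two identical records below accumulate to a total strength of 2 for the pair:

```python
from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic name are assumptions for illustration.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Sending the same (user, item, 1) twice accumulates to a
# total interaction strength of 2 for (u1, i1).
producer.send("OryxInput", b"u1,i1,1")
producer.send("OryxInput", b"u1,i1,1")
producer.flush()
```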
06-19-2017 07:47 AM
Spark deals with arbitrary data, so its notion of partitions is not tied to data that contains a key. However, it's almost surely true that one key-based partition of data in, say, Parquet will map to one (or more) partitions of a DataFrame that holds the data with that key.
06-19-2017 07:36 AM
If you mean partitions in the sense of Parquet/Avro partitioning by some key, it should be possible to preserve that this way. In the general case of things like text files, a file is already a partition.
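A minimal Pyspark sketch of preserving a key-based layout on write; the paths and the column name are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preserve-partitions").getOrCreate()

# Hypothetical input path: Hive-style key-partitioned Parquet data.
df = spark.read.parquet("/data/events")

# Writing with partitionBy recreates the key-based directory layout
# (one subdirectory per value of the hypothetical event_date column).
df.write.partitionBy("event_date").parquet("/data/events-out")
```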
06-19-2017 06:55 AM
It should be pretty trivial to read the data in format X into a DataFrame or Dataset with Spark, repartition it to a smaller number of partitions, and write it back out in format X with Spark. The round trip ought not to change the data, though that's worth verifying. It should, however, always result in fewer and therefore larger files.
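A minimal sketch of that round trip in Pyspark; the paths and the target partition count are placeholders, and Parquet stands in for format X. coalesce is used because it avoids a full shuffle when only reducing the number of partitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-files").getOrCreate()

# Read the many small files (Parquet here as an example of format X).
df = spark.read.parquet("/data/many-small-files")

# Fewer partitions on write means fewer, larger output files.
df.coalesce(16).write.parquet("/data/fewer-larger-files")
```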
06-12-2017 05:39 AM
Will it affect recommendations? Yes. However, it doesn't affect the results much until you turn the LSH value down a lot; for example, you may find that 0.1 still yields good recommendations. With 1.7M items, though, you should find it's already pretty fast. See http://oryx.io/docs/performance.html . Even with 250 latent features, at LSH = 0.3, you could probably serve ~100 qps on one modern server with ~15 ms latency.
05-22-2017 02:13 AM
From other sources, I see notes about the incompatibility. It sounds like the 0.10.2 release was fixed to be compatible across maintenance releases? So if the project used 0.10.2, I think that would work for you with all 0.10.x brokers.