Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3440 | 01-26-2018 04:02 AM
 | 7076 | 12-22-2017 09:18 AM
 | 3535 | 12-05-2017 06:13 AM
 | 3847 | 10-16-2017 07:55 AM
 | 11192 | 10-04-2017 08:08 PM
08-24-2015
12:00 AM
Actually, I don't know the exact reasons; I was stuck on this problem for a few days, even with the firewalls on all machines disabled from the very start. I used to deploy Hadoop, Spark, and so on by extracting source tarballs. Fortunately, an edge node seems to be a good way to access cluster resources.
08-06-2015
02:36 AM
I don't think it has to do with functional programming per se, but yes, it's because the function/code being executed has to be sent from the driver to the executors, and so the function object itself must be serializable. It has no relation to security.
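As a hedged illustration (the class names here are invented, not from the thread): the usual way this surfaces in Spark is that referencing a constructor parameter inside an RDD closure captures the whole enclosing object, which then fails to serialize.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical example: `factor` is a constructor parameter, so using it in
// the closure captures `this`, and Multiplier itself is not Serializable.
class Multiplier(factor: Int) {
  def scale(rdd: RDD[Int]): RDD[Int] =
    rdd.map(_ * factor)  // fails at runtime: task not serializable
}

// Copying the value to a local val means only that val is captured,
// so nothing non-serializable is shipped from the driver to the executors.
class FixedMultiplier(factor: Int) {
  def scale(rdd: RDD[Int]): RDD[Int] = {
    val f = factor
    rdd.map(_ * f)
  }
}
```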
08-05-2015
11:05 AM
Calling persist() on an RDD only marks it for persistence; the data is actually persisted later, when something causes the RDD to be computed for the first time. It is not immediately evaluated.
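A minimal sketch, assuming an existing SparkContext `sc` and a made-up input path:

```scala
// persist() is lazy: it marks the RDD for caching but computes nothing yet
val lengths = sc.textFile("data.txt").map(_.length)
lengths.persist()

val n1 = lengths.count()  // first action: computes the RDD and fills the cache
val n2 = lengths.count()  // second action: served from the persisted copy
```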
07-27-2015
04:16 AM
The first case is: read - shuffle - persist - count. The second case is: read (from the persisted copy) - count. You are right that coalesce does not always shuffle, but it may in this case; it depends on whether you started with more or fewer partitions. You should look at the Spark UI to see whether a shuffle occurred.
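A small sketch of the distinction, assuming an existing RDD named `rdd` (the partition counts are invented):

```scala
val fewer = rdd.coalesce(10)                  // reducing partitions: can avoid a shuffle
val more  = rdd.coalesce(200, shuffle = true) // growing partitions: requires a shuffle

fewer.persist()
fewer.count()  // first count: read - (possible) shuffle - persist - count
fewer.count()  // second count: read from the persisted copy - count
```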
07-26-2015
02:20 AM
I don't think that was the problem; I changed the code as below and it worked. The issue was in the toDF method:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkConf
import sys.process._

class cc extends Runnable {
  val conf = new SparkConf().setAppName("LoadDW")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  override def run(): Unit = {
    var fileName = "DimCustomer.txt"
    val fDimCustomer = sc.textFile("DimCustomer.txt")

    val schemaString = "ID Name City EffectiveFrom EffectiveTo"

    // Build the DataFrame from an explicit schema instead of relying on toDF
    val schema = StructType(List(
      StructField("ID", IntegerType, true),
      StructField("Name", StringType, true),
      StructField("City", StringType, true),
      StructField("EffectiveFrom", IntegerType, true),
      StructField("EffectiveTo", IntegerType, true)
    ))

    println("----->>>>>>sdsdsd2222\n")

    var dimCustomerRDD = fDimCustomer.map(_.split(','))
      .map(r => Row(r(0).toInt, r(1), r(2), r(3).toInt, r(4).toInt))

    var customerDataFrame = sqlContext.createDataFrame(dimCustomerRDD, schema)
    customerDataFrame.registerTempTable("Cust_1")

    val customers = sqlContext.sql("select * from Cust_1")
    customers.show()
    println("+")
  }
}

object pp extends App {
  val cp = new cc()
  val rThread = new Thread(cp)
  rThread.start()
}
```
07-02-2015
12:14 AM
It is just polling HDFS for new files, on the order of every ~5 minutes or so. No, that message is exactly from this process of refreshing the model by looking for a new one. "No available generation" means no models have been built yet. There's a delay between the time new data arrives -- which could include a new user or item -- and when it is incorporated into a model. It could be a long time, depending on how long your model builds take. When a new model arrives, you can't just drop all existing users, since the new model won't have any info about very new users or items. This mechanism helps keep track of which users/items should be retained in memory even if they do not exist in the new model. The new model replaces the old one user-by-user and item-by-item rather than by loading an entire new model at once. Yes, you have a state with old and new data at once, but this is fine for recommendations; they're not incompatible. It's just the current and newer state of an estimate of the user/item vectors.
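For illustration only, a hypothetical sketch (not the project's actual code) of what a user-by-user swap with retention of very new users might look like:

```scala
// Hypothetical merge: new vectors overwrite old ones entry by entry, and
// users too new to appear in the incoming model are kept if recently seen.
def mergeModels(current: Map[String, Array[Float]],
                incoming: Map[String, Array[Float]],
                recentlySeen: Set[String]): Map[String, Array[Float]] = {
  val updated = current ++ incoming  // replace user-by-user, not wholesale
  updated.filter { case (user, _) =>
    incoming.contains(user) || recentlySeen.contains(user)  // retain new users
  }
}
```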
06-29-2015
07:02 PM
Thank you Sean for the answer. I actually misspoke and just need to upgrade to Spark 1.3 (I'm using Spark 1.2). I've been trying to use this guide: https://s3.amazonaws.com/quickstart-reference/cloudera/hadoop/latest/doc/Cloudera_EDH_on_AWS.pdf But I am still only getting Spark 1.2. Do you have any suggestions on how I can use this guide to get Spark 1.3?
06-29-2015
08:01 AM
Cool. I just read your changes, and it seems they only impact the local computation (not the Hadoop computation). Correct? Yes, I know the Hadoop computation is already doing the right thing and needs no fix.
06-23-2015
09:16 AM
Hi Sean, You are right. It has to do with config. I have figured it out. Thanks so much! Ying