Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
Title | Views | Posted
---|---|---
| 2024 | 01-26-2018 04:02 AM
| 3946 | 12-22-2017 09:18 AM
| 1997 | 12-05-2017 06:13 AM
| 2257 | 10-16-2017 07:55 AM
| 5905 | 10-04-2017 08:08 PM
01-26-2018
05:41 AM
It looks like you didn't install some package that your notebook requires.
01-26-2018
04:02 AM
1 Kudo
These are well-known feature requests, and ones I share. I don't know that they're planned for any particular release, but I'm sure they're already tracked as possible features.
12-29-2017
05:29 AM
1 Kudo
You can make a DataFrame over all the files and then filter out the lines you don't want, or make a DataFrame for just the files you want and then union them together. Both are viable. If you're saying different data types are mixed into sections of each file, that's harder, as you'd need to use something like mapPartitions to carefully process each file 3 times.
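Roughly like this in PySpark (paths and the filter condition are just placeholders):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Option 1: read all the files at once, then filter out unwanted lines
all_lines = spark.read.text("data/*.txt")
wanted = all_lines.filter(~F.col("value").startswith("#"))

# Option 2: read only the files you care about, then union them
part_a = spark.read.text("data/part-a.txt")
part_b = spark.read.text("data/part-b.txt")
combined = part_a.union(part_b)
```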
12-27-2017
05:58 AM
Not sure what you're trying to do there, but it looks like you have a simple syntax error. bucketBy is a method. Please start with the API docs first.
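For reference, bucketBy is called on the DataFrame writer; in PySpark (recent 2.x versions) a call looks roughly like this, with made-up table and column names:

```python
# bucketBy is a DataFrameWriter method and only works with saveAsTable
(df.write
   .bucketBy(4, "user_id")
   .sortBy("user_id")
   .saveAsTable("bucketed_users"))
```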
12-27-2017
05:57 AM
I think you mean something like df.write.mode(SaveMode.Overwrite).saveAsTable(...) ? Depends on what language this is.
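In PySpark, for instance, the equivalent would be something like this (table name made up):

```python
# PySpark equivalent of .mode(SaveMode.Overwrite) in Scala
df.write.mode("overwrite").saveAsTable("my_table")
```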
12-26-2017
06:31 PM
I'm not sure what you're asking here. I have verified the project build works and all tests pass. Follow the tutorial at http://oryx.io/docs/endusers.html to get a working instance and take it from there.
12-26-2017
06:51 AM
You have to have an initial model before anything works. After that, of course, model scoring happens in real time and updates happen in near-real-time. I'm not sure what you mean in your second point. The word count example is correct. It's counting unique co-occurrences of words. If there is just one word on a line, there are no co-occurrences to count.
12-26-2017
06:40 AM
No need to ping. As far as I know, nobody certifies pandas-Spark integration. We support PySpark, which has a minimal integration with pandas (e.g. the toPandas method). If there were a PySpark-side issue we'd try to fix it, but we don't support pandas itself.
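For example, the usual hand-off looks like this (a minimal sketch; the DataFrame is assumed small enough to collect to the driver):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.range(10)  # any small Spark DataFrame

# toPandas() collects the whole DataFrame to the driver as a pandas DataFrame,
# so the driver needs pandas installed and enough memory for the result
pdf = spark_df.toPandas()
print(pdf.head())
```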
12-23-2017
06:19 AM
No, because the speed layer also can't produce model updates unless it first sees a model. What do you mean that you can't see the data -- in what place?
12-22-2017
09:14 PM
It could be a lot of things, but keep in mind that you will not see any model updates unless you have a batch layer running, and it has had time to compute a model and send it to the serving layer. If the batch layer is running, check if it is able to produce a model. It would not with only 2 data points.
12-22-2017
09:13 PM
I'll answer in the other duplicated post.
12-22-2017
09:18 AM
1 Kudo
This looks like a mismatch between the version of pandas that Spark uses on the driver and whatever is installed on the executors.
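One way to confirm (a diagnostic sketch, not specific to any particular setup) is to print the pandas version seen on the driver and on an executor:

```python
from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.getOrCreate()
print("driver pandas:", pandas.__version__)

def executor_pandas_version(_):
    # Imported inside the function so it resolves on the executor, not the driver
    import pandas as pd
    return pd.__version__

# Run one task on an executor and report the pandas version it sees
print("executor pandas:",
      spark.sparkContext.parallelize([0], 1).map(executor_pandas_version).collect())
```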
12-15-2017
11:45 AM
If you're asking about EMR, this is the wrong place -- that's an Amazon product.
12-15-2017
07:55 AM
If you observe that no jobs start for an extended period of time, then I'd figure the driver is stuck listing S3 files. If you find the stage that reads the data from S3 is taking a long time, then I'd figure it's the read from S3 itself. It's not certain that's the issue, but it's certainly the place I'd look first. Because the S3 support basically comes from Hadoop, you might look at https://wiki.apache.org/hadoop/AmazonS3 Are you using an s3a:// URI?
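For example, a read through the s3a connector looks like this (bucket and path are placeholders, and credentials/Hadoop S3 config are assumed to be set up already):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Note the s3a:// scheme, not s3:// or s3n://
df = spark.read.parquet("s3a://my-bucket/path/to/data/")
```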
12-14-2017
04:00 PM
You're almost certainly bottlenecked on reading from S3, or listing the S3 directories, or both. You'd have to examine the stats in the Spark UI to really know. Given your simple job, it's almost certainly nothing to do with memory, CPU, or the shuffle. Try parallelizing more, with more, smaller input files, or by splitting across more workers to make more bandwidth available. I'm also not clear whether you're using a Cloudera cluster here.
12-12-2017
05:00 AM
Have a look at https://github.com/sryza/spark-timeseries for time series on Spark.
12-05-2017
06:13 AM
It now installs using Cloudera Manager, so yes you want the host to be part of the CM cluster to assign it to the workbench.
11-24-2017
04:17 AM
CDSW just uses the Jupyter kernel, so much of what you do in a Jupyter notebook works here too. Magics will work. There's also an option to suppress this output in what is shared via the "Share" link, but that's not quite what you mean.
11-22-2017
05:56 AM
1 Kudo
I'm not sure how you would do that. We support spark-submit and the Workbench, not Jupyter. It's clear how to configure spark-submit, and you configure the Workbench with spark-defaults.conf. You can see your Spark job's config in its UI, in the Environment tab.
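For example, from within a session you can dump the effective configuration, which is the same information the UI's Environment tab shows (a minimal sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Effective configuration for this job, same values shown in the UI's Environment tab
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)
```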
11-22-2017
05:42 AM
This has nothing to do with CM. It has to do with your app's memory configuration. The relevant settings are right there in the error.
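As an illustrative sketch only (the values below are placeholders; the right numbers depend on your app), the relevant properties are usually set when the job is launched:

```python
from pyspark.sql import SparkSession

# These normally need to be set at launch time (spark-submit or spark-defaults.conf),
# before the driver and executors start; the values here are only examples
spark = (SparkSession.builder
         .config("spark.driver.memory", "4g")
         .config("spark.executor.memory", "4g")
         .config("spark.yarn.executor.memoryOverhead", "512")  # on YARN, in MB
         .getOrCreate())
```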
11-17-2017
01:49 AM
If you have 4 topics with 3 partitions each, then you need 12 executor slots to process fully in parallel; you have only 3. If you are using receiver-based streaming you may need 1 more, too. Also, 1 core per executor is generally very low. Your result is therefore not surprising, and your second config is much more reasonable.
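For example (illustrative numbers only), you could get 12 concurrent task slots with 4 executors of 3 cores each:

```python
from pyspark.sql import SparkSession

# 4 topics x 3 partitions = 12 Kafka partitions, so 12 tasks can run in parallel.
# 4 executors x 3 cores = 12 slots. These are launch-time settings, shown here
# only to illustrate the arithmetic; the values are placeholders.
spark = (SparkSession.builder
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "3")
         .getOrCreate())
```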
11-14-2017
06:09 PM
You don't, in general. The cluster is where you run software rather than build it. It's more like a JRE environment than a JDK; lots of compiler tools aren't available, and generally on purpose.
11-13-2017
11:14 AM
1 Kudo
No, it requires Spark 2.
11-13-2017
08:38 AM
Caching is OK in that Spark won't use more than it's allowed to for caching, and you can turn that fraction down if your app is heavily using memory for other things. Have a look at spark.memory.fraction and spark.memory.storageFraction. However, that's only the issue if you're running out of memory on executors. If you're running out of driver memory, try retaining far fewer job history details: turn spark.ui.retained{Jobs,Stages,Tasks} way down to reduce that memory consumption. But the answer may simply be that you need more memory; I don't see evidence that 7G is necessarily enough, depending on what you are doing.
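As an illustrative sketch (the values are arbitrary; tune them for your workload), those properties look like this:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Shrink the share of heap Spark may use for execution/storage (default 0.6)
         .config("spark.memory.fraction", "0.5")
         # Shrink the portion of that reserved for cached data (default 0.5)
         .config("spark.memory.storageFraction", "0.3")
         # Retain far less job/stage/task history in the driver
         .config("spark.ui.retainedJobs", "100")
         .config("spark.ui.retainedStages", "100")
         .config("spark.ui.retainedTasks", "10000")
         .getOrCreate())
```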
11-13-2017
04:21 AM
You're simply running out of memory. A large portion of memory is dedicated to caching data in Spark, of course, and that explains why a lot of memory holds cached data; that's not necessarily the issue here. You may be retaining state on lots of jobs in the driver, and that's eating memory in the driver (it wasn't clear whether that's the heap you're showing). You can just increase memory, or look for ways to reduce memory usage.
11-12-2017
09:26 AM
Nothing about a cluster would prevent it from making external connections, but your firewall rules might. The variables you export here are not related to Spark. It's an error from the library you're using.
10-25-2017
10:27 AM
Oops, I meant to write S3 paths. Really, it's Hadoop and its APIs that support / don't support S3. It should be built into Hadoop distributions, however. I believe you might need an s3a:// protocol instead of s3://
10-25-2017
02:52 AM
Even I have forgotten exactly how it works off the top of my head, but yes, you are correct that you should be able to use HDFS paths. Yes it runs on Java 7 -- or 8, I believe, though I don't recall if that was tested. It doesn't require Java 8.
10-16-2017
07:55 AM
1 Kudo
Really, this is just saying you can upload data, at project creation time or later, from your local computer to the local file system that the Python/R/Scala sessions see. Those jobs then see the uploads as simple local files and can do what they like with them. But within the same program you can also access whatever data you want, anywhere you want; you just need to write code that does so, via Spark or whatever library you like. There is no either/or here.
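For example (the paths are made up), the same session can read an uploaded local file and also reach out to other data sources:

```python
import pandas as pd
from pyspark.sql import SparkSession

# A file uploaded to the project is just a local file from the session's point of view
local_df = pd.read_csv("data/uploaded_sample.csv")

# The same code can also fetch data from anywhere it can reach,
# e.g. via Spark (or any other library you choose)
spark = SparkSession.builder.getOrCreate()
remote_df = spark.read.csv("hdfs:///warehouse/big_table.csv", header=True)
```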
10-06-2017
12:05 AM
Are you looking for the .jar files that were produced as part of the release? Those are still in the repo and will stay there indefinitely as far as I know, just because they could be part of people's builds: https://repository.cloudera.com/artifactory/cloudera-repos/com/cloudera/oryx/