Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2275 | 01-26-2018 04:02 AM
 | 4655 | 12-22-2017 09:18 AM
 | 2242 | 12-05-2017 06:13 AM
 | 2525 | 10-16-2017 07:55 AM
 | 6687 | 10-04-2017 08:08 PM
01-26-2018
05:41 AM
It looks like you didn't install a package that your notebook requires.
01-26-2018
04:02 AM
1 Kudo
These are well-known feature requests, and ones I share. I don't know that they're planned for any particular release, but I'm sure they're already tracked as possible features.
12-29-2017
05:29 AM
1 Kudo
You can make a DataFrame over all the files and then filter out the lines you don't want. You can make a DataFrame for just the files you want, then union them together. Both are viable. If you're saying different data types are mixed into sections of each file, that's harder, as you need to use something like mapPartitions to carefully process each file 3 times.
12-27-2017
05:58 AM
Not sure what you're trying to do there, but it looks like a simple syntax error: bucketBy is a method, so it has to be called with arguments. Please start with the API docs first.
12-27-2017
05:57 AM
I think you mean something like df.write.mode(SaveMode.Overwrite).saveAsTable(...)? That depends on what language this is.
12-26-2017
06:31 PM
I'm not sure what you're asking here. I have verified the project build works and all tests pass. Follow the tutorial at http://oryx.io/docs/endusers.html to get a working instance and take it from there.
12-26-2017
06:51 AM
You have to have an initial model before anything works. After that, of course, model scoring happens in real time and updates happen in near-real-time. I'm not sure what you mean in your second point. The word count example is correct. It's counting unique co-occurrences of words. If there is just one word on a line, there are no co-occurrences to count.
12-26-2017
06:40 AM
No need to ping. As far as I know, nobody certifies pandas-Spark integration. We support PySpark. It has a minimal integration with pandas (e.g. the toPandas method). If there were a PySpark-side issue we'd try to fix it. But we don't support pandas.
12-23-2017
06:19 AM
No, because the speed layer also can't produce model updates unless it first sees a model. What do you mean that you can't see the data -- in what place?
12-22-2017
09:14 PM
It could be a lot of things, but keep in mind that you will not see any model updates unless you have a batch layer running, and it has had time to compute a model and send it to the serving layer. If the batch layer is running, check if it is able to produce a model. It would not with only 2 data points.
12-22-2017
09:13 PM
I'll answer in the other duplicated post.
12-22-2017
09:18 AM
1 Kudo
This looks like a mismatch between the version of pandas that Spark uses on the driver and whatever is installed with the workers on the executors.
12-15-2017
11:45 AM
If you're asking about EMR, this is the wrong place -- that's an Amazon product.
12-15-2017
07:55 AM
If you observe there are no jobs starting for an extended period of time, then I'd figure the driver is stuck listing S3 files. If you find the stage that reads the data from S3 is taking a long time, then I'd figure the S3 reads themselves are slow. It's not certain that's the issue, but it's certainly the place I'd look first. Because the S3 support basically comes from Hadoop, you might look at https://wiki.apache.org/hadoop/AmazonS3 Are you using an s3a:// URI?
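For reference, the s3a connector is typically wired up with Hadoop properties like these (bucket name and credentials here are placeholders, not from this thread):

```properties
# spark-defaults.conf style; fs.s3a.* are Hadoop properties, prefixed for Spark
spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY

# then read with an s3a:// URI rather than s3://
# e.g. spark.read.text("s3a://your-bucket/path/")
```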
12-14-2017
04:00 PM
You're almost certainly bottlenecked on reading from S3, or listing the S3 directories, or both. You'd have to examine the stats in the Spark UI to really know. Given your simple job, it's almost certainly nothing to do with memory, CPU, or the shuffle. Try parallelizing more: more, smaller input files, or splitting across more workers to make more bandwidth available. I'm also not clear whether you're using a Cloudera cluster here.
12-12-2017
05:00 AM
Have a look at https://github.com/sryza/spark-timeseries for time series on Spark.
12-05-2017
06:13 AM
It now installs using Cloudera Manager, so yes you want the host to be part of the CM cluster to assign it to the workbench.
11-24-2017
04:17 AM
CDSW just uses the Jupyter kernel, so much of what you do in a Jupyter notebook would work here too. Magics will work. There's also an option to suppress this output in what's shared via the "Share" link, but that's not quite what you mean.
11-22-2017
05:56 AM
1 Kudo
I'm not sure how you would do that. We support spark-submit and the Workbench, not Jupyter. It's clear how to configure spark-submit, and you configure the workbench with spark-defaults.conf. You can see your Spark job's config in its UI, in the environment tab.
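For example, the workbench picks up settings from a spark-defaults.conf file in the project; a couple of illustrative (not prescriptive) entries:

```properties
# spark-defaults.conf (values here are examples only)
spark.executor.memory=4g
spark.executor.cores=2
```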
11-22-2017
05:42 AM
This has nothing to do with CM. It has to do with your app's memory configuration. The relevant settings are right there in the error.
11-17-2017
01:49 AM
If you have 4 topics with 3 partitions each, then you need 12 executor slots to process fully in parallel. You have only 3 slots. If you are using receiver-based streaming, you may need 1 more, too. Also, 1 core per executor is generally very low. Your result is therefore not surprising, and your second config is much more reasonable.
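The capacity arithmetic above, spelled out with the numbers from this thread:

```python
# 4 topics x 3 partitions each = 12 Kafka partitions to consume in parallel
topics = 4
partitions_per_topic = 3
total_partitions = topics * partitions_per_topic

# 3 executors x 1 core each = only 3 concurrent task slots
executors = 3
cores_per_executor = 1
slots = executors * cores_per_executor

print(total_partitions, slots)  # 12 partitions vs 3 slots: tasks queue up
```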
11-13-2017
11:14 AM
1 Kudo
No, it requires Spark 2.
11-12-2017
09:26 AM
Nothing about a cluster would prevent it from making external connections, but your firewall rules might. The variables you export here are not related to Spark. It's an error from the library you're using.
10-25-2017
10:27 AM
Oops, I meant to write S3 paths. Really, it's Hadoop and its APIs that support / don't support S3. It should be built into Hadoop distributions, however. I believe you might need an s3a:// protocol instead of s3://
10-25-2017
02:52 AM
Even I have forgotten exactly how it works off the top of my head, but yes, you are correct that you should be able to use HDFS paths. Yes it runs on Java 7 -- or 8, I believe, though I don't recall if that was tested. It doesn't require Java 8.
10-16-2017
07:55 AM
1 Kudo
Really, this is just saying you can upload data, at project creation time or later, from your local computer to the local file system that the Python/R/Scala sessions see. Those jobs then see those as simple local files and can do what they like with them. But you can also, within the same program, access whatever data you want, wherever it lives; you just need to write code that does so, via Spark or whatever library you want. There is no either/or here.
10-06-2017
12:05 AM
Are you looking for the .jar files that were produced as part of the release? Those are still in the repo and will stay there indefinitely as far as I know, just because they could be part of people's builds: https://repository.cloudera.com/artifactory/cloudera-repos/com/cloudera/oryx/
10-04-2017
08:08 PM
Oh, I forgot: we have made many obsolete repos in github.com/cloudera private. I can still see it, but of course you can't. Here's a tarball of the final release: https://drive.google.com/open?id=0B_hfrkaWlLi4MVlxQWVJaVd0ZGs
If there's any significant demand, I could revive the repo in my personal account.
10-04-2017
08:01 PM
That class was added in Kafka 0.10, so normally I'd say you have a version mismatch: you're using a version of Oryx built for Kafka 0.10+ with older Kafka. However, you say you're using Kafka 0.11. So I think the issue is that something else is bringing older Kafka client libs onto your classpath. For example, I think the Spark examples JAR file brings it in? That could be a source of old Kafka 0.8 libs. Or somehow the classpath EMR sets up is bringing in Spark's Kafka 0.8 and 0.10 integration, or something? This part is definitely tricky. I know it works on CDH, mostly because its Spark 2 distro only supports Kafka 0.10.
10-04-2017
07:55 PM
That implementation is obsolete at this point, I'd say, but sure you're welcome to go dig it out. It worked well. The releases and source are still on the 1.x project site: https://github.com/cloudera/oryx/releases