Member since: 09-15-2015

116 Posts | 141 Kudos Received | 40 Solutions
My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
|  | 2231 | 02-05-2018 04:53 PM |
|  | 3085 | 10-16-2017 09:46 AM |
|  | 2480 | 07-04-2017 05:52 PM |
|  | 3874 | 04-17-2017 06:44 PM |
|  | 3095 | 12-30-2016 11:32 AM |
			
    
	
		
		

01-02-2016 12:51 PM | 1 Kudo

This sounds like it may be a build problem. https://github.com/simonellistonball/spark-samples... has a working sample with sbt scripts to build against the Hortonworks repository, which has been tested on HDP 2.3.2. Note that the Kafka consumer API has changed a bit recently, so it's important to be aware of which Kafka version you are building against. Also, I note that you're running in local mode; we would recommend using local mode only for testing, and using --master yarn-client to run on a proper cluster.
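
For reference, a minimal build.sbt along these lines might look as follows; the repository URL and version numbers here are assumptions for illustration, not taken from the linked repo:

```scala
// Illustrative build.sbt for a Spark Streaming + Kafka job on HDP.
// Repository URL and versions are assumptions; match them to your cluster.
name := "spark-kafka-sample"

scalaVersion := "2.10.5"

resolvers += "Hortonworks Releases" at
  "http://repo.hortonworks.com/content/repositories/releases/"

libraryDependencies ++= Seq(
  // "provided" keeps the cluster's own Spark jars out of the assembly
  "org.apache.spark" %% "spark-core"            % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1"
)
```

The resulting jar would then be submitted with spark-submit --master yarn-client rather than a local master.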
						
					

12-03-2015 01:42 PM | 1 Kudo

To do this you build a pipeline with the GetFile processor, which can pick up files and delete or move them afterwards (just as the spooldir source does). For the batching functionality you can use MergeContent, or other batching mechanisms on downstream Put processors.
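
As a rough sketch, the relevant processor settings might look like this; the values are illustrative, not a tested flow:

```
GetFile
  Input Directory  : /data/incoming    # directory to spool from
  Keep Source File : false             # remove files once picked up

MergeContent
  Merge Strategy            : Bin-Packing Algorithm
  Minimum Number of Entries : 100      # batch this many files per bundle
  Maximum Bin Age           : 5 min    # flush a partial batch after this
```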
						
					

11-08-2015 02:28 PM

That's a start; however, PMML support in Spark is a way off being complete. In particular, there is no support for transformations yet. Spark would be a great platform for this, though it is a very heavy platform to spin up just for simple scoring in a NiFi flow.
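
To make the current state concrete, here is a minimal sketch of what Spark's PMML support does cover today, namely exporting a trained MLlib model; the model choice, data, and output path are illustrative:

```scala
// Minimal sketch of Spark MLlib's current PMML support: export only.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object PmmlExport extends App {
  val sc = new SparkContext(new SparkConf().setAppName("pmml-export"))

  val data = sc.parallelize(Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
    Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
  ))

  // Train a model type that implements PMMLExportable
  val model = KMeans.train(data, k = 2, maxIterations = 20)

  // Export works for a handful of model types; there is no PMML import,
  // evaluation, or transformation support on the Spark side.
  model.toPMML("/tmp/kmeans.pmml")
}
```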
						
					

11-08-2015 01:07 PM | 2 Kudos

JPMML is a great library for evaluating PMML models, including things like feature transformation and a good range of model support. However, its license is AGPL3, which makes it hard to include in Apache projects. I'm looking to evaluate PMML models as part of a custom NiFi processor, so I need an evaluator library with an Apache license.
						
					
Labels: Apache NiFi


11-04-2015 07:33 PM | 3 Kudos

The other thing to note is that to use Spark Packages, you also need z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven") in the dep paragraph. There is currently a bug in the Zeppelin loader, which we are working on, that prevents transitive dependencies from being brought in here, so with spark-csv, for example, you may have to add the opencsv dependency explicitly as well.
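
Putting that together, the dep paragraph would look something like this; the spark-csv and opencsv versions are illustrative:

```
%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
// workaround for the loader bug: load the transitive opencsv
// dependency explicitly as well
z.load("com.databricks:spark-csv_2.10:1.2.0")
z.load("net.sf.opencsv:opencsv:2.3")
```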
						
					

11-04-2015 11:15 AM | 4 Kudos

There are a range of common NLP systems that work well on the platform. OpenNLP is a native Java library which integrates well with, for example, MapReduce, and NLTK, being a Python system, works well with PySpark. There are also native Spark components connected to NLP tasks: Latent Dirichlet Allocation for topic detection is one example (see the sketch below). The NLTK components also work well with Hive for things like tokenisation and part-of-speech tagging.

Stanford CoreNLP also provides a good toolkit of NLP functions, and there is a spark-package to integrate it with SparkML pipelines.

Solr provides a number of useful tools that apply in the NLP space as well, such as stemming and synonym handling as part of its indexing and querying, so it offers some building blocks for simple NLP analysis. There are also a number of commercial and partner solutions which handle NLP tasks.

We are also looking to build tools for entity resolution on Spark, which will add to this.
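
As a concrete example of the native Spark angle, a minimal topic-detection sketch with MLlib's LDA; the toy corpus and parameters are illustrative:

```scala
// Minimal sketch of topic detection with MLlib's LDA.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

object LdaExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("lda-example"))

  // Each document is (id, term-frequency vector over a fixed vocabulary)
  val corpus = sc.parallelize(Seq(
    (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
    (1L, Vectors.dense(0.0, 1.0, 4.0, 0.0)),
    (2L, Vectors.dense(3.0, 0.0, 1.0, 2.0))
  ))

  val model = new LDA().setK(2).setMaxIterations(20).run(corpus)

  // topicsMatrix is vocabSize x k: per-topic term weights
  println(model.topicsMatrix)
}
```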
						
					

10-24-2015 11:03 PM

Tried this on an HDP 2.3.2 cluster (a brand-new build) with Spark 1.4.1, and had the same problem with Zeppelin and Magellan. It seems like Zeppelin is doing something to the context.
						
					

10-23-2015 03:39 PM

That will work for HDP 2.2, but it is not the way to do it on 2.3. On 2.3 we have a proper RPM-based install. This stack has not yet been updated to reflect the new deployment mechanism.
						
					

10-20-2015 11:48 AM

Mirror Maker works by consuming from a source Kafka cluster and producing into a destination Kafka cluster. If I am producing messages with compression enabled into the source Kafka, is there a way for Mirror Maker to consume them without decompression, i.e. just grab the raw compressed bits and pass those over the wire to the target Kafka? Or will the consumer force decompression, with recompression at the other end (meaning uncompressed data goes over the wire)?
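
For context, this is roughly where the compression settings enter the picture; the file names and values here are illustrative:

```
# MirrorMaker consumes with consumer.properties (source cluster)
# and produces with producer.properties (target cluster)
kafka-mirror-maker.sh \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist '.*'

# producer.properties: compression applied on the way out
#   compression.codec=gzip   (old Scala producer)
#   compression.type=gzip    (new Java producer)
```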
						
					
Labels: Apache Kafka


10-08-2015 06:58 PM | 2 Kudos

According to https://azure.microsoft.com/en-gb/documentation/articles/virtual-machines-a8-a9-a10-a11-specs/ the A8-A9 instances support a 32 Gbit/s RDMA backplane for node-to-node communication on SLES. Is the SLES image the preferred / only image which supports this networking layer, or are there Red Hat flavour alternatives? Would access to the 32 Gbit/s backplane through a multi-homed topology make a significant difference to intra-cluster communication, versus the relatively small CPU scale of the A8-A9? Simon
						
					