Member since: 09-15-2015
Posts: 116
Kudos Received: 141
Solutions: 40

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1844 | 02-05-2018 04:53 PM
 | 2392 | 10-16-2017 09:46 AM
 | 2077 | 07-04-2017 05:52 PM
 | 3106 | 04-17-2017 06:44 PM
 | 2284 | 12-30-2016 11:32 AM
03-22-2016
05:08 PM
3 Kudos
PMML is certainly a good option, but be aware that Spark does not support the transformation elements of PMML, so you will need to recreate any feature scaling and transformation before the scoring step. The other thing to note is that many of the Spark model classes do not depend on the SparkContext, so you can link Spark into your Storm topology and just use the Spark model itself. This can pull some unnecessary code into your jar, but has the advantage that you don't need to go through the PMML format.
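As a rough sketch of that second approach (assuming an mllib LogisticRegressionModel that was written out with plain Java serialization; the file name and feature values below are just placeholders), scoring needs no SparkContext at all:

import java.io.{FileInputStream, ObjectInputStream}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Read back a model that was trained elsewhere and serialized to a file
val in = new ObjectInputStream(new FileInputStream("model.ser"))
val model = in.readObject().asInstanceOf[LogisticRegressionModel]
in.close()

// predict() takes a plain local Vector, so this can sit inside a Storm bolt's execute()
val score = model.predict(Vectors.dense(0.2, 1.3, -0.7))

The same pattern should work for any of the serializable mllib models; only the training step needs the cluster.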
01-27-2016
03:58 PM
Good point, thanks Mark. I've updated my answer to FetchSFTP, since it needs the FlowFile inputs.
01-27-2016
03:39 PM
1 Kudo
You could feed the list of servers and files in as attributes on flow files from some list source; this could be an ExecuteSQL processor run against HiveServer. You would split the results and extract the relevant columns as attributes, which can then be used to parameterize the settings of a FetchSFTP processor through the expression language. You can then run multiple concurrent threads of the FetchSFTP processor to work the requests in parallel by raising the concurrent tasks option in the scheduling tab of the processor configuration.
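For example (attribute names like sftp.host and sftp.path are just placeholders that the upstream split/extract step would set), the FetchSFTP properties could be driven entirely by expression language:

Hostname: ${sftp.host}
Username: ${sftp.user}
Remote File: ${sftp.path}

Each incoming flow file then fetches the one file it describes, and the concurrent tasks setting controls how many fetches are in flight at once.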
01-24-2016
09:09 AM
4 Kudos
Hive doesn't support non-equijoins yet. This is coming soon.
01-13-2016
03:55 PM
2 Kudos
There may be some marginal gain in terms of network backplane throughput; however, it's not really necessary, and has to be balanced against cost, availability and flexibility. The A8-11 instances are intended more for traditional HPC workloads that require non-commodity networking. They are relatively rare compared to the more commodity-backed instances in Azure, so they can be hard to provision in large volume in some regions. The other key consideration is that they are not portable to other instance classes, so some of the elasticity benefits are lost. In short, you could in theory need the RDMA networking for very shuffle-heavy ML (maybe for deep learning or some of the newer neural net and graph algorithms in Spark), but the cost doesn't usually justify it, and you're usually better off with D-class instances for YARN and HDFS.
01-11-2016
07:37 PM
2 Kudos
One means of doing this might be to use web services to do the enrichment. For example, you could use Arc's REST service for geocoding, and invoke it from a GetHTTP or InvokeHTTP processor, passing parameters from the FlowFile attributes.
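As an illustration (the endpoint and the address attribute here are examples, not a tested configuration), the InvokeHTTP Remote URL can be built with the expression language along the lines of:

Remote URL: https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?f=json&SingleLine=${address}

The JSON response can then be parsed (e.g. with EvaluateJsonPath) to pull the coordinates back onto the flow file as attributes.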
01-08-2016
03:41 PM
More examples included in the original answer, showing a more efficient method. Aggregation per column is probably really a different question; feel free to ask it if you still need!
01-06-2016
09:24 AM
2 Kudos
The rdd.sample approach should do this for you. Note that sample takes a fraction of the number of lines, so you may need to know the line count. It will also return approximately one line for your fraction, so it may make sense to over-sample and then filter the small result set down to exactly one. Alternatively use sampleByKeyExact (shown below), which may be a little slower, but will give you an exact count. Given the multiple iterations, it may make sense to get all your samples into one RDD before iterating. Note that if you sample with withReplacement = true you get the same effect as multiple runs of sample over the same set. This approach is based on flattening your entire set of text files into an RDD[(filename: String, line: String)] and then using sampleByKey. You can also probably do the repeat sampling in local code (Scala here, but a similar approach works in Java / Python) as opposed to pure Spark:

import java.util.Random

val seed = 1234
val sampleCount = 1000
val rand = new Random(seed)

// One record per file: (file name, array of lines)
val files = sc.wholeTextFiles("*.txt")
val filesAndLines = files.map(x => (x._1, x._2.split("\n")))

// Draw sampleCount lines (with replacement) from each file's array of lines
val manySamples = filesAndLines.map(x => (x._1, List.fill(sampleCount)(x._2(rand.nextInt(x._2.length)))))

Note that java.util.Random is only a basic pseudo-random generator; nextInt(n) is uniform, but if you need higher-quality randomness use a different generator. If the files were larger you would be better off doing this job in a more Spark way:

// Flatten to an RDD[(filename, line)] so we can sample per key
val linesByFile = filesAndLines.flatMap(x => x._2.map(line => (x._1, line)))

// One fraction per file; fractions can exceed 1 because we sample with replacement
val fractionsScaled = filesAndLines.map(x => (x._1, sampleCount / x._2.length.toDouble)).collect().toMap

linesByFile.sampleByKeyExact(withReplacement = true, fractions = fractionsScaled)

Note here I'm producing fractions greater than 1 where there are fewer lines than the required number of iterations, but since we're sampling with replacement, we are effectively simulating taking a single sample many times at 1/lines. This also has the advantage that the Spark sampler is Poisson when withReplacement = true, so it should be based on a more uniform randomness.
01-05-2016
10:56 AM
1 Kudo
There are a number of online translation services which can be used to do this. Most of them work as REST APIs, which you can integrate into your ingestion process, whether that is through real-time ingest via something like Storm, or post-processing through a custom UDF or Oozie process. Something to look at would be the YandexTranslate processor in Hortonworks Data Flow. You could, for example, use the ExecuteSQL processor to get data out of your SQL Server, translate the content with the YandexTranslate processor, and then use PutHDFS to store the data in HDP.
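As a rough outline of that flow (the middle step depends on how your records come out of ExecuteSQL and which column holds the text to translate):

ExecuteSQL -> split records / extract the text column -> YandexTranslate -> PutHDFS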