Member since: 09-15-2015
Posts: 116
Kudos Received: 141
Solutions: 40

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1844 | 02-05-2018 04:53 PM
 | 2392 | 10-16-2017 09:46 AM
 | 2077 | 07-04-2017 05:52 PM
 | 3106 | 04-17-2017 06:44 PM
 | 2284 | 12-30-2016 11:32 AM
03-22-2016
05:08 PM
3 Kudos
PMML is certainly a good option, but be aware that Spark does not support the transformation elements of PMML, so you will need to recreate any feature scaling and transformation before the scoring step. The other thing to note is that many of the Spark model classes do not depend on the SparkContext, so you can link Spark into your Storm topology and just use the Spark model itself. This can pull some unnecessary code into your jar, but has the advantage that you don't need to go through the PMML format.
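As a rough sketch of that second approach (assuming an mllib LogisticRegressionModel that was written out with plain Java serialization; the file name and feature values below are just placeholders), scoring needs no SparkContext at all:

import java.io.{FileInputStream, ObjectInputStream}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Read back a model that was trained elsewhere and serialized to a file
val in = new ObjectInputStream(new FileInputStream("model.ser"))
val model = in.readObject().asInstanceOf[LogisticRegressionModel]
in.close()

// predict() takes a plain local Vector, so this can sit inside a Storm bolt's execute()
val score = model.predict(Vectors.dense(0.2, 1.3, -0.7))

The same pattern should work for any of the serializable mllib models; only the training step needs the cluster.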
01-27-2016
03:58 PM
Good point, thanks Mark. I've updated my answer to FetchSFTP, since it needs the FlowFile inputs.
01-27-2016
03:39 PM
1 Kudo
You could feed the list of servers and files in as attributes on flow files from some list source; this could be an ExecuteSQL processor run against HiveServer. You would split the results and extract the relevant columns as attributes, which can then be used to parameterize the settings of a FetchSFTP processor through the expression language. You can then run multiple concurrent threads of the FetchSFTP processor to work the requests in parallel by raising the concurrent tasks option in the scheduling tab of the processor configuration.
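For example (attribute names like sftp.host and sftp.path are just placeholders that the upstream split/extract step would set), the FetchSFTP properties could be driven entirely by expression language:

Hostname: ${sftp.host}
Username: ${sftp.user}
Remote File: ${sftp.path}

Each incoming flow file then fetches the one file it describes, and the concurrent tasks setting controls how many fetches are in flight at once.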
01-24-2016
09:09 AM
4 Kudos
Hive doesn't support non-equijoins yet. This is coming soon.
01-13-2016
03:55 PM
2 Kudos
There may be some marginal gain in terms of network backplane throughput; however, it's not really necessary, and has to be balanced against cost, availability and flexibility. The A8-11 instances are intended more for traditional HPC workloads that require non-commodity networking. They are relatively rare compared to the more commodity-backed instances in Azure, so they can be hard to provision in large volume in some regions. The other key consideration is that they are not portable to other instance classes, so some of the elasticity benefits are lost. In short, you could in theory need the RDMA networking for very shuffle-heavy ML (maybe for deep learning or some of the newer neural net and graph algorithms in Spark), but the cost doesn't usually justify it, and you're usually better off with D-class instances for YARN and HDFS.
01-11-2016
07:37 PM
2 Kudos
One means of doing this might be to use web services to do the enrichment. For example, you could use Arc's REST service for geocoding, and invoke it from a GetHTTP or InvokeHTTP processor, passing parameters from the FlowFile attributes.
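As an illustration (the endpoint and the address attribute here are examples, not a tested configuration), the InvokeHTTP Remote URL can be built with the expression language along the lines of:

Remote URL: https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates?f=json&SingleLine=${address}

The JSON response can then be parsed (e.g. with EvaluateJsonPath) to pull the coordinates back onto the flow file as attributes.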
01-08-2016
03:41 PM
More examples included in the original answer, showing a more efficient method. Aggregation per column is probably really a different question; feel free to ask it if you still need!
01-06-2016
09:24 AM
2 Kudos
The rdd.sample approach should do this for you. Note that sample takes a fraction of the number of lines, so you may need to know the line count. It will also return approximately one line for your fraction, so it may make sense to over-sample and then filter the small result set down to exactly one. Alternatively use sampleByKeyExact (shown below), which may be a little slower, but will give you an exact count. Given the multiple iterations, it may make sense to get all your samples into one RDD before iterating. Note that if you sample with withReplacement = true you get the same effect as multiple runs of sample over the same set. This approach is based on flattening your entire set of text files into an RDD[(filename: String, line: String)] and then using sampleByKey. You can also probably do the repeat sampling in local code (Scala here, but a similar approach works in Java / Python) as opposed to pure Spark:

import java.util.Random

val seed = 1234
val sampleCount = 1000
val rand = new Random(seed)

// One record per file: (file name, array of lines)
val files = sc.wholeTextFiles("*.txt")
val filesAndLines = files.map(x => (x._1, x._2.split("\n")))

// Draw sampleCount lines (with replacement) from each file's array of lines
val manySamples = filesAndLines.map(x => (x._1, List.fill(sampleCount)(x._2(rand.nextInt(x._2.length)))))

Note that java.util.Random is only a basic pseudo-random generator; nextInt(n) is uniform, but if you need higher-quality randomness use a different generator. If the files were larger you would be better off doing this job in a more Spark way:

// Flatten to an RDD[(filename, line)] so we can sample per key
val linesByFile = filesAndLines.flatMap(x => x._2.map(line => (x._1, line)))

// One fraction per file; fractions can exceed 1 because we sample with replacement
val fractionsScaled = filesAndLines.map(x => (x._1, sampleCount / x._2.length.toDouble)).collect().toMap

linesByFile.sampleByKeyExact(withReplacement = true, fractions = fractionsScaled)

Note here I'm producing fractions greater than 1 where there are fewer lines than the required number of iterations, but since we're sampling with replacement, we are effectively simulating taking a single sample many times at 1/lines. This also has the advantage that the Spark sampler is Poisson when withReplacement = true, so it should be based on a more uniform randomness.
01-05-2016
10:56 AM
1 Kudo
There are a number of online translation services which can be used to do this. Most of them work as REST APIs, which you can integrate into your ingestion process, whether that is through real-time ingest via something like Storm, or post-processing through a custom UDF or Oozie process. Something to look at would be the YandexTranslate processor in Hortonworks Data Flow. You could, for example, use the ExecuteSQL processor to get data out of your SQL Server, translate the content with the YandexTranslate processor, and then use PutHDFS to store the data in HDP.
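As a rough outline of that flow (the middle step depends on how your records come out of ExecuteSQL and which column holds the text to translate):

ExecuteSQL -> split records / extract the text column -> YandexTranslate -> PutHDFS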