10-21-2016
04:43 PM
Not sure what happened to the comment I made on doing this, so re-posting it. Part of this is about code style. You generally don't want to define implicits at the top level because it can make the code more difficult to reason about. For this reason it's common to tuck the implicits into an object (e.g., the relevant class's companion object or a dedicated Implicits object) and then import them just where you need them. This is probably the best use case for being able to do imports in the scope of a class, object or function -- you can apply an implicit without polluting the whole namespace.
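A minimal sketch of that scoping pattern, in case it helps. The names here (Meters, Implicits, Geometry, doubleToMeters) are made up for illustration and don't come from the original question:

```scala
import scala.language.implicitConversions

// Hypothetical value type used only for this illustration.
final case class Meters(value: Double)

// Implicits live in their own object instead of at the top level.
object Implicits {
  implicit def doubleToMeters(d: Double): Meters = Meters(d)
}

object Geometry {
  def perimeter(width: Meters, height: Meters): Meters =
    Meters(2 * (width.value + height.value))

  def squarePerimeter(side: Double): Meters = {
    // The conversion is visible only inside this method body.
    import Implicits._
    perimeter(side, side)
  }
}
```

Outside of squarePerimeter, passing a bare Double to perimeter won't compile, which is the point: the conversion only applies where it was explicitly imported.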
10-21-2016
04:23 PM
intToRational is only in the scope of Rational in your code, so the conversion isn't available to Ints outside of the Rational class -- the reverse order (x + 2) works because 2 ends up being bound by the + inside Rational, where the conversion is available. What you want to do is create a companion Rational object, define intToRational there, and then you can import it in Rational, and outside of it (global scope, e.g.) too. As a minor note, I'd check out the Spire project for a complete set of rational classes, plus a whole lot more.
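Here's a rough sketch of the companion-object layout. I don't have your exact class in front of me, so the body of Rational below (the constructor, gcd, toString) is an assumption modeled on the usual textbook version; only the placement of intToRational is the part that matters:

```scala
import scala.language.implicitConversions

class Rational(n: Int, d: Int) {
  require(d != 0, "denominator must be non-zero")

  private val g = Rational.gcd(n.abs, d.abs)
  val numer: Int = n / g
  val denom: Int = d / g

  def this(n: Int) = this(n, 1)

  def +(that: Rational): Rational =
    new Rational(numer * that.denom + that.numer * denom, denom * that.denom)

  override def toString: String = s"$numer/$denom"
}

object Rational {
  // Defined in the companion so it can be imported wherever it is needed.
  implicit def intToRational(x: Int): Rational = new Rational(x)

  private def gcd(a: Int, b: Int): Int = if (b == 0) a else gcd(b, a % b)
}
```

With that in place, an `import Rational._` at the call site brings the conversion into scope, so both `x + 2` and `2 + x` type-check there.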
10-19-2016
08:35 PM
3 Kudos
@Raj B The SplitText processor has a "Header Line Count" property. If you set this to 1, you should be able to achieve what you want in generating multiple flow files, each with the same header. That said, if you're intending to insert these into Hive, you could actually use ConvertCSVToAvro too, setting the delimiter to '|' and then you'd have the data in batches which should give you better throughput.
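To illustrate the effect of the header setting, here is a plain Scala sketch of what the splits should end up looking like. This is not NiFi code; splitWithHeader and its parameter names are hypothetical stand-ins for the processor's behavior:

```scala
object HeaderSplit {
  // Treat the first `headerLines` lines as a header and repeat them
  // at the top of every chunk of `linesPerSplit` body lines.
  def splitWithHeader(
      lines: Seq[String],
      headerLines: Int,
      linesPerSplit: Int): Seq[Seq[String]] = {
    val (header, body) = lines.splitAt(headerLines)
    body.grouped(linesPerSplit).map(chunk => header ++ chunk).toList
  }
}
```

For a four-line pipe-delimited file with one header line and two lines per split, you'd get two chunks, each starting with the same header line.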
10-05-2016
08:23 PM
3 Kudos
@Randy Gelhausen There are a few ways to do this.
1. Use the distributed map cache to get runtime attribute lookups and re-populate it as needed with new configs.
2. Use a scripted processor to look up your config values and merge the attributes onto the FlowFile.
3. I have some work in progress extending a lookup table service by @Andrew Grande that can do lookups against a properties file that is reloaded periodically. It includes a LookupAttribute processor that can merge in either specific properties or all the properties from a properties file. See: https://github.com/jfrazee/nifi-lookup-service/tree/file-based-lookup-service
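The merge step in the second and third options boils down to something like the sketch below. It's plain Scala over java.util.Properties, not the NiFi API and not the code in the linked repository; AttributeLookup, parse and merge are hypothetical names, and the periodic reload is left out:

```scala
import java.io.StringReader
import java.util.Properties

object AttributeLookup {
  // Parse properties-file text into an immutable map.
  def parse(text: String): Map[String, String] = {
    val props = new Properties()
    props.load(new StringReader(text))
    val builder = Map.newBuilder[String, String]
    val names = props.stringPropertyNames().iterator()
    while (names.hasNext) {
      val key = names.next()
      builder += (key -> props.getProperty(key))
    }
    builder.result()
  }

  // Merge either the selected keys or, when keys is None, every
  // config entry onto the existing attributes. Config values win.
  def merge(
      attributes: Map[String, String],
      config: Map[String, String],
      keys: Option[Set[String]] = None): Map[String, String] = {
    val selected = keys match {
      case Some(ks) => config.filter { case (k, _) => ks.contains(k) }
      case None     => config
    }
    attributes ++ selected
  }
}
```

Whether config values should override existing attributes is a design choice; the sketch just picks one.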
10-05-2016
08:05 PM
1 Kudo
@Timothy Spann ProcessorLog was removed between HDF 1.2/NiFi 0.6.x and HDF 2.0/NiFi 1.0 (see https://github.com/apache/nifi/pull/403) and that processor builds against the NiFi 0.6.x libraries, so it's going to need its dependencies updated to NiFi 1.0.0 to run under HDF 2.0.
09-29-2016
05:07 PM
Yes. That's sort of what I had in mind, but it'll still depend on how balanced/imbalanced your data is. There are algorithms for doing this more intelligently too but I've never looked at how to do them in Spark. It looks like the FPGrowth() classes expose a support proportion, but I can't quite tell what it does if you have, e.g., 10k 1's and 100 items with count > 1. I probably can't take you much further without doing some reading.
09-29-2016
03:47 PM
That percentage will certainly vary by domain so I don't know what normal will be. I will note that to do that on a large data set you'll need a step in your job to approximate where the cutoff is, but that's easy enough using the sampling methods exposed on the RDD.
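As a rough idea of what that approximation step could look like, here's a sketch on plain Scala collections. It assumes the cutoff you're after is a quantile of the per-pattern counts, which may not match your setup; approxCutoff and its parameters are hypothetical, and on an RDD you'd draw the sample with the RDD's own sample method instead of the Random-based filter:

```scala
import scala.util.Random

object CutoffEstimate {
  // Keep each count with probability `fraction`, then read the
  // requested quantile off the sorted sample.
  def approxCutoff(
      counts: Seq[Int],
      fraction: Double,
      quantile: Double,
      seed: Long): Option[Int] = {
    val rng = new Random(seed)
    val sample = counts.filter(_ => rng.nextDouble() < fraction).sorted
    if (sample.isEmpty) None
    else {
      val index = math.min(sample.length - 1, (quantile * sample.length).toInt)
      Some(sample(index))
    }
  }
}
```

The estimate only needs to be close enough to pick a sensible threshold, so a small sampling fraction is usually fine.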
09-28-2016
11:33 PM
@Pedro Rodgers Sorry, I meant to say to put the .filter() after the .reduceByKey(). Edited it in the original answer. It should run now. Yes, it's filtering out/eliminating occurrences with counts less than or equal to 2. If your data/training time isn't too big, you can probably tune that threshold and your confidence level empirically using a grid search.
09-28-2016
04:45 PM
@Pedro Rodgers Three things jump out at me:
1. There are a ton of examples in your sample data with just one or two occurrences of the pattern. For this particular algorithm and its usual application they're not going to be very useful.
2. Your confidence is quite high, considering the size of the sample data and the evidence for the different patterns.
3. The learner came up with 1165 rules for 1185 data points.
I re-ran your code including a .filter(_._2 > 2) after the .reduceByKey(_ + _) and lowered the confidence to 0.6 and I get 20 or so rules now with confidence varying between 0.6 and 1.0. I suspect if you carefully go through the results you were getting before you might see that it was just learning a 1-to-1 mapping between input and output, so the confidence of 1.0 is justified, but the generalization of the model is bad.
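For reference, the count-then-filter step looks like this when written against plain Scala collections. This is a stand-in so it runs without Spark: the groupBy plus sum plays the role of .reduceByKey(_ + _), and frequentPatterns is a hypothetical name, not something from your code:

```scala
object SupportFilter {
  // Count identical patterns, then keep only those seen more than
  // `minCount` times (the equivalent of .filter(_._2 > minCount)).
  def frequentPatterns[A](patterns: Seq[A], minCount: Int): Map[A, Int] =
    patterns
      .map(p => (p, 1))
      .groupBy(_._1)
      .map { case (p, pairs) => p -> pairs.map(_._2).sum }
      .filter(_._2 > minCount)
}
```

With minCount set to 2, a pattern has to show up at least three times to survive, which is what drops the one-off and two-off examples.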
09-01-2016
06:21 PM
1 Kudo
Not really sure why it's not loading the examples for you, but the JSON input that page should have loaded is:
{
  "Rating": 1,
  "SecondaryRatings": {
    "Design": 4,
    "Price": 2,
    "RatingDimension3": 1
  }
}
And the Jolt spec is:
[
  {
    "operation": "shift",
    "spec": {
      "Rating": "rating-primary",
      //
      // Turn all the SecondaryRatings into prefixed data
      // like "rating-Design" : 4
      "SecondaryRatings": {
        // the "&" in "rating-&" means go up the tree 0 levels,
        // grab what is there and substitute it in
        "*": "rating-&"
      }
    }
  }
]
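If it helps to see what that shift does to the input, here is a hand-written Scala equivalent of the renaming. It is not Jolt and doesn't use the Jolt library; ShiftSketch and shift are hypothetical names, and it only covers this one spec:

```scala
object ShiftSketch {
  // "Rating" becomes "rating-primary"; every key under
  // "SecondaryRatings" is flattened to "rating-<key>".
  def shift(rating: Int, secondary: Map[String, Int]): Map[String, Int] =
    Map("rating-primary" -> rating) ++
      secondary.map { case (key, value) => s"rating-$key" -> value }
}
```

For the input above, that yields rating-primary, rating-Design, rating-Price and rating-RatingDimension3 as top-level keys with their original values.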