Member since: 02-22-2016
Posts: 60
Kudos Received: 71
Solutions: 27
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4410 | 07-14-2017 07:41 PM |
| | 1304 | 07-07-2017 05:04 PM |
| | 5014 | 07-07-2017 03:59 PM |
| | 905 | 07-06-2017 02:59 PM |
| | 2869 | 07-06-2017 02:55 PM |
07-14-2017
07:41 PM
2 Kudos
@Marceli Stawicki There are a few ways to do this. The most common way is to read the FlowFile contents in a ProcessSession#read and assign the contents to an AtomicReference. For example:

final AtomicReference<String> contentsRef = new AtomicReference<>(null);
session.read(flowFile, new InputStreamCallback() {
    @Override
    public void process(final InputStream in) throws IOException {
        // IOUtils here is org.apache.commons.io.IOUtils
        final String contents = IOUtils.toString(in, "UTF-8");
        contentsRef.set(contents);
    }
});
final String contents = contentsRef.get();
Another approach is to use ProcessSession#exportTo:

final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
session.exportTo(flowFile, bytes);
final String contents = bytes.toString();
From there you should be able to compare contents to the value of the property.
07-07-2017
05:04 PM
@Debra Montague To do this, you will have to do three things:

1. Set nifi.kerberos.krb5.file=/etc/krb5.conf (or wherever your krb5.conf lives) in the ./conf/nifi.properties file on all HDF cluster nodes.
2. Make sure the appropriate keytab is accessible on all HDF cluster nodes (i.e., copy or distribute the keytabs to every node).
3. Configure the Put/Get/List/FetchHDFS processor with "Kerberos Keytab" set to the path of the keytab and "Kerberos Principal" set to a principal in that keytab.

You might find some of these articles helpful:

https://community.hortonworks.com/questions/84659/how-to-use-apache-nifi-on-kerberized-hdp-cluster-n.html
https://community.hortonworks.com/questions/88311/how-to-connect-to-kerberized-hdp-cluster-from-sepa.html

Last, if possible, verify that you can run an hdfs dfs -ls when kinit'd as the principal you're using, with the keytab you've specified in the processor configuration.
07-07-2017
03:59 PM
2 Kudos
@elliot gimple I know it's not really what you want, but there's an .rdd method you can call on a DataFrame in 1.6, so you could just do `df.rdd.countApprox(timeout)` on that. I'd have to look at the DAG more closely, but I think the overhead is just going to be in converting DataFrame elements to Rows, not in generating the full RDD before `countApprox` is called -- not 100% sure about that though.
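A minimal sketch of that, assuming `df` is an existing DataFrame (countApprox takes a timeout in milliseconds plus an optional confidence):

// countApprox returns a PartialResult; initialValue holds whatever estimate is ready once the timeout expires
val partial = df.rdd.countApprox(timeout = 1000L, confidence = 0.95)
val estimate = partial.initialValue   // a BoundedDouble with mean, low, and high
println(estimate.mean)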
07-07-2017
03:22 PM
There are several Scala-native JSON libraries available (json4s, play-json, circe, among others), so there are lots of ways to do this. With play-json, your solution might look like:

import play.api.libs.json._
import play.api.libs.json.Reads._
import play.api.libs.functional.syntax._

val jsonStr = """{"Id":1,"Name":"Babu","Address":"dsfjskkjfs"}"""

case class Person(id: Int, name: String, address: String)
implicit val personReads: Reads[Person] = (
(JsPath \ "Id").read[Int] and
(JsPath \ "Name").read[String] and
(JsPath \ "Address").read[String]
)(Person.apply _)
Json.parse(jsonStr).validate[Person] match {
case JsSuccess(person, _) => {
// do something
}
case JsError(e) => {
// do something
}
}
With circe, your solution might look like:

import io.circe._
import io.circe.parser._

val jsonStr = """{"Id":1,"Name":"Babu","Address":"dsfjskkjfs"}"""

case class Person(id: Int, name: String, address: String)

// the JSON keys are capitalized, so map them to the case class fields explicitly
implicit val personDecoder: Decoder[Person] =
  Decoder.forProduct3("Id", "Name", "Address")(Person.apply _)
decode[Person](jsonStr) match {
case Right(person) => {
// do something with Person(1, "Babu", "dsfjskkjfs")
}
case Left(e) => {
// handle error
}
}
In your question you asked for a case class but wrote it out as a tuple, so just to close that loop, if you want to subsequently convert the data to a tuple you can do `Person.unapply(person).get`.
07-06-2017
02:59 PM
1 Kudo
@Pavan Challa There's an answer for a related question that should address this for you. See: https://community.hortonworks.com/questions/109613/nifi-lookupattribute-and-updateattributes.html and https://gist.github.com/jfrazee/26deffba3a7d50e991495e223a020b93
07-06-2017
02:55 PM
2 Kudos
@Pavan Challa I've created a template [1] that sketches out how to satisfy your use case. In it I used an ExecuteScript [2] to evaluate the attribute expression from the to.path in the XML file. I do think it'd be better/easier/more perspicuous to just set the directory in PutFile to `/var/tmp/${country}/${datasource}/csv`, but figured it was worth showing you how you could do this anyway.

The (default) expression engine for the lookups [3] is rather simple, so you can't do conditional lookups like you're suggesting. You can, however, do what you want using the index of the properties. There is a way to enable XPath in the underlying libraries, but that'll require some internal code changes and won't help you immediately.

1. https://gist.github.com/jfrazee/26deffba3a7d50e991495e223a020b93#file-lookupattribute_xml_putfile_example-xml
2. https://gist.github.com/jfrazee/26deffba3a7d50e991495e223a020b93#file-evalexp-groovy
3. http://commons.apache.org/proper/commons-configuration/apidocs/org/apache/commons/configuration2/tree/DefaultExpressionEngine.html
07-05-2017
10:12 PM
1 Kudo
@Roger Young The newer APIs assume you have a DataFrame and not an RDD, so the easiest thing to do is to import the implicits from either sqlContext.implicits._ or spark.implicits._ and then either call .toDF on the initial load or create a DataFrame from your training RDD. You could alternatively use LogisticRegressionWithSGD or LogisticRegressionWithLBFGS, which can operate on RDDs, but then you'll have to convert your input to LabeledPoints. FWIW, I'd make sure to convert the columns in your training data to their respective data types, just to make sure that your continuous variables are treated as such and not as categorical.

import spark.implicits._
...
val training = sc.textFile("hdfs:///ford/fordTrain.csv")
val df = training.toDF
// fixup your data to ensure your columns are the expected data type
...
val lrModel = lr.fit(df)
...
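If you're on Spark 2.x, here's a minimal sketch of that fixup using spark.read with schema inference and a VectorAssembler (the column names are placeholders for whatever is in fordTrain.csv):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// infer numeric column types instead of leaving every column as a string
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///ford/fordTrain.csv")

// assemble the predictor columns into the single vector column the ml API expects
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2"))  // placeholder predictor column names
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("label")                         // placeholder label column name
  .setFeaturesCol("features")

val lrModel = lr.fit(assembler.transform(raw))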
06-27-2017
09:39 PM
@Rajasekaran Dhanasekaran How to do this is going to vary a lot from model to model. So, e.g., RandomForestRegressionModel has .featureImportances, whereas LogisticRegressionModel only has the feature weight matrix .weights. To make things worse, there are very specific conditions under which it's possible to interpret model coefficients: if any features are correlated, "multicollinearities" arise and a parameter might be bigger or smaller, significant or non-significant, depending on the correlated one. So the coefficients aren't necessarily meaningful if you haven't ensured there's no multicollinearity. A model can still perform well overall even when this is the case, so you have to be careful.

If you're certain that you've eliminated such correlations, then in regression regimes the model coefficients can usually be interpreted somewhat directly. Since it sounds like you're doing binary classification (buy or not buy), you'll want to look into "how to interpret logistic regression coefficients". If you're throwing hundreds or thousands of features at your problem, though, interpreting the coefficients isn't going to work well.

So, if you'd like to better understand how your features explain your data, I'd recommend looking at things like the ChiSqSelector feature selector or dimensionality reduction techniques such as PCA. You can use these at development time to answer questions like "what features best explain my data?" more perspicuously.
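For example, a minimal ChiSqSelector sketch, assuming a DataFrame df that already has an assembled "features" vector column and a numeric "label" column (both names are placeholders):

import org.apache.spark.ml.feature.ChiSqSelector

// keep the 10 features with the strongest chi-squared association with the label
val selector = new ChiSqSelector()
  .setNumTopFeatures(10)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selectorModel = selector.fit(df)
// indices (into the original feature vector) of the features that were kept
println(selectorModel.selectedFeatures.mkString(", "))
val reduced = selectorModel.transform(df)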
06-26-2017
02:28 PM
1 Kudo
@Pavan Challa LookupAttribute is not going to interpolate the values being returned, just the lookup keys themselves. So, the way you should do this is to do the lookup on country and datasource and then, in the PutFile directory property, use /var/tmp/${country}/${datasource}/csv as the value (instead of putting it in the XML file). If you must have the full to-path pulled out of the config, you might be able to do a follow-on UpdateAttribute or ExecuteScript to get the country and datasource interpolated.
06-26-2017
02:10 PM
@Arsalan Siddiqi The standard answer for delay modeling is to model the delay times using an exponential distribution. There's an analytical Bayesian solution to this (i.e., no MCMC), or you can use GeneralizedLinearRegression from MLlib with the "gamma" family (since the exponential is a special case of the gamma with alpha = 1). There's probably an alternative way to think about the problem in terms of the number of delayed batches, which could be analyzed using a Poisson. Without knowing more about the goals it's hard to say what makes more sense.

You can always squint at something and turn it into a classification problem (different classes or buckets for different ranges of data, i.e., quantizing), but there is an order to things like delays and total counts, and your standard classification regimes don't take this into account (there is an ordered logit, but AFAICT MLlib doesn't have it out of the box). That said, such an approach often works (what counts as adequate performance is an empirical matter); so if the aforementioned approach is beyond your current reach, classification could be acceptable, since for a business application (vs. scientific inquiry) it's as important to understand what you've done as it is to be correct.

Note: For these regression approaches using non-Gaussian error distributions, you will likely need to transform the fitted parameters for them to be interpretable.
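To make the GeneralizedLinearRegression suggestion concrete, here's a minimal sketch (assuming Spark 2.x and a hypothetical DataFrame delays with a strictly positive "delay" label column and an assembled "features" vector column):

import org.apache.spark.ml.regression.GeneralizedLinearRegression

// gamma family with a log link; the label must be strictly positive
val glr = new GeneralizedLinearRegression()
  .setFamily("gamma")
  .setLink("log")
  .setLabelCol("delay")
  .setFeaturesCol("features")

val model = glr.fit(delays)
// with the log link, exp(coefficient) acts as a multiplicative effect on the expected delay
println(model.intercept)
println(model.coefficients)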