Member since: 07-18-2016
Posts: 94
Kudos Received: 94
Solutions: 20
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2522 | 08-11-2017 06:04 PM |
| | 2386 | 08-02-2017 11:22 PM |
| | 9429 | 07-10-2017 03:36 PM |
| | 17847 | 03-17-2017 01:27 AM |
| | 14730 | 02-24-2017 05:35 PM |
01-02-2017
08:34 PM
1 Kudo
Hi @rathna mohan, for Phoenix 4.0 and above, you'll want to use this syntax:
HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar /usr/hdp/2.4.2.0-258/phoenix/phoenix-4.4.0.2.4.2.0-258-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --view UDM_TRANS --input /home/hadoop1/sample.csv
Make sure to change the path to hbase-protocol.jar and the path to the HBase conf directory to match your environment. Give this a try and let me know if it helps. Additional link for Phoenix bulk loading: https://phoenix.apache.org/bulk_dataload.html
12-19-2016
02:23 PM
@Johnny Fugers In this context, "predicting purchase" could mean a few different things (and there are a few ways we could go about it). For example, if you are interested in predicting whether person 1 will purchase product A, then you can look at their purchase history and/or at similar purchases across a segment of customers. In the first scenario, you are basically working with probabilities (i.e., if I buy peanut butter every time I go to the store, then there's a high probability that I'll buy it on my next visit). Your predictive model should also take into consideration other factors such as time of day, month (seasonality), store ID, etc. If you create a model for every customer, this can get expensive from a compute standpoint, which is why many organizations segment customers into groups/cohorts based on behavioral similarities and then build predictive models against those segments. A second approach would be to use market basket analysis. For example, when customer A purchases cereal, how likely are they to purchase milk? This factors in purchases across a segment of customers to look for "baskets" of similar purchases.
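As a rough illustration of the market basket idea, here's a small PySpark sketch that estimates the confidence of the rule {cereal} -> {milk} from a handful of made-up baskets (it assumes sc is an existing SparkContext; the data is purely illustrative):
# Hypothetical transactions: one list of purchased items per basket
baskets = sc.parallelize([
    ["cereal", "milk", "bread"],
    ["cereal", "milk"],
    ["cereal", "eggs"],
    ["milk", "bread"]
])
n_cereal = baskets.filter(lambda b: "cereal" in b).count()
n_cereal_and_milk = baskets.filter(lambda b: "cereal" in b and "milk" in b).count()
# Confidence of {cereal} -> {milk}, i.e. P(milk | cereal)
print(float(n_cereal_and_milk) / n_cereal)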
12-16-2016
12:21 AM
1 Kudo
@Johnny Fugers This could go a few different ways depending on the business objective. Is there a reason why you are asking about a supervised approach? If you are open to any ML approach, here's what I'd recommend...
Recommendation Engine: The first thing that comes to mind is to use a recommendation engine. In Spark 1.6, there's a collaborative filtering algorithm (ALS). It makes recommendations based on a user's behavior and their similarity to other users.
Frequent Patterns: Another interesting option would be to perform frequent pattern mining. This approach identifies frequent item sets and association rules.
Clustering: There are several clustering algorithms in Spark, which you can leverage to identify similar customer cohorts. You could also look to identify the total value of these cohorts based on their overall spending, or try to identify who they are based on purchases (i.e., single adults, elderly, parents, etc.). You might also want to check out time series analysis / forecasting if you want to predict spending, growth/seasonal patterns, etc.
So most of my recommendations are unsupervised approaches, but that is a good fit for this type of data and for similar use cases. If you'd prefer to take a supervised approach, you could always use a decision tree, random forest, or similar and try to predict the purchase value based on store, time of day, day of week, customer, and maybe even throw in a customer segment variable (which you can derive using the above approach 🙂 ).
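To make the collaborative filtering option concrete, here's a minimal PySpark sketch against Spark 1.6's MLlib ALS. The user/product IDs and purchase counts are made up, and sc is assumed to be your SparkContext:
from pyspark.mllib.recommendation import ALS, Rating
# Hypothetical (user, product, purchase count) triples treated as implicit feedback
purchases = sc.parallelize([
    Rating(1, 101, 3.0), Rating(1, 102, 1.0),
    Rating(2, 101, 2.0), Rating(2, 103, 5.0),
    Rating(3, 102, 4.0), Rating(3, 103, 1.0)
])
# Train an implicit-feedback ALS model (rank and iterations are illustrative)
model = ALS.trainImplicit(purchases, rank=10, iterations=10)
# Top 2 product recommendations for user 1
print(model.recommendProducts(1, 2))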
10-27-2016
02:29 PM
@Roberto Sancho Your schema structure is close, but you need to make a few modifications, like this:
import org.apache.spark.sql.types._
val data = sc.parallelize("""{"CAMPO1":"xxxx","CAMPO2":"xxx","VARIABLE":{"V1":"xxx"}}""" :: Nil)
val schema = (new StructType)
.add("CAMPO1", StringType)
.add("CAMPO2", StringType)
.add("VARIABLE", (new StructType)
.add("V1", StringType))
sqlContext.read.schema(schema).json(data).select("VARIABLE.V1").show()
Please let me know if this works for you. Thanks!
10-21-2016
12:57 PM
@jordi
You can use the PostHTTP processor to send JSON (or another format) to Grafana using their API. Depending on your use case, you may want to add Kafka in there as an intermediate message/routing bus, especially if you are sending that data to other endpoints/services in addition to Grafana. An alternative option would be to send the data into Solr (using the PutSolrContentStream NiFi processor). From Solr, you can use the Banana UI to create a real-time dashboard. If this answers your question, please click "Accept" on the response. 🙂 And if you have additional questions or would like more detail, please let me know.
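For reference, the request that PostHTTP would issue looks roughly like the python sketch below. The URL, API key, and payload fields are placeholders, so swap in whatever Grafana endpoint you're targeting:
import requests
# Placeholder Grafana endpoint and annotation-style payload -- adjust for your setup
url = "http://your-grafana-host:3000/api/annotations"
headers = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}
payload = {"text": "event from NiFi", "tags": ["nifi"], "time": 1477000000000}
r = requests.post(url, json=payload, headers=headers)
print(r.status_code, r.text)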
10-21-2016
01:50 AM
2 Kudos
@jordi There are a few ways to do this within NiFi, but I would recommend using the ExecuteScript processor. You can use this processor to execute a python script (like the one in the link you provided). The response from pubnub can be captured using python (as JSON) and then routed to the EvaluateJsonPath processor, where the JSON values can be parsed and processed as part of your NiFi flow. NiFi can also route the JSON or parsed values to HDFS, HBase, Hive, etc. using the corresponding processors. Let me know if this helps.
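In case it's useful, here's a rough Jython sketch of what the ExecuteScript body could look like once you have the pubnub response as a python dict. The payload below is just a stand-in for the real response, and the StreamCallback pattern is one common way to write new content to a flowfile from a script:
import json
from org.apache.nifi.processor.io import StreamCallback
class WriteJson(StreamCallback):
    def __init__(self, payload):
        self.payload = payload
    def process(self, inputStream, outputStream):
        # Write the JSON payload as the flowfile's new content
        outputStream.write(bytearray(json.dumps(self.payload).encode('utf-8')))
flowFile = session.get()
if flowFile is None:
    flowFile = session.create()
payload = {"channel": "demo", "message": "hello"}  # stand-in for the pubnub response
flowFile = session.write(flowFile, WriteJson(payload))
session.transfer(flowFile, REL_SUCCESS)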
10-20-2016
03:21 PM
Here's example python code that you can run from within NiFi's ExecuteScript processor:
import json
# Get the incoming flowfile from the session (ExecuteScript script binding)
flowFile = session.get()
if flowFile is not None:
    # Read the filename and path attributes, then open and parse the JSON file
    filename = flowFile.getAttribute('filename')
    filepath = flowFile.getAttribute('absolute.path')
    data = json.loads(open(filepath + filename, "r").read())
    # Calculate an arbitrary new value within python (assumes value1 is numeric)
    data_value1 = data["record"]["value1"]
    new_value = data_value1 * 100
    # Add the values to the flowfile as new attributes and route to success
    flowFile = session.putAttribute(flowFile, "from_python_string", "python string example")
    flowFile = session.putAttribute(flowFile, "from_python_number", str(new_value))
    session.transfer(flowFile, REL_SUCCESS)
session.commit()
10-20-2016
03:08 PM
2 Kudos
@Aruna dadi I have an example NiFi flow that may be helpful: https://github.com/zaratsian/nifi_python_executescript It shows how to execute a python script within a NiFi flow using the ExecuteScript processor, along the lines of what @Matt Burgess mentioned. There are ways to process both the flowfile and its attributes as part of your python program, but you need to add a java command to initialize the flowfile object within python. Also, as @Matt Burgess mentioned, you can use python to make an HTTP request directly. I prefer to use the "requests" package, such as:
import requests
r = requests.get("http://yourURL.com")
You can also use POST, PUT, DELETE, HEAD, or OPTIONS requests...
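For example, a POST with a JSON body (the URL and payload here are just placeholders) would look like:
import requests
r = requests.post("http://yourURL.com/api/endpoint", json={"key": "value"})
print(r.status_code)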
10-18-2016
01:29 PM
@Vijay Kanth
I assume you are running Scala/Spark code? Also, just as an FYI, the link that you shared above is for the latest version of Spark (which is currently 2.0.1). HDP 2.3 is an older version of the Hortonworks platform and it runs Spark 1.4.1. If you want to use the latest version of Spark, you can upgrade to HDP 2.5. But you should not need to upgrade to create a dataframe within Spark. Here's spark/scala code that I used within HDP 2.3 to generate a sample dataframe:
val df = sqlContext.createDataFrame(Seq(
("1111", 10000,"M"),
("2222", 20000,"F"),
("3333", 30000,"M"),
("4444", 40000,"F")
)).toDF("id", "income","gender")
df.show()
Try this out and let me know if it works.
10-13-2016
04:23 PM
2 Kudos
@Deepak Subhramanian I got this to work in python 2.6 by following these steps:
1.) Download the graphframes-0.2.0-spark1.6-s_2.10.zip file from here
2.) Download the graphframes-0.2.0-spark1.6-s_2.10.jar file from here
3.) Unzip graphframes-0.2.0-spark1.6-s_2.10.zip
4.) Navigate to the python directory: cd ./graphframes-0.2.0-spark1.6-s_2.10/python
5.) Zip up the contents contained within this directory: zip mypyfiles.zip * -r
6.) Launch pyspark: ./bin/pyspark --py-files mypyfiles.zip --jars graphframes-0.2.0-spark1.6-s_2.10.jar
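Once pyspark launches cleanly, a quick sanity check is to build a tiny graph. This is just a sketch with made-up vertices and edges, and it assumes the default sqlContext that pyspark creates:
from graphframes import GraphFrame
# Illustrative vertices (must have an "id" column) and edges (must have "src" and "dst")
v = sqlContext.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
e = sqlContext.createDataFrame([("a", "b", "friend"), ("b", "c", "follow")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)
g.inDegrees.show()
print(g.edges.filter("relationship = 'follow'").count())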
Give that a shot - let me know how it goes.