Member since: 06-09-2016
Posts: 34
Kudos Received: 2
Solutions: 0
10-02-2017
11:54 PM
1 Kudo
@Johnny Fugers They are part of the Hortonworks Data Platform (HDP) and are 100% open source under the Apache license. To get enterprise support for these products, you can start from this link to explore pricing for support and professional services: https://hortonworks.com/services/support/enterprise/ Phone contact: 1.408.675.0983
12-19-2016
02:23 PM
@Johnny Fugers In this context, "predicting purchase" could mean a few different things (and there are several ways we could go about it). For example, if you are interested in predicting whether person 1 will purchase product A, you can look at their purchase history and/or at similar purchases across a segment of customers. In the first scenario, you are basically working with probabilities (i.e. if I buy peanut butter every time I go to the store, then there's a high probability that I'll buy it on my next visit). Your predictive model should also take into consideration other factors such as time of day, month (seasonality), store ID, etc. If you create a model for every customer, this could get expensive from a compute standpoint, which is why many organizations segment customers into groups/cohorts based on behavioral similarities. Predictive models are then built against these segments.

A second approach would be to use market basket analysis. For example, when customer A purchases cereal, how likely are they to also purchase milk? This factors in purchases across a segment of customers to look for "baskets" of similar purchases.
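As a rough sketch of the first approach, here is how a naive per-customer purchase probability could be computed in PySpark; the transaction data and column names (customer_id, visit_id, product_id) are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("purchase-probability").getOrCreate()

# Hypothetical transaction data: one row per (customer, visit, product)
transactions = spark.createDataFrame(
    [("c1", "v1", "peanut_butter"), ("c1", "v2", "peanut_butter"),
     ("c1", "v2", "milk"), ("c1", "v3", "peanut_butter")],
    ["customer_id", "visit_id", "product_id"])

# Total store visits per customer
visits = transactions.groupBy("customer_id") \
    .agg(F.countDistinct("visit_id").alias("total_visits"))

# Visits in which each product was bought
product_visits = transactions.groupBy("customer_id", "product_id") \
    .agg(F.countDistinct("visit_id").alias("visits_with_product"))

# Naive probability: share of a customer's visits that included the product
probabilities = product_visits.join(visits, "customer_id") \
    .withColumn("p_purchase", F.col("visits_with_product") / F.col("total_visits"))

probabilities.show()

A real model would fold in the extra factors mentioned above (time of day, seasonality, store ID) and would typically be trained per customer segment rather than per individual.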
10-29-2016
07:25 PM
1 Kudo
Pig runs MapReduce under the covers, and this list of files is the output of a MapReduce job. You should also notice a 0-byte (no contents) file named _SUCCESS at the top of the list; that is just a flag saying the job was a success.

The bottom line is that when you point your job or table to the parent directory holding these files, it simply sees the union of all the files together, so you can think of the parent directory as logically being the "file" holding the data. Thus, there is never a need to concatenate the files on Hadoop -- just point to the parent directory and treat it as the file. If you make a Hive table, point it to the parent directory. If you load the data in a Pig script, point to the parent directory. Etc.

If you want to pull the data to an edge node, use the command hdfs dfs -getmerge <hdfsParentDir> <localPathAndName> and it will combine all of the part-m-00000, part-m-00001, ... files into a single file. If you want to pull it to your local machine, use Ambari File Views: open the parent directory, click "+ Select All" and then click "Concatenate". That will concatenate everything into one file and download it from your browser.

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps.
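For example, a PySpark job can read the whole output directory as one dataset; the path and delimiter below are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-part-files").getOrCreate()

# Hypothetical output directory produced by the Pig/MapReduce job.
# Spark reads part-m-00000, part-m-00001, ... as one logical dataset
# and ignores underscore-prefixed files such as the _SUCCESS marker.
df = spark.read.csv("hdfs:///user/me/pig_output", sep=",", inferSchema=True)
df.show()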
09-27-2016
01:17 PM
This produces the results you want:

RAW = LOAD 'filepath' USING PigStorage(';') AS
    (Employee:chararray, Stock:int, Furnisher:chararray, Date:chararray, Value:double);

-- Dense-rank the rows by Employee and Date; the rank becomes field $0
RANKING = RANK RAW BY Employee, Date DENSE;

-- Sum the Value column within each rank group
GRP = GROUP RANKING BY $0;
SUMMED = FOREACH GRP {
    summed = SUM(RANKING.Value);
    GENERATE $0, summed AS Ranksum;
};

-- Join the per-rank sums back onto the original rows
JOINED = JOIN RANKING BY $0, SUMMED BY $0;
FINAL = FOREACH JOINED GENERATE $0, Employee, Stock, Furnisher, Date, Ranksum;

STORE FINAL INTO 'destinationpath' USING PigStorage(',');

Let me know this is what you are looking for by accepting the answer. If I did not get the requirements correct, please clarify.
09-19-2016
02:45 PM
1 Kudo
See https://spark.apache.org/docs/1.6.0/mllib-frequent-pattern-mining.html for a newer replacement for Apriori. Apriori itself doesn't seem to be used much in massively parallel modern systems. Here is also a nice article on market basket analysis: http://jayaniwithanawasam.blogspot.com/2015/08/market-basket-analysis-with-apache.html
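As a minimal sketch of the FP-growth API described on that page (the baskets below are made up), something like this finds frequently co-purchased items:

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="fp-growth-example")

# Hypothetical baskets: each element is one customer's transaction
baskets = sc.parallelize([
    ["cereal", "milk", "bread"],
    ["cereal", "milk"],
    ["bread", "butter"],
    ["cereal", "milk", "butter"],
])

# Keep itemsets that appear in at least 50% of the baskets
model = FPGrowth.train(baskets, minSupport=0.5, numPartitions=2)
for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)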
08-25-2016
10:51 PM
@Johnny Fugers That question is a little ambiguous. Do you mean whether Hadoop provides out-of-the-box tools where you push in data and it tells you what distribution it follows? The answer is no. But, just as outside of Hadoop, you would normally assume a distribution for your data and then verify whether the data agrees with your assumption. That you can certainly do. Use Spark for it; check this link. Or use Python. Check PySpark also. Or even R.
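For example, here is a rough PySpark sketch that checks a sample against an assumed standard normal distribution with a Kolmogorov-Smirnov test; the sample values and parameters are made up:

from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="distribution-check")

# Hypothetical numeric sample pulled from your dataset
data = sc.parallelize([0.1, -0.5, 0.3, 1.2, -0.9, 0.05, 0.7, -1.1])

# Test against a normal distribution with mean 0.0 and stddev 1.0;
# a small p-value suggests the data does not follow that distribution
result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
print(result)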
08-24-2016
01:25 PM
3 Kudos
+1 on a recommender system. A more concrete example is "Building a Movie Recommendation Service with Apache Spark" below, which walks you through building one step by step. https://www.codementor.io/spark/tutorial/building-a-recommender-with-apache-spark-python-example-app-part1
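As a rough sketch of a collaborative-filtering recommender with Spark MLlib's ALS (the ratings below are made up):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="movie-recommender")

# Hypothetical (userId, movieId, rating) triples
ratings = sc.parallelize([
    Rating(1, 10, 5.0), Rating(1, 20, 3.0),
    Rating(2, 10, 4.0), Rating(2, 30, 5.0),
    Rating(3, 20, 2.0), Rating(3, 30, 4.0),
])

# Train a matrix-factorization model with ALS
model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.1)

# Top 2 movie recommendations for user 1
for rec in model.recommendProducts(1, 2):
    print(rec)

In practice the rank and regularization would be tuned against a held-out portion of the ratings.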
08-09-2016
02:48 PM
2 Kudos
@Johnny Fugers Hive is great for typical BI queries, and its scalability is limitless. When you get into the area of updates, I would rather do those activities in Phoenix and serve the end results back to Hive for BI queries. Hive ACID is coming soon; until that is available, I would use the Phoenix -> Hive route. Use Pig for ETL. Where it gets interesting is using an MPP database on Hadoop; that is where HAWQ comes in. It is a good low-latency DB engine that gives you some of the benefits of both Hive and Phoenix. It does not cover all of Hive's and Phoenix's capabilities, but I would say it is a good happy medium. I hope that helps. As you go further into your journey, you will start to ask questions about security and governance. For security you will start with Ranger & Knox, and for governance you will start with Falcon/Atlas/Ranger.
08-08-2016
09:29 PM
João Souza, if you find such an article, can you share it here? Many thanks!
08-03-2016
06:28 PM
1 Kudo
You can write a new script that uses a regex to test this column and throws away bad fields, or do it all in one step by passing the date field to a UDF that checks the formatting.
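As a rough sketch of the UDF route, here is a Jython UDF that Pig could register; the file name, the expected date format (dd/MM/yyyy), and the alias are assumptions:

# date_checks.py -- register in Pig with something like:
#   REGISTER 'date_checks.py' USING jython AS date_checks;
import re

from pig_util import outputSchema  # helper shipped with Pig for Jython UDFs

# Assumed date format: dd/MM/yyyy
DATE_PATTERN = re.compile(r'^\d{2}/\d{2}/\d{4}$')

@outputSchema('is_valid:int')
def is_valid_date(value):
    # Return 1 when the field matches the expected pattern, else 0,
    # so the Pig script can FILTER out badly formatted rows.
    if value is None:
        return 0
    return 1 if DATE_PATTERN.match(value) else 0

The Pig side would then be a one-line filter, e.g. GOOD = FILTER RAW BY date_checks.is_valid_date(Date) == 1;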