We have a marketing report in the form of a tabular data set whose records look like this:
|Page Viewed||1472688038||489687||www.abc.com||Sample Data||Sample Data|
|Page Viewed||1472688052||118805||www.abc.com||Sample Data||Sample Data|
|Request Information Click||1472688056||192674||www.abc.com||Sample Data||Sample Data|
|Page Viewed||1472688087||204231||ww.123.com||Sample Data||Sample Data|
|Page Viewed||1472688161||76081||www.abc.com||Sample Data||Sample Data|
|Page Viewed||1472688219||186081||www.abc.com||Sample Data||Sample Data|
|Page Viewed||1472688236||83259||www.google.co.in||Sample Data||Sample Data|
|Page Viewed||1472688310||61410||www.tuv.in||Sample Data||Sample Data|
We need to write a MapReduce program to find the most frequent initial referring site, in order to determine which website is the most effective ad platform.
I am able to solve this problem in Hive and Pig, but I was not able to get the correct result with a MapReduce program.
Any reference or similar piece of code would help.
Assuming that by "tabular" you mean your file is comma-, tab-, or pipe-delimited, a simple word-count program should suffice: map each record to its referring-site field, then count per key in the reducer.
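Here is a minimal word-count sketch in Python (Hadoop Streaming style, but runnable locally). It assumes the records use `||` as the field delimiter with a leading and trailing `|`, and that the initial referring site is the fourth field — both assumptions are read off your sample rows, so adjust the index if your real schema differs.

```python
def mapper(line):
    # Records look like: |Event||Timestamp||ID||Site||...||...|
    # Strip the outer pipes, then split on the '||' delimiter.
    fields = line.strip().strip('|').split('||')
    if len(fields) >= 4:
        # Assumption: field 4 (index 3) is the initial referring site.
        yield fields[3], 1

def reducer(site, ones):
    # Sum the 1s emitted for this site by the mappers.
    return site, sum(ones)

def run(lines):
    """Local simulation of the shuffle (group by key) and reduce phases."""
    grouped = {}
    for line in lines:
        for site, one in mapper(line):
            grouped.setdefault(site, []).append(one)
    return dict(reducer(site, ones) for site, ones in grouped.items())

sample = [
    "|Page Viewed||1472688038||489687||www.abc.com||Sample Data||Sample Data|",
    "|Request Information Click||1472688056||192674||www.abc.com||Sample Data||Sample Data|",
    "|Page Viewed||1472688087||204231||ww.123.com||Sample Data||Sample Data|",
]
counts = run(sample)
top_site, top_count = max(counts.items(), key=lambda kv: kv[1])
print(top_site, top_count)  # www.abc.com 2
```

For a real Hadoop Streaming job you would split `mapper` and `reducer` into two scripts reading stdin and writing tab-separated key/value pairs; the grouping in `run` is what the framework's shuffle phase does for you.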
A nice post showing ways to achieve this with Hive, Pig, R, Spark, MapReduce (Java), or MapReduce (Python) can be found at the link below. The page formatting is not great, but the content is informative.
As always, if you find this post useful, don't forget to "accept" the answer.