I have a MapReduce job that calculates metrics for each customer over time (one set of metrics for each month that they are a customer with our company). Once it completes, the access pattern will likely be by year and month, so I was thinking that I should partition the output on that.
I'm new to custom partitioning, and from what I'm reading, the partition step comes before the reduce step (where I need the key to be the customer ID). I'm wondering if there is a way to partition after the reduce step without running a second MapReduce job afterward.
Partitioning runs after the mapper and before the reducer; there is no getting around that. But I am a little confused about what you are trying to do, so please bear with me. The way I look at it, you can write a custom partitioner that always sends keys with the same month and year to the same reducer. The reducer output will then hold all values for a given month and year in the same file, so your access queries scan fewer files. Am I missing something? I am still not sure what repartitioning after the reduce step would achieve.
And one quick suggestion: since this seems like a new job, why not write it in Spark? The API is much easier and more flexible to work with, especially for requirements like these.
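As a rough illustration of the partitioner idea, here is a plain-Java sketch of the logic that a custom partitioner's getPartition() could run. It is only a sketch under assumptions: the class name, the method name, and the "yearMonth|customerId" key layout are hypothetical and not taken from your job. In a real job this logic would sit in a subclass of org.apache.hadoop.mapreduce.Partitioner.

```java
// Plain-Java sketch of the routing logic only; no Hadoop classes are used here.
// ASSUMPTION: the map output key is a string such as "201703|C123"
// (yearMonth, then '|', then customerId). Names are hypothetical.
class YearMonthPartitionSketch {

    // Returns the reducer index for a key, looking only at the yearMonth part,
    // so every customer in the same month is routed to the same reducer.
    static int partitionFor(String compositeKey, int numReduceTasks) {
        // Throws if the key has no '|' separator, which the assumed layout requires.
        String yearMonth = compositeKey.substring(0, compositeKey.indexOf('|'));
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (yearMonth.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // The first two keys share a month, so they print the same index.
        System.out.println(partitionFor("201703|C123", reducers));
        System.out.println(partitionFor("201703|C999", reducers));
        System.out.println(partitionFor("201704|C123", reducers));
    }
}
```

Note that hashing only guarantees that one month never spans two reducers. Two different months can still land on the same reducer, and therefore in the same output file, when there are more months than reducers or when their hashes collide.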
The way I'm reducing the file (by customer ID) is not how it will be accessed in Hive. It will be accessed by YearMon and CustomerId... So maybe if I use both of those as my key, it will work as I intend? Year/month would be the higher level, with customer ID as a subset.
Then I'll set up an external Hive table to reference that HDFS file, right?
I wanted to use Spark, actually, but we are on Spark 1.6.4 and I needed to finish before our scheduled upgrade. And as of this writing, I'm a much better MapReduce programmer than Spark programmer (but looking to change that).
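A minimal sketch of what such a two-level key could look like, in plain Java so it runs without Hadoop. The class and field names are hypothetical. An actual MapReduce key would implement org.apache.hadoop.io.WritableComparable and add the write() and readFields() serialization methods, which are left out here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical composite key: year/month is the outer level, customer ID the inner.
class YearMonthCustomerKey implements Comparable<YearMonthCustomerKey> {
    final int yearMonth;      // assumed encoding, e.g. 201703 for March 2017
    final String customerId;

    YearMonthCustomerKey(int yearMonth, String customerId) {
        this.yearMonth = yearMonth;
        this.customerId = customerId;
    }

    // Sort by year/month first, then by customer ID within the same month.
    @Override
    public int compareTo(YearMonthCustomerKey other) {
        int byMonth = Integer.compare(yearMonth, other.yearMonth);
        return byMonth != 0 ? byMonth : customerId.compareTo(other.customerId);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearMonthCustomerKey)) return false;
        YearMonthCustomerKey k = (YearMonthCustomerKey) o;
        return yearMonth == k.yearMonth && customerId.equals(k.customerId);
    }

    @Override
    public int hashCode() {
        return 31 * yearMonth + customerId.hashCode();
    }

    @Override
    public String toString() {
        return yearMonth + "|" + customerId;
    }

    public static void main(String[] args) {
        List<YearMonthCustomerKey> keys = new ArrayList<>();
        keys.add(new YearMonthCustomerKey(201704, "C001"));
        keys.add(new YearMonthCustomerKey(201703, "C999"));
        keys.add(new YearMonthCustomerKey(201703, "C001"));
        Collections.sort(keys);
        // Keys come out grouped by month, with customers ordered inside each month.
        System.out.println(keys);
    }
}
```

With a key like this, each reduce call receives the values for one customer in one month, which fits per-customer, per-month metrics.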