Hi there,
I have got a project where I need to decide the storage system and then schema design for the below problem scenario.
The below is the data which we receive as CSV file for every 1 hour with 1TB of data.
recipe_id,recipe_name, description, ingredient, active,
updated_date, created_date
1, pasta, Italian pasta,
tomato sauce, true, 2018-01-09 10:00:57,
2018-01-10 13:00:57
1, pasta, null, cheese,
true, 2018-01-09 10:10:57, 2018-01-10
13:00:57
2, lasagna, layered lasagna,
cheese, true, 2018-01-09 10:00:57,
2018-01-10 13:00:57
2, lasagna, layered lasagna,
blue cheese, false, 2018-01-09 10:00:57,
2018-01-10 13:00:57 ….
Requirement:
1. XXX need to show a page with list of all the recipes and when user clicks on each
of the recipe they want to show the Recipe page with their ingredients. They
also want user to further click into each of the ingredient and see all the
recipes linked to that ingredients.
- Create
a data model which can store this data to allow user to do the activities
mentioned above. This data model needs to support millions of read per second.
- Which persistence system will be the best for this scenario?
- Write
a Spark Job in Scala which can takes the CSV shown above and store that in the
storage system of your choice using the data model you discussed above.
I know Hive will not support this real time requirement, because it is for batch processing only and there is no random access concept. I think I can go with Hbase but I am not sure about the data model. If i know the data model, I can write an application on Spark using Scala to export the data to Hbase as per the data model.
Could anyone please help with the storage system and the data model?
Appreciate your help!
Thanks,
Krishna