
Schema design and load data using Spark

Hi there,

I have a project where I need to choose a storage system and design a schema for the problem scenario below.

Below is a sample of the data, which we receive as a CSV file every hour, about 1 TB per file.

recipe_id, recipe_name, description, ingredient, active, updated_date, created_date
1, pasta, Italian pasta, tomato sauce, true, 2018-01-09 10:00:57, 2018-01-10 13:00:57
1, pasta, null, cheese, true, 2018-01-09 10:10:57, 2018-01-10 13:00:57
2, lasagna, layered lasagna, cheese, true, 2018-01-09 10:00:57, 2018-01-10 13:00:57
2, lasagna, layered lasagna, blue cheese, false, 2018-01-09 10:00:57, 2018-01-10 13:00:57
….
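For reference, a minimal sketch of parsing one such row into a Scala case class (field names mirror the header; in a real job Spark's CSV reader would handle this, and the `RecipeRow`/`RecipeCsv` names are illustrative):

```scala
// Hypothetical sketch: parse one CSV row of the feed above.
// The "null" literal in the description column is mapped to None.
case class RecipeRow(
  recipeId: Int,
  recipeName: String,
  description: Option[String],
  ingredient: String,
  active: Boolean,
  updatedDate: String,
  createdDate: String
)

object RecipeCsv {
  def parse(line: String): RecipeRow = {
    val f = line.split(",").map(_.trim)
    RecipeRow(
      recipeId    = f(0).toInt,
      recipeName  = f(1),
      description = if (f(2) == "null") None else Some(f(2)),
      ingredient  = f(3),
      active      = f(4).toBoolean,
      updatedDate = f(5),
      createdDate = f(6)
    )
  }
}
```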


1. XXX need to show a page with a list of all the recipes; when a user clicks on a recipe, they want to show a Recipe page with its ingredients. They also want the user to be able to click further into each ingredient and see all the recipes linked to that ingredient.

  • Create a data model that can store this data and support the activities mentioned above. The data model needs to support millions of reads per second.
  • Which persistence system would be best for this scenario?
  • Write a Spark job in Scala that takes the CSV shown above and stores it in the storage system of your choice, using the data model you discussed above.

I know Hive will not support this real-time requirement, because it is for batch processing only and has no concept of random access. I think I can go with HBase, but I am not sure about the data model. Once I know the data model, I can write a Spark application in Scala to export the data to HBase accordingly.

Could anyone please help with the storage system and the data model?

Appreciate your help!




Re: Schema design and load data using Spark

Super Collaborator

Will you receive 1 TB per hour at the same time as you want to read millions of rows per second? Or is the insertion much less frequent? Hive isn't necessarily out; it depends on your cluster.

To me, the description seems to be written with a relational approach in mind: one table with the recipes, one with the ingredients, and a third cross table holding the relation of ingredients to recipes. With HBase I would personally go a different way, as the access described is always via the recipe, never via the ingredients.

One table, maybe two column families, with the row key being your recipe id (or derived from it, depending on the access pattern; if searches run by name, make the name part of the row key as well). The first column family holds the recipe information (name and description), the second family holds the ingredients. You can store each ingredient as a column whose name is the ingredient and whose value is the active true/false flag.
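A minimal sketch of that single-table model as cells, assuming illustrative family names "r" (recipe info) and "i" (ingredients); in a real Spark job each tuple would become an HBase `Put` via the HBase client or the hbase-spark connector:

```scala
// Hypothetical sketch of the one-table HBase model: rowkey = recipe_id,
// family "r" holds recipe info, family "i" holds one column per ingredient
// whose value is the active flag. Family/table names are assumptions.
case class Cell(rowKey: String, family: String, qualifier: String, value: String)

def toCells(recipeId: String,
            name: String,
            description: Option[String],
            ingredient: String,
            active: Boolean): Seq[Cell] = {
  // Recipe info cells; description is optional (CSV may contain "null").
  val recipeInfo =
    Seq(Cell(recipeId, "r", "name", name)) ++
      description.map(d => Cell(recipeId, "r", "description", d))
  // One ingredient column per row: qualifier = ingredient name, value = active flag.
  val ingredientCol = Seq(Cell(recipeId, "i", ingredient, active.toString))
  recipeInfo ++ ingredientCol
}
```

Because the ingredient name is the column qualifier, re-loading the same recipe simply upserts the existing cells, which fits the hourly full feed.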

Or you stay with HDFS and Hive: simply stream the new data into files in HDFS, and in Hive create table definitions as external tables over the CSV files. You could create two table definitions, one for the recipes and one for the ingredients per recipe, and then query them directly from Spark, not necessarily through Hive.
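The two-table split described above can be sketched with plain Scala collections; in the real job these would be two DataFrames over the external CSV (e.g. one `select` with `dropDuplicates("recipe_id")` for the recipes, one `select` for the ingredient rows). The `Row7`/`splitTables` names are illustrative:

```scala
// Hypothetical sketch of the two-table split over the CSV rows:
// one entry per recipe, and one entry per (recipe, ingredient) pair.
case class Row7(recipeId: Int, name: String, description: String,
                ingredient: String, active: Boolean)

def splitTables(rows: Seq[Row7])
    : (Map[Int, (String, String)], Seq[(Int, String, Boolean)]) = {
  // recipes table: recipe_id -> (name, description); duplicates collapse.
  val recipes = rows.map(r => r.recipeId -> (r.name, r.description)).toMap
  // ingredients table: one row per (recipe_id, ingredient, active).
  val ingredients = rows.map(r => (r.recipeId, r.ingredient, r.active))
  (recipes, ingredients)
}
```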