
Best practice to load multiple client data into Hadoop


We are creating a POC on the Hadoop framework. We want to load data for multiple clients into Hive tables.

As of now, we have a separate database for each client on SQL Server. This infrastructure will remain the same for OLTP; Hadoop will be used for OLAP. We have some primary dimension tables that are identical for every client, and all client databases share the same schema, so these tables carry the same primary key values. Until now this was fine, because each client had its own database. Now we are trying to load data for multiple clients into the same containers (Hive tables), so if we load data directly into Hive from multiple SQL Server databases through a Sqoop job, we will end up with multiple rows that have the same primary key value. I am thinking of using a surrogate key in the Hive tables, but Hive does not support auto-increment; it can, however, be achieved with a UDF.
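
The collision and the composite-key fix can be sketched outside Hive. The following uses SQLite purely as a stand-in (table and column names are invented for illustration) to show that (client_id, source primary key) stays unique without any auto-increment, and that a deterministic surrogate key can be derived from the pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Target "Hive-like" table: the source primary key alone is no longer
# unique across clients, so the composite (client_id, user_id) plays that role.
cur.execute("""
    CREATE TABLE t_user (
        client_id TEXT    NOT NULL,
        user_id   INTEGER NOT NULL,   -- primary key from the source SQL Server DB
        name      TEXT,
        PRIMARY KEY (client_id, user_id)
    )
""")

# Both clients legitimately have a user with id 1 in their own source databases.
rows = [("client_a", 1, "alice"), ("client_b", 1, "bob")]
cur.executemany("INSERT INTO t_user VALUES (?, ?, ?)", rows)

# A surrogate key can be derived deterministically; no auto-increment needed.
cur.execute("SELECT client_id || '-' || user_id FROM t_user ORDER BY client_id")
keys = [r[0] for r in cur.fetchall()]
print(keys)  # ['client_a-1', 'client_b-1']
```

In Hive the same derivation would be an expression like `concat(client_id, '-', user_id)` in the load query, which avoids needing a stateful auto-increment UDF.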

We don't want to modify the SQL Server data, as it is running production data.

a. What is the standard/generic way to load data for multiple clients into the Hadoop ecosystem? We never want the data of different clients to get mixed. Referential constraints are also missing in Hive.

b. How can the primary key of a SQL Server table be mapped easily to a Hadoop Hive table, so that we can pick data by client name?

c. How can we ensure that one client is never able to see another client's data?

Thanks

1 ACCEPTED SOLUTION


Re: Best practice to load multiple client data into Hadoop

@Nirvana India

1) You can have the same setup in Hive: create a separate database in HCatalog for each client and use Sqoop to import the client tables — a 1-to-1 mapping.

2) You can keep the same table structure in Hive as in SQL Server (Sqoop will take care of creating the tables in Hive).

3) Ranger with Hive policies provides authorization at the table and column level. You can use views for row-level control.
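
The per-client import in step 1 could be scripted. A rough sketch that only builds the Sqoop command strings per client database — the JDBC host, credentials path, and client database names are placeholders, and you should check your Sqoop version for `--hive-database` support:

```python
# Hypothetical client databases on SQL Server; names and the JDBC URL are assumptions.
clients = ["client_a", "client_b"]

def sqoop_import_cmd(client: str) -> str:
    """Build a sqoop import-all-tables call that mirrors one SQL Server
    database into a same-named Hive database (1-to-1 mapping)."""
    return (
        "sqoop import-all-tables "
        f"--connect 'jdbc:sqlserver://dbhost:1433;databaseName={client}' "
        "--username etl_user --password-file /user/etl/.pw "
        "--hive-import "
        f"--hive-database {client}"
    )

cmds = [sqoop_import_cmd(c) for c in clients]
for cmd in cmds:
    print(cmd)
```

Because each client lands in its own Hive database, the per-database Ranger policies in step 3 then map naturally onto the imported data.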

PS: Hive is different from an RDBMS, and I am sure you are not comparing SQL Server and Hive.

Good read




Re: Best practice to load multiple client data into Hadoop

Hi Neeraj,

Right, we are not comparing SQL Server and Hive.

Could you tell us what happens if we try to load the data of all the clients into a single database?

Will there be any disadvantages of doing this?

We are planning to compare the data of different clients in the future. Keeping it in the same database would let all clients share the same schema.

Thanks


Re: Best practice to load multiple client data into Hadoop

@Nirvana India There is no harm in keeping the data in the same database but in different tables.


Re: Best practice to load multiple client data into Hadoop

We cannot create a copy of the same table for each client, like T_User_Client1, T_User_Client2.

What if we create a single table T_User and add a new column, clientId, to identify the client? We would add this column to all tables, or introduce some sort of surrogate key to keep all the data unique. In that case we would have a single schema, and if we had to change the schema, not much work would be needed.


Re: Best practice to load multiple client data into Hadoop

@Nirvana India I did not say the same table; I said the same database with different tables. You can just map the tables from SQL Server into Hive using Sqoop. If that serves the purpose, I wouldn't go into the complication of adding a new column.


Re: Best practice to load multiple client data into Hadoop

- a/c) As Neeraj says, Hive provides access controls. You could, for example, create one database per client.

Access control at the table and column level can be done with Ranger or SQLStdAuth (or simple file-system authorization).

https://community.hortonworks.com/content/kbentry/597/getting-started-with-sqlstdauth.html

- If the client data is in the same table you have a bigger problem: Hive does not yet provide row-level authorization. However, you could create one view per customer with a WHERE clause; that is most likely not feasible for a large number of clients.
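
The one-view-per-customer idea can be illustrated with SQLite standing in for Hive (all table and view names here are invented). Ranger or grants would then expose only the matching view to each client's users, never the base table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t_user (client_id TEXT, user_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO t_user VALUES (?, ?, ?)",
                [("client_a", 1, "alice"), ("client_b", 1, "bob")])

# One filtered view per client; note the client_id column itself is not
# exposed, so a client never even sees the discriminator.
for client in ("client_a", "client_b"):
    cur.execute(
        f"CREATE VIEW v_user_{client} AS "
        f"SELECT user_id, name FROM t_user WHERE client_id = '{client}'"
    )

cur.execute("SELECT name FROM v_user_client_a")
names = [r[0] for r in cur.fetchall()]
print(names)  # ['alice']
```

The same `CREATE VIEW ... WHERE client_id = '...'` pattern works in HiveQL; the maintenance cost is that every new client needs a new view plus a new policy, which is why this scales poorly to many clients.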

b) I don't completely understand this. Does the SQL Server primary key contain the client id? You can bring it into Hive the same way you use it in SQL Server; you just have no row-level access controls, if that is what you are asking.

Also, as you say, there are no referential constraints, which is actually quite normal for warehousing databases. So you need to do the constraint checks during loads.
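
Since Hive won't enforce the constraint, a load-time check is typically a LEFT JOIN from the staged rows to the referenced dimension to find orphans before promoting the batch. A SQLite stand-in sketch (table names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_client (client_id TEXT PRIMARY KEY)")
cur.execute("CREATE TABLE stg_orders (client_id TEXT, order_id INTEGER)")
cur.execute("INSERT INTO dim_client VALUES ('client_a')")
cur.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                [("client_a", 10), ("client_x", 11)])  # client_x is an orphan

# No FK enforcement in Hive, so detect orphans explicitly: staged rows
# whose client_id has no match in the dimension.
cur.execute("""
    SELECT o.client_id, o.order_id
    FROM stg_orders o
    LEFT JOIN dim_client c ON o.client_id = c.client_id
    WHERE c.client_id IS NULL
""")
orphans = cur.fetchall()
print(orphans)  # [('client_x', 11)]
```

If `orphans` is non-empty, the load job can fail or quarantine those rows; the identical anti-join runs unchanged as HiveQL against staging tables.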


Re: Best practice to load multiple client data into Hadoop

Mentor

@Nirvana India Has this been resolved? Can you post your solution or accept the best answer?