
Limiting Sqoop imports to some time intervals

Rising Star

Hi,

We are trying to set up ingestion of data from databases into Hive using Sqoop. The problem is that the databases are used in production, so we cannot load them too heavily during certain working hours. There are many tables and some of them are quite large (> 2 billion rows), so it is likely that we cannot ingest them all within the available time window. It is also difficult to write general delta queries that would run for a given amount of time and no longer.

I am thinking about the possibility of implementing such a feature directly in Sqoop. I am not very familiar with the Sqoop implementation, but I assume there is a loop in which a row is read from the JDBC result set, converted, and stored in the Hive table.

All that would be required is a check in this loop that waits/sleeps for some time if the import is running during the database's working hours. That way the ingest would run at full speed outside working hours and would be significantly throttled (but still running) when the database must not be overloaded.
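Roughly, the check I have in mind might look something like the sketch below. This is not actual Sqoop code; the class, method, and constant names are made up for illustration, and the working-hours window, batch size, and sleep interval are just assumptions.

```java
import java.time.LocalTime;

/**
 * Sketch of the throttle check described above (not real Sqoop code).
 * The per-record import loop would call maybeThrottle() and sleep briefly
 * whenever the current time falls inside the database's working hours,
 * slowing the import down without stopping it.
 */
public class ImportThrottle {

    // Assumed working-hours window during which the source database is busy.
    private static final LocalTime BUSY_START = LocalTime.of(8, 0);
    private static final LocalTime BUSY_END = LocalTime.of(18, 0);

    // Pause inserted after each batch of rows while inside the busy window.
    private static final long SLEEP_MILLIS = 200;
    private static final int BATCH_SIZE = 1000;

    private long rowsSinceLastCheck = 0;

    /** Call once per imported row; sleeps periodically during busy hours. */
    public void maybeThrottle() throws InterruptedException {
        if (++rowsSinceLastCheck < BATCH_SIZE) {
            return;
        }
        rowsSinceLastCheck = 0;
        LocalTime now = LocalTime.now();
        if (!now.isBefore(BUSY_START) && now.isBefore(BUSY_END)) {
            Thread.sleep(SLEEP_MILLIS);
        }
    }

    // Tiny demo loop standing in for the "load row -> convert -> store" loop.
    public static void main(String[] args) throws InterruptedException {
        ImportThrottle throttle = new ImportThrottle();
        for (int row = 0; row < 10_000; row++) {
            // ... load a row from the JDBC result set and write it to Hive ...
            throttle.maybeThrottle();
        }
        System.out.println("Demo import loop finished.");
    }
}
```

Checking the clock only once per batch would keep the overhead negligible while still reacting within a few thousand rows.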

What do you think? Does this sound like a feature that could be useful to someone else as well?

Thanks,

Pavel

1 ACCEPTED SOLUTION

Rising Star

We have solved this problem in our environment in a number of different ways. With Falcon, we schedule Sqoop imports at a regular frequency so that they run during predetermined "not busy" time intervals. We have also put logic into bash scripts run through Oozie that, when it determines it is "not safe" to run at the moment, either sleeps or terminates for that instance and tries again later.

If you have a window, say 3am-5am, in which you can feasibly connect and pull data, you could set up a sleep/wait loop until either a specific time has passed or the system is available. Plenty of options; what you mentioned is definitely feasible!
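As a very rough sketch of the sleep-until-window idea (the 03:00-05:00 window, connection string, and table name below are placeholders, not anything specific to your environment):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

/**
 * Sketch: block until the next occurrence of an assumed off-peak window
 * (03:00-05:00 here) and only then kick off the import.
 */
public class WaitForWindow {

    private static final LocalTime WINDOW_START = LocalTime.of(3, 0);
    private static final LocalTime WINDOW_END = LocalTime.of(5, 0);

    public static void main(String[] args) throws Exception {
        LocalDateTime now = LocalDateTime.now();
        LocalTime t = now.toLocalTime();

        // If we are not already inside the window, sleep until it next opens.
        if (t.isBefore(WINDOW_START) || !t.isBefore(WINDOW_END)) {
            LocalDateTime nextStart = now.toLocalDate().atTime(WINDOW_START);
            if (!nextStart.isAfter(now)) {
                nextStart = nextStart.plusDays(1);
            }
            long millis = Duration.between(now, nextStart).toMillis();
            System.out.println("Sleeping " + millis + " ms until the window opens.");
            Thread.sleep(millis);
        }

        // Placeholder command; connection details and table name are assumptions.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/sales",
                "--table", "orders",
                "--hive-import");
        pb.inheritIO();
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop import exited with code " + exitCode);
    }
}
```

The same logic can just as easily live in a bash wrapper called from an Oozie shell action, which is how we run it.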


3 REPLIES


@Pavel Benes

This is an interesting idea. Typically there are various ways to ensure that a large quantity of data is ingested successfully, such as running frequent smaller updates, running more mappers in the Sqoop job, or adding infrastructure (obviously a last resort).
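For example, one of those smaller incremental pulls with more mappers could be driven programmatically along these lines. This is only a sketch, assuming the Sqoop 1.x client jars are on the classpath; the connection string, table, and column names are placeholders.

```java
import org.apache.sqoop.Sqoop;

/**
 * Sketch of the "frequent smaller updates with more mappers" approach:
 * an incremental append import that only pulls rows added since the
 * last run, split across several parallel mappers.
 */
public class IncrementalImport {
    public static void main(String[] args) {
        String[] sqoopArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales",   // placeholder connection
                "--table", "orders",                        // placeholder table
                "--incremental", "append",                  // only rows added since last run
                "--check-column", "order_id",
                "--last-value", "123456789",                // normally taken from the previous run
                "--num-mappers", "8",                       // more parallel mappers per job
                "--split-by", "order_id",
                "--hive-import"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.out.println("Sqoop finished with exit code " + exitCode);
    }
}
```

The --last-value would normally come from the previous run's output or from Sqoop's saved-job metastore rather than being hard-coded.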

I am not able to see how we could reliably pull the data by setting a timer-based stop on the process. This is definitely worth discussing, though.

Could you maybe open a new thread (as an idea rather than a question) with some details around the design? We should be able to brainstorm it there.

Master Mentor

@Pavel Benes, has this been resolved? Can you post your solution or accept the best answer?