Sqoop Import Process

Contributor

I have a general question about how sqoop works.

When you run an import, does sqoop take a snapshot of the table at that time and import that result? Or does it keep an active window of the data?

If the table is rapidly changing, would that cause issues with the data import?

My import query looks like this: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A WHERE \$CONDITIONS
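For context, a free-form query import like that one would typically be invoked along these lines (the JDBC URL, credentials, target directory, and mapper count below are placeholders, not from the original post):

```shell
# Sketch of the free-form query import described above.
# Connection string, user, target dir, and parallelism are assumptions.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username myuser -P \
  --query "SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A WHERE \$CONDITIONS" \
  --split-by R \
  --target-dir /data/mytable \
  --num-mappers 8
```

Note that each mapper runs its own copy of the query, with a different range on the split-by column substituted for $CONDITIONS, which matters for the consistency question below.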

Doing some QA on the data imported to HDFS, I see that the overall row count is close to the source table's, but the data is inconsistent: there are plenty of missing rows and duplicated rows. This is a large table, sitting at ~560 million rows.

I am calculating the row number in order to get a perfect split-by column, but I think that may be where the problem is coming from. It takes a few minutes to calculate the row numbers, and my hunch is that while that's happening, the table is being updated at the source, which throws off the row numbers and the subsequent import.

Any thoughts?

5 Replies

Super Guru
@Josh Persinger

Sqoop does not take a snapshot before moving data. By default, Sqoop imports using the "read committed" transaction isolation level, so whatever data is committed at the time a query runs will be read.
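To illustrate what that means for this particular query (a generic SQL sketch, not Sqoop's actual generated code): under read committed, each mapper's query sees only data committed at the moment it runs, and no snapshot is held across the whole import.

```sql
-- Generic illustration; row ranges and times are hypothetical.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Mapper 1, running at time T1:
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A
WHERE R >= 1 AND R < 70000000;

-- Mapper 2, running at time T2, possibly minutes later:
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A
WHERE R >= 70000000 AND R < 140000000;

-- Any rows committed between T1 and T2 shift the ROW_NUMBER() assignment,
-- so the same physical row can land in both ranges (a duplicate) or in
-- neither range (a missing row).
```

This matches the symptom in the original post: row counts close to the source, but with both duplicates and gaps.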

Contributor

Can you please explain further what read committed isolation is?

Expert Contributor

Does this table have a timestamp column, or a column whose highest value can be tracked (for example an auto-incrementing ID)? If so, a saved sqoop job doing incremental imports would be the most efficient and consistent way to move data from source to destination without losing data.

Please read this tutorial on extracting data from your source in a consistent way:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_dataintegration/content/incrementally-up...
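The incremental approach from that tutorial looks roughly like the following (the job name, table, check column, and connection details here are made-up placeholders):

```shell
# Sketch of a saved incremental-import job (all names are placeholders).
# --check-column must be a column that only ever grows, e.g. an ID
# or a last-modified timestamp.
sqoop job --create mytable_incremental -- import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
  --username myuser -P \
  --table MYTABLE \
  --incremental append \
  --check-column ID \
  --target-dir /data/mytable

# Each run imports only rows whose check column exceeds the last
# value recorded by the previous run:
sqoop job --exec mytable_incremental
```

The saved job stores the last-seen value between runs, which is what makes repeated imports consistent.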

Contributor

Unfortunately there is no timestamp or any other good candidate for a split-by column, which is why I was trying to generate row numbers on the fly.

Expert Contributor

Can you write triggers that copy the data into a new table? That way you could introduce a timestamp and mark each transaction as an insert, update, or delete. I know it may not be possible to influence the source database..
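A simplified sketch of that idea (SQL Server syntax is assumed here because the original query uses ROW_NUMBER() OVER; the table and column names are placeholders):

```sql
-- Hypothetical change-capture table: one row per modification,
-- stamped with a time and an operation flag.
CREATE TABLE SOURCE_TABLE_CHANGES (
    CHANGE_TS  DATETIME2 DEFAULT SYSUTCDATETIME(),
    OPERATION  CHAR(1),      -- 'I', 'U', or 'D'
    COL1       INT           -- ...plus the other source columns
);

-- Simplified trigger; a production version would handle mixed
-- statements and all source columns.
CREATE TRIGGER trg_source_table_changes
ON SOURCE_TABLE
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Inserted or updated rows: 'U' if old versions exist, else 'I'.
    INSERT INTO SOURCE_TABLE_CHANGES (OPERATION, COL1)
    SELECT CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
           COL1
    FROM inserted;

    -- Pure deletes: rows present in deleted but not re-inserted.
    INSERT INTO SOURCE_TABLE_CHANGES (OPERATION, COL1)
    SELECT 'D', COL1
    FROM deleted
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;
```

With CHANGE_TS in place, the change table would then have exactly the kind of monotonically increasing check column that sqoop's incremental import needs.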
