
Sqoop Import Process

Contributor

I have a general question about how sqoop works.

When you run an import, does sqoop take a snapshot of the table at that time and import that result? Or does it keep an active window of the data?

If the table is rapidly changing, would that cause issues with the data import?

My import query looks like this: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A WHERE \$CONDITIONS

Upon doing some QA on the imported data to HDFS, I am seeing the overall row count is close to the source table, but the data is inconsistent. For example, there are plenty of missing rows and duplicated rows. This is a large table sitting at ~560 million rows.

I am calculating the row number in order to get a perfect split-by, but I think that may be where the problem is coming from. It takes a few minutes to calculate the row numbers and my hunch is that while that's happening, the table is being updated at the source and it is throwing the row numbers and subsequent import off.
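For reference, this is roughly how I'm invoking the import (the connection string, credentials, and paths below are placeholders, not my real values — only the query and the split column match what I'm actually running):

```shell
# Placeholder JDBC URL, username, and target directory; only --query and
# --split-by reflect the actual import described above.
QUERY='SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A WHERE $CONDITIONS'

sqoop import \
  --connect 'jdbc:oracle:thin:@//dbhost:1521/ORCL' \
  --username loader -P \
  --query "$QUERY" \
  --split-by R \
  --target-dir /user/etl/table_import \
  --num-mappers 8
```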

Any thoughts?

5 REPLIES

Re: Sqoop Import Process

Super Guru
@Josh Persinger

Sqoop does not take a snapshot before moving data. By default it imports at the "read committed" isolation level: each mapper's query sees whatever data was committed at the moment that query runs. So if your data is committed, it will be read — but rows committed mid-import may be seen by some mappers and not others.
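To make that concrete: Sqoop first runs a bounding query over your split column, then substitutes $CONDITIONS with a different range for each mapper. Each of those statements runs as its own read-committed transaction at its own time, so a row whose ROW_NUMBER shifts between ranges mid-import gets skipped or read twice. A sketch of the substitution (the split ranges here are made up for illustration; Sqoop derives real ones from MIN/MAX of the split column):

```shell
# Illustrative only: the two ranges below are invented. Sqoop computes real
# split points from a bounding query (roughly SELECT MIN(R), MAX(R) ...).
QUERY='SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A WHERE $CONDITIONS'

for RANGE in 'R >= 1 AND R < 280000000' 'R >= 280000000 AND R <= 560000000'; do
  # Each mapper executes one of these statements, in its own transaction:
  echo "$QUERY" | sed "s/[$]CONDITIONS/$RANGE/"
done
```

Because the row numbers are recomputed inside every mapper's query, any insert, delete, or reorder at the source between mapper runs changes which physical rows fall in each range — which matches the missing and duplicated rows you're seeing.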

Re: Sqoop Import Process

Contributor

Can you please explain further what read committed isolation is?


Re: Sqoop Import Process

Expert Contributor

Does this table have a timestamp column, or an indexed column whose highest value can be tracked? If it does, then a Sqoop job would be the most efficient and consistent way to move data from source to destination without losing data.

Please read this tutorial on extracting data from your source in a consistent way:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_dataintegration/content/incrementally-up...
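As a sketch of what that looks like (the table, column, and connection names here are hypothetical — this assumes the source has a LAST_UPDATED timestamp column and an ID primary key):

```shell
# Hypothetical table/column/connection names; assumes a LAST_UPDATED
# timestamp column and an ID merge key exist on the source table.
sqoop job --create orders_incremental -- import \
  --connect 'jdbc:oracle:thin:@//dbhost:1521/ORCL' \
  --username loader -P \
  --table ORDERS \
  --incremental lastmodified \
  --check-column LAST_UPDATED \
  --merge-key ID \
  --target-dir /user/etl/orders

# Each execution imports only rows changed since the saved last-value,
# and the job advances that value automatically afterwards:
sqoop job --exec orders_incremental
```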


Re: Sqoop Import Process

Contributor

Unfortunately there is no timestamp or any other good candidate for a split-by, which is why I was trying to generate row numbers on the fly.


Re: Sqoop Import Process

Expert Contributor

Could you write triggers that copy the data into a new table, where you introduce a timestamp and mark each transaction as an insert, update, or delete? I know it may not always be possible to make changes on the source database side.
