Sqoop --split-by on a string/varchar column

Contributor

Can a non-numeric column be specified for a --split-by key parameter? What are the potential issues in doing so?


No, it must be numeric, because according to the documentation: "By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits." The alternative is to use --boundary-query, which also requires numeric columns. Otherwise the Sqoop job will fail. If your table has no such column, the only workaround is to use a single mapper: "-m 1".
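To picture what "boundaries for creating splits" means for a numeric column, here is a simplified sketch that cuts a min/max range into one range per mapper. This is an illustration only, not Sqoop's actual splitter code; the function name split_ranges, the column name id, and the values 1, 1000 and 4 are all hypothetical.

```shell
#!/bin/sh
# Simplified illustration: divide [min, max] into n roughly equal ranges,
# one per mapper. Not Sqoop's real implementation.
split_ranges() {
  col=$1; min=$2; max=$3; n=$4
  step=$(( (max - min) / n ))   # integer width of each range
  i=0
  lo=$min
  while [ "$i" -lt "$n" ]; do
    if [ "$i" -eq $(( n - 1 )) ]; then
      # The last range is closed so that the max value is included.
      echo "$col >= $lo AND $col <= $max"
    else
      hi=$(( lo + step ))
      echo "$col >= $lo AND $col < $hi"
      lo=$hi
    fi
    i=$(( i + 1 ))
  done
}

# Hypothetical example: column "id", min 1, max 1000, 4 mappers.
split_ranges id 1 1000 4
```

Each printed line stands for the WHERE condition one mapper would use. Ranges of equal width only give equal work when the values are spread evenly between min and max.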

Contributor

The answer above is outdated. It is possible to use a character column as the split-by column.

You only need to add -Dorg.apache.sqoop.splitter.allow_text_splitter=true

after your 'sqoop job' statement, like this:

sqoop job -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
    --create ${JOB_NAME} \
    -- \
    import \
    --connect "${JDBC}" \
    --username ${SOURCE_USR} \
    --password-file ${PWD_FILE_PATH} \

There is no guarantee, though, that Sqoop will split your records evenly across your mappers.
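For a one-off import rather than a saved job, the same property can be passed straight to 'sqoop import'. The sketch below is a minimal example under assumptions: the connection string, user, password file path, table name, column name, mapper count and target directory are all hypothetical placeholders, and run_import is just a helper name chosen here. Calling it with 'echo sqoop' prints the command for review instead of running it.

```shell
#!/bin/sh
# Hypothetical placeholder values; none of these come from the thread.
JDBC="jdbc:mysql://dbhost:3306/sales"
SOURCE_USR="etl_user"
PWD_FILE_PATH="/user/etl/.db_password"
TABLE="customers"
SPLIT_COL="customer_code"   # assumed to be a varchar column

# Builds the import command. "$@" is the launcher: pass `sqoop` to execute,
# or `echo sqoop` to only print the command (dry run).
run_import() {
  "$@" import \
    -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
    --connect "$JDBC" \
    --username "$SOURCE_USR" \
    --password-file "$PWD_FILE_PATH" \
    --table "$TABLE" \
    --split-by "$SPLIT_COL" \
    --num-mappers 4 \
    --target-dir "/tmp/${TABLE}_import"
}

# Dry run: print the full command without contacting the database.
run_import echo sqoop
```

Note that the -D property has to come directly after the tool name (import), before the tool-specific arguments such as --connect.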

Explorer

For a huge number of rows, the above option will cause duplicates in the result set.

Explorer

Thank you @Krish E, did you sort it out now? I am having the same issue. What is your table's size?
