Sqoop --split-by on a string/varchar column

Expert Contributor

Can a non-numeric column be specified for a --split-by key parameter? What are the potential issues in doing so?

4 REPLIES

Master Guru

No, it must be numeric. According to the documentation: "By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits." The alternative is --boundary-query, which also requires a numeric column; otherwise the Sqoop job will fail. If your table has no such column, the only workaround is to use a single mapper: "-m 1".
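To make the min/max idea concrete, here is a rough sketch of how a numeric range could be cut into one slice per mapper. It is a simplified illustration, not Sqoop's actual splitter code; the function name split_ranges and the column name id are made up for the example.

```shell
#!/bin/sh
# Simplified illustration only, NOT Sqoop's real implementation.
# Cuts the inclusive range [min, max] of a numeric split column into
# at most <mappers> contiguous slices and prints one predicate per slice.
# "id" is a hypothetical column name.
split_ranges() (
  min=$1; max=$2; mappers=$3
  span=$(( max - min + 1 ))
  # Ceiling division so that the slices cover the whole range.
  size=$(( (span + mappers - 1) / mappers ))
  lo=$min
  while [ "$lo" -le "$max" ]; do
    hi=$(( lo + size - 1 ))
    if [ "$hi" -gt "$max" ]; then hi=$max; fi
    echo "id >= $lo AND id <= $hi"
    lo=$(( hi + 1 ))
  done
)

# Values as they might come back from: select min(id), max(id) from <table name>
split_ranges 1 100 4
```

Each printed predicate stands for the slice of rows one mapper would read. With a single mapper ("-m 1") no boundaries are needed at all, which is why that workaround does not depend on the column type.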

Rising Star

The answer above is outdated. It is possible to use a character column as the split-by column.

You only need to add -Dorg.apache.sqoop.splitter.allow_text_splitter=true

after your 'sqoop job' statement like this:

sqoop job -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
    --create ${JOB_NAME} \
    -- \
    import \
    --connect "${JDBC}" \
    --username ${SOURCE_USR} \
    --password-file ${PWD_FILE_PATH} \

There is no guarantee, though, that Sqoop splits your records evenly across your mappers.
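For a one-off import rather than a saved job, the same property can be passed to 'sqoop import'. The sketch below only prints the command (a dry run) so it can be reviewed first. JDBC, SOURCE_USR and PWD_FILE_PATH are the variables from the post above; their values here, as well as TABLE_NAME, SPLIT_COLUMN and TARGET_DIR, are hypothetical placeholders, not values from this thread.

```shell
#!/bin/sh
# Dry-run sketch: prints a Sqoop import command instead of executing it.
# All values below are hypothetical placeholders; replace them with your own.
JDBC="jdbc:mysql://dbhost/exampledb"
SOURCE_USR="example_user"
PWD_FILE_PATH="/user/example/.db_password"
TABLE_NAME="customers"
SPLIT_COLUMN="customer_code"   # assumed to be a varchar column
TARGET_DIR="/data/raw/customers"

print_import_cmd() {
  # Generic -D options have to come directly after the tool name,
  # before any tool-specific arguments such as --connect.
  cat <<EOF
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \\
    --connect "$JDBC" \\
    --username "$SOURCE_USR" \\
    --password-file "$PWD_FILE_PATH" \\
    --table "$TABLE_NAME" \\
    --split-by "$SPLIT_COLUMN" \\
    --num-mappers 4 \\
    --target-dir "$TARGET_DIR"
EOF
}

print_import_cmd
```

If the split column's values cluster around a few prefixes, some of the four mappers may still receive far more rows than others.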

Explorer

For a huge number of rows, the above option will cause duplicates in the result set.
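One way to check an import for this, sketched under the assumption that the output is comma-delimited text with the key in the first field (the helper name find_duplicate_keys and the sample rows are made up for illustration):

```shell
#!/bin/sh
# Prints every key (first comma-separated field) that occurs more than once
# in the rows read from standard input.
find_duplicate_keys() {
  cut -d, -f1 | sort | uniq -d
}

# Hypothetical sample rows; against a real import you might instead run
# something like:
#   hdfs dfs -cat "$TARGET_DIR"/part-m-* | find_duplicate_keys
# where TARGET_DIR is a placeholder for your import directory.
printf '%s\n' 'A001,alice' 'B002,bob' 'A001,alice' | find_duplicate_keys
```

An empty result means no key appears twice. Comparing the imported row count with a count(*) on the source table is a useful cross-check as well.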

Contributor

Thank you @Krish E, did you sort it out now? I am having the same issue. What is your table's size?