Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

sqoop append incremental query slow

sqoop append incremental query slow

Super Collaborator

Hi:

why sqoop take long type to start the map reduce??

this is the query

SELECT MIN(ID_INTERNO_PE), MAX(ID_INTERNO_PE) FROM (select ID_INTERNO_PE,MI_FECHA_FIN_MES,COD_NRBE_EN,COD_LINEA,ID_GRP_PD,MI_NUM_TOT_AC_ACT,MI_NUM_AC_CAN,MI_NUM_AC_SUS,MI_SDO_AC_P,MI_NUM_AC_P,MI_DIA_AC_P,MI_INT_DEV_ACR_D,MI_INT_DEV_DEU_D,MI_COMIS_APL_D,MI_TOT_MOV_D,MI_TOT_MOV_H,MI_TOT_IMP_MOV_D,MI_TOT_IMP_MOV_H from RDWC01.MI_CLTE_ECO_GEN where MI_FECHA_FIN_MES > TO_TIMESTAMP('2010-01-01', 'YYYY-MM-DD HH24:MI:SS.FF') AND MI_FECHA_FIN_MES <= TO_TIMESTAMP('2016-05-31 00:00:00.0', 'YYYY-MM-DD HH24:MI:SS.FF') AND  (1 = 1)  AND COD_RL_PERS_AC = 01 AND COD_LINEA in ('01','03','04','05') AND COD_NRBE_EN = '3008') t1

this is the configuration job:

--num-mappers 20 \
--direct \
--incremental append \
--check-column MI_FECHA_FIN_MES \
--last-value $LAST_ROW \
--split-by ID_INTERNO_PE \
--fetch-size 30000 \
--as-textfile \
--target-dir $SQOOP_OUT \
--outdir $SQOOP_OUTDIR \
--bindir $SQOOP_BINDIR \

but when the map reduce start it take 5minutes, but before to start it take 10 minutes why???

6 REPLIES 6

Re: sqoop append incremental query slow

@Roberto Sancho

This is a nice question. The time taken is to allocate the containers to run 20 parallel threads. It might be your cluster has lot of things running when you are running this job and so it takes time for your job to get allocated 20 containers. One direct tuning you can try is setting up different values for --num-mappers starting with excluding this parameter first:)

You can paste the portion(s) of the application log where it waits the maximum amount of time if the above approach does not work.

Thanks

Re: sqoop append incremental query slow

Super Collaborator

Hi, even if i put 5 mapper, still slow.

Re: sqoop append incremental query slow

Split-by is an important factor to achieve parallelism therefore could you please give alias explicitly to your statement so that DAG can clear some confusion if it have before launching job though your query.

for example MIN(ID_INTERNO_PE) as ID_INTERNO_PE.

Secondly your query is not a good candidate for data load(as its loading min and max value) but if this is the actual use case then try using "--boundary-query" (to have more control on parallelism) and that way you can avoid executing your subquery two times.

Re: sqoop append incremental query slow

Super Collaborator

so you recomend me to type somenthing like this

SELECT MIN(ID_INTERNO_PE) as id, MAX(ID_INTERNO_PE) as id FROM (select ID_INTERNO_PE,MI_FECHA_FIN_MES,COD_NRBE_EN

Re: sqoop append incremental query slow

@Roberto Sancho

Please try this (I added the compression-codec and changed the --split-by) :

--num-mappers 20 \

--direct \

--compression-codec org.apache.hadoop.io.compress.zlib \

--incremental append \

--check-column MI_FECHA_FIN_MES \

--last-value $LAST_ROW \

--split-by MI_FECHA_FIN_MES \

--fetch-size 30000 \

--as-textfile \

--target-dir $SQOOP_OUT \

--outdir $SQOOP_OUTDIR \

--bindir $SQOOP_BINDIR \

*** You can also try this after trying the above config

--boundary-query 'SELECT $LAST_ROW, MAX(MI_FECHA_FIN_MES) FROM RDWC01.MI_CLTE_ECO_GEN' \

and remove the --split-by

And check $CONDITIONS clause in sqoop and see if you can tune the query by utilizing sqoop's query rewrite for the mappers

**Also I have not run this so there might be some typo:)

Highlighted

Re: sqoop append incremental query slow

@Roberto Sancho did the solution worked?

Don't have an account?
Coming from Hortonworks? Activate your account here