Why does Sqoop take so long to start the MapReduce job?
This is the query:
SELECT MIN(ID_INTERNO_PE), MAX(ID_INTERNO_PE)
FROM (SELECT ID_INTERNO_PE, MI_FECHA_FIN_MES, COD_NRBE_EN, COD_LINEA, ID_GRP_PD,
             MI_NUM_TOT_AC_ACT, MI_NUM_AC_CAN, MI_NUM_AC_SUS, MI_SDO_AC_P, MI_NUM_AC_P,
             MI_DIA_AC_P, MI_INT_DEV_ACR_D, MI_INT_DEV_DEU_D, MI_COMIS_APL_D,
             MI_TOT_MOV_D, MI_TOT_MOV_H, MI_TOT_IMP_MOV_D, MI_TOT_IMP_MOV_H
      FROM RDWC01.MI_CLTE_ECO_GEN
      WHERE MI_FECHA_FIN_MES > TO_TIMESTAMP('2010-01-01', 'YYYY-MM-DD HH24:MI:SS.FF')
        AND MI_FECHA_FIN_MES <= TO_TIMESTAMP('2016-05-31 00:00:00.0', 'YYYY-MM-DD HH24:MI:SS.FF')
        AND (1 = 1)
        AND COD_RL_PERS_AC = 01
        AND COD_LINEA IN ('01','03','04','05')
        AND COD_NRBE_EN = '3008') t1
This is the job configuration:
--num-mappers 20 \
--direct \
--incremental append \
--check-column MI_FECHA_FIN_MES \
--last-value $LAST_ROW \
--split-by ID_INTERNO_PE \
--fetch-size 30000 \
--as-textfile \
--target-dir $SQOOP_OUT \
--outdir $SQOOP_OUTDIR \
--bindir $SQOOP_BINDIR \
Once the MapReduce job starts, it takes 5 minutes, but before it even starts it takes 10 minutes. Why?
This is a nice question. The time is spent allocating the containers needed to run 20 parallel map tasks. Your cluster may have a lot of other work running when you submit this job, so it takes a while for your job to get 20 containers allocated. One direct tuning you can try is different values for --num-mappers, starting by excluding this parameter entirely (Sqoop then defaults to 4 mappers). :)
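As a hedged sketch of that first experiment (connection details like JDBC_URL, DB_USER, and DB_PASS_FILE are placeholders, not from your job):

```shell
# Sketch: run the same import WITHOUT --num-mappers first, so Sqoop
# falls back to its default of 4 mappers. Fewer mappers means fewer
# YARN containers to allocate, which should shorten the wait on a
# busy cluster. Then increase gradually: 4 -> 8 -> 20.
sqoop import \
  --connect "$JDBC_URL" \
  --username "$DB_USER" \
  --password-file "$DB_PASS_FILE" \
  --table RDWC01.MI_CLTE_ECO_GEN \
  --split-by ID_INTERNO_PE \
  --target-dir "$SQOOP_OUT"
```

If the pre-launch wait shrinks noticeably with 4 mappers, container allocation is the bottleneck rather than the query itself.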
If the above approach does not work, paste the portion(s) of the application log where the job waits the longest.
--split-by is an important factor in achieving parallelism, so please give explicit aliases in your statement; that removes any ambiguity about the split column before the job is launched from your query.
For example: MIN(ID_INTERNO_PE) AS ID_INTERNO_PE.
Secondly, your query is not a good candidate for a data load (as it only loads the MIN and MAX values), but if this is the actual use case, then try "--boundary-query" (for more control over parallelism); that way you avoid executing your subquery twice.
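A minimal sketch of what that could look like (other flags omitted; the boundary query shown is an assumption based on your WHERE clause, not something I have run against your schema):

```shell
# Sketch: --boundary-query supplies the min/max split values directly,
# so Sqoop does not have to wrap your whole subquery in a
# SELECT MIN(...), MAX(...) before launching the mappers.
sqoop import \
  --connect "$JDBC_URL" \
  --split-by ID_INTERNO_PE \
  --boundary-query "SELECT MIN(ID_INTERNO_PE), MAX(ID_INTERNO_PE)
                    FROM RDWC01.MI_CLTE_ECO_GEN
                    WHERE COD_NRBE_EN = '3008'" \
  --num-mappers 20 \
  --target-dir "$SQOOP_OUT"
```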
So you recommend typing something like this?
SELECT MIN(ID_INTERNO_PE) AS min_id, MAX(ID_INTERNO_PE) AS max_id FROM (SELECT ID_INTERNO_PE, MI_FECHA_FIN_MES, COD_NRBE_EN
Please try this (I added a compression codec and changed the --split-by):
--num-mappers 20 \
--compression-codec org.apache.hadoop.io.compress.DefaultCodec \
--incremental append \
--check-column MI_FECHA_FIN_MES \
--last-value $LAST_ROW \
--split-by MI_FECHA_FIN_MES \
--fetch-size 30000 \
--target-dir $SQOOP_OUT \
--outdir $SQOOP_OUTDIR \
--bindir $SQOOP_BINDIR \
*** You can also try this after trying the config above:
--boundary-query 'SELECT $LAST_ROW, MAX(MI_FECHA_FIN_MES) FROM RDWC01.MI_CLTE_ECO_GEN' \
and remove the --split-by
Also check the $CONDITIONS clause in Sqoop and see whether you can tune the query by utilizing Sqoop's query rewriting for the mappers.
** Also, I have not run this, so there might be some typos. :)
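For reference, a free-form --query import must contain the literal $CONDITIONS token, which Sqoop rewrites per mapper into a range predicate on the --split-by column, so each mapper pulls only its own slice. A hedged sketch (column list abbreviated, connection flags are placeholders):

```shell
# Sketch: Sqoop substitutes $CONDITIONS with something like
#   ID_INTERNO_PE >= <lo> AND ID_INTERNO_PE < <hi>
# for each mapper, so the row filtering happens inside the database.
# The \$ escape keeps the shell from expanding it inside double quotes.
sqoop import \
  --connect "$JDBC_URL" \
  --query "SELECT ID_INTERNO_PE, MI_FECHA_FIN_MES, COD_NRBE_EN
           FROM RDWC01.MI_CLTE_ECO_GEN
           WHERE COD_NRBE_EN = '3008' AND \$CONDITIONS" \
  --split-by ID_INTERNO_PE \
  --num-mappers 20 \
  --target-dir "$SQOOP_OUT"
```

Note that --query and --table are mutually exclusive, and --split-by is required with --query whenever more than one mapper is used.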