Hi, I am new to Hadoop and have just started trying out a few things, so this question may be naive.
I tried to import data from a Teradata table into a Hive table (ORC format) using Sqoop. Below is the command I used for the import.
sqoop import \
  --connect "jdbc:teradata://tdsrva/database=TEST,logmech=LDAP" \
  --driver com.teradata.jdbc.TeraDriver \
  --query 'select A.USER as id, B.DIM_I as dim_i from TEST.TBL1 B, TEST.TBL2 A where $CONDITIONS and substr(A.col1,1,15) = B.col1' \
  --split-by dim_i \
  --hcatalog-database DEFAULT \
  --hcatalog-table sqoop_import_test \
  --username test_user \
  --password test1234 \
  --fetch-size 10000 \
  -m 1 \
  --verbose
Initially I ran it without -m, which created 4 mappers by default. In both cases the mapper runs for a long time (~15 mins) without any response. When -m is not specified, only 1 of the 4 mappers completes in the first 10 mins; the rest show the error below and do not complete even after 20 to 30 mins, so I had to kill the job manually.
"15/12/30 13:31:54 INFO mapreduce.Job: Task Id : attempt_1449790668797_14580_m_000000_2, Status : FAILED AttemptID:attempt_1449790668797_14580_m_000000_2 Timed out after 300 secs"
The mappers are stuck at the line below in the log. It takes a long time after this line (or, in some cases, does not proceed at all). What does this step do?
org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
The query responds in less than 5 secs in Teradata when executed from bteq, but the Sqoop job is still too slow, and I am not sure it would even complete. The total record count is less than 10k. Is this the usual time taken for Sqoop jobs to import records from Teradata to Hive?
Can you please suggest some tips to improve the performance? To begin with, what should I look at to improve the performance or to understand why the job is slow? What parameters can be added, tuned, or removed?
Appreciate your help! Thanks.
Thanks for the quick reply, Scott. That's my next step; I am going through the TDCH documentation. But I would like to understand why Sqoop is not fast enough (or is this the usual behaviour?). I would also like to know where to begin in order to understand the root cause of why this is running slow, and why it is stuck at the ResourceCalculatorProcessTree step.
Hi @R M,
The tasks failed because they were idle for 300 secs. It looks like the FAILED tasks had issues pulling the data. It would be great if you could post the full logs of a failed task. Pull the logs with DEBUG as the log level.
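In case it helps with pulling those logs, here is a minimal sketch. It assumes the cluster has YARN log aggregation enabled and that the yarn CLI is on your PATH; the output file name is just an illustration. The application ID is derived from the attempt ID in your error message.

```shell
# Attempt ID copied from the failure message in the question.
attempt_id="attempt_1449790668797_14580_m_000000_2"

# Map attempts are named attempt_<cluster-ts>_<job-seq>_m_<task>_<try>;
# the YARN application ID reuses the first two numeric fields.
rest="${attempt_id#attempt_}"
app_id="application_${rest%%_m_*}"
echo "$app_id"

# Fetch the aggregated container logs into a file (hypothetical file name).
# Skipped silently when the yarn CLI is not installed on this machine.
if command -v yarn >/dev/null 2>&1; then
  yarn logs -applicationId "$app_id" > "${app_id}.log"
fi

# To get DEBUG output from the mappers on the next run, the generic option
#   -Dmapreduce.map.log.level=DEBUG
# can be placed right after "sqoop import", before the tool-specific arguments.
```

The resulting file should contain the stdout, stderr and syslog of each map attempt, which is usually where a stalled JDBC connection shows up.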
This problem is fixed now. There was no issue with the code. The admin told me there was a network configuration issue; he changed the configuration and the issue is now resolved. I am not sure what was changed, though.
Thanks for your help!
Great. Please try your Sqoop query with the --direct flag (where the connector supports it); it should improve performance in general. Other than that, please accept one of the answers to close the issue.