
Migrating HDFS data to S3

New Contributor

Hi, 

Currently I have about 1 TB of data in HDFS that I am trying to migrate to S3 using the command below. Whenever I run it, the job runs very fast for about 3 hours and then slows down drastically. I started it last week and it is still running, very slowly. Is this expected behavior?

nohup hadoop distcp -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.buffer.size=1048576 -Dfs.s3a.multipart.size=10485760 -Dfs.s3a.multipart.threshold=10485760 -Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7360m -m=300 -bandwidth 400 -update hdfs:<....> s3a://<.......>
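A quick way to check whether a long-running DistCp job like this is still making progress is to look at its YARN status and MapReduce counters (a minimal sketch, assuming a standard YARN/MapReduce cluster; the application and job IDs below are placeholders):

# List running applications and find the DistCp job
yarn application -list -appStates RUNNING

# Check overall status and progress of the application
yarn application -status application_1234567890123_0001

# Check map progress and counters of the underlying MapReduce job
mapred job -status job_1234567890123_0001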


Community Manager

@Zubair123, Welcome to our community! To help you get the best possible answer, I have tagged in our HDFS experts @willx @ChethanYM  who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



Master Collaborator

You may want to collect the YARN application log to understand what happened after the first 3 hours; for example, it may be a YARN resource issue or stuck containers.

1. Enable console debug logging, re-run distcp, and save the output:

export HADOOP_ROOT_LOGGER=DEBUG,console

nohup hadoop distcp -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.buffer.size=1048576 -Dfs.s3a.multipart.size=10485760 -Dfs.s3a.multipart.threshold=10485760 -Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7360m -m=300 -bandwidth 400 -update [hdfs path] [s3a path] > distcp_console.out 2>&1 &

2. Collect yarn application logs:

yarn logs -applicationId [applicationID] > /tmp/distcp_application.out

3. If there are stuck YARN containers, collect a jstack of the container PID; refer to the post below (a generic sketch of the procedure also follows after this list):

https://my.cloudera.com/knowledge/How-to-collect-thread-dumps-for-stuck-YARN-containers-via-jstack?i...
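For reference, a generic way to collect thread dumps from a stuck YARN container (a minimal sketch, not necessarily the exact procedure in the linked article; the container ID, owning user, and PID are placeholders):

# On the NodeManager host running the stuck container, find the container's JVM PID
# (the container ID below is a placeholder)
ps -ef | grep container_e01_1234567890123_0001_01_000042

# Take a few thread dumps several seconds apart, as the user that owns the process
sudo -u yarn jstack -l <container_pid> > /tmp/container_jstack_1.txt
sleep 10
sudo -u yarn jstack -l <container_pid> > /tmp/container_jstack_2.txt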

 

New Contributor

@willx I really appreciate the response. It looks like I don't have access to the article:

https://my.cloudera.com/knowledge/How-to-collect-thread-dumps-for-stuck-YARN-containers-via-jstack?i...

Can you please share the solution? I would really appreciate the help.

Thanks,

Zubair.

Community Manager

@Zubair123, Did the response help resolve your query? If it did, kindly mark the relevant reply as the solution, as it will aid others in locating the answer more easily in the future. 



Regards,

Vidya Sargur,
Community Manager



New Contributor

@VidyaSargur I don't have access to the article and am still waiting for the solution to be shared.

Community Manager

@Zubair123, This article is available exclusively for our customers. If you're a customer, please contact our customer support team for more details. If you’re not, our sales team would happily assist you with any information you need. 



Regards,

Vidya Sargur,
Community Manager

