
Migrating HDFS data to S3

New Contributor

Hi, 

Currently I have about 1 TB of data in HDFS that I am trying to migrate to S3 using the command below. Whenever I run it, the job runs very fast for about 3 hours and then slows down drastically. I started it last week and it is still running, very slowly. Is this expected behavior?

nohup hadoop distcp -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.buffer.size=1048576 -Dfs.s3a.multipart.size=10485760 -Dfs.s3a.multipart.threshold=10485760 -Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7360m -m=300 -bandwidth 400 -update hdfs:<....> s3a://<.......>
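A quick way to check whether a long-running DistCp job like this is still making progress is to look at its YARN status and MapReduce counters (a minimal sketch, assuming a standard YARN/MapReduce cluster; the application and job IDs below are placeholders):

# List running applications and find the DistCp job
yarn application -list -appStates RUNNING

# Check overall status and progress of the application
yarn application -status application_1234567890123_0001

# Check map progress and counters of the underlying MapReduce job
mapred job -status job_1234567890123_0001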


Community Manager

@Zubair123, Welcome to our community! To help you get the best possible answer, I have tagged in our HDFS experts @willx @ChethanYM  who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



Master Collaborator

You may want to collect the YARN application log to understand what happened after the first 3 hours; for example, it may be a YARN resource issue or stuck containers.

1. Enable console debug logging, re-run distcp, and save the output:

export HADOOP_ROOT_LOGGER=DEBUG,console

nohup hadoop distcp -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" -Dfs.s3a.fast.upload=true -Dfs.s3a.fast.buffer.size=1048576 -Dfs.s3a.multipart.size=10485760 -Dfs.s3a.multipart.threshold=10485760 -Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7360m -m=300 -bandwidth 400 -update [hdfs path] [s3a path] > distcp_console.out 2>&1 &

2. Collect yarn application logs:

yarn logs -applicationId [applicationID] > /tmp/distcp_application.out

3. If there are stuck YARN containers, collect a jstack of the container PID; refer to the post below (a generic sketch of the procedure also follows after this list):

https://my.cloudera.com/knowledge/How-to-collect-thread-dumps-for-stuck-YARN-containers-via-jstack?i...
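For reference, a generic way to collect thread dumps from a stuck YARN container (a minimal sketch, not necessarily the exact procedure in the linked article; the container ID, owning user, and PID are placeholders):

# On the NodeManager host running the stuck container, find the container's JVM PID
# (the container ID below is a placeholder)
ps -ef | grep container_e01_1234567890123_0001_01_000042

# Take a few thread dumps several seconds apart, as the user that owns the process
sudo -u yarn jstack -l <container_pid> > /tmp/container_jstack_1.txt
sleep 10
sudo -u yarn jstack -l <container_pid> > /tmp/container_jstack_2.txt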

 

New Contributor

@willx I really appreciate the response. It looks like I don't have access to the article:

https://my.cloudera.com/knowledge/How-to-collect-thread-dumps-for-stuck-YARN-containers-via-jstack?i...

Can you please share the solution? I would really appreciate the help.

Thanks,

Zubair.

Community Manager

@Zubair123, Did the response help resolve your query? If it did, kindly mark the relevant reply as the solution, as it will aid others in locating the answer more easily in the future. 



Regards,

Vidya Sargur,
Community Manager



New Contributor

@VidyaSargur I don't have access to the article and am still waiting for the solution to be shared.

Community Manager

@Zubair123, This article is available exclusively for our customers. If you're a customer, please contact our customer support team for more details. If you’re not, our sales team would happily assist you with any information you need. 



Regards,

Vidya Sargur,
Community Manager

