
Best way to move data faster than the speed of a single disk from GPFS to HDFS?

Explorer

We're trying to move data from GPFS to HDFS. Using hdfs dfs -put caps our write speed into HDFS at the speed of a single disk. Can we use distcp or some other tool to move data in parallel from GPFS to HDFS?

1 REPLY

Re: Best way to move data faster than the speed of a single disk from GPFS to HDFS?

Cloudera Employee

Unfortunately, there is no precanned tool that I am aware of that does this.

You can always do it with a script like the one below, which copies all files in a given local directory to an HDFS directory in parallel. For example:

./copyParallel.sh /tmp/woody/test /user/woody/test

This copies every file from the local /tmp/woody/test to /user/woody/test in HDFS, creating /user/woody/test if it does not exist.

#!/usr/bin/env bash
# Usage: ./copyParallel.sh <local-source-dir> <hdfs-target-dir>
# Assumes filenames contain no whitespace (xargs splits on it).
SOURCEDIR="$1"
TARGETDIR="$2"
MAX_PARALLEL=4

# Size each batch so the file list splits into roughly MAX_PARALLEL groups.
nroffiles=$(ls "$SOURCEDIR" | wc -w)
setsize=$(( nroffiles / MAX_PARALLEL + 1 ))

hadoop fs -mkdir -p "$TARGETDIR"

# xargs emits setsize filenames per line; process substitution keeps the
# loop in the main shell, so the final wait sees the background puts.
while read -r workset; do
  hadoop fs -put $workset "$TARGETDIR" &  # workset is a list; leave unquoted
done < <(ls -1 "$SOURCEDIR"/* | xargs -n "$setsize")
wait
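
On the distcp part of the question: distcp can read from a file:// URI, but only if the same local path is mounted and visible on every worker node. Since GPFS is normally mounted cluster-wide, something along these lines may be worth testing (the paths here are placeholders, not from the original post):

# -m caps the number of parallel map tasks doing the copy
hadoop distcp -m 16 file:///gpfs/data/incoming /user/woody/incoming

Unlike the script, which runs all the puts from one client machine, distcp spreads the copy across map tasks on the cluster, so it is not limited by a single node's disk or network.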