Explorer
Posts: 14
Registered: 12-19-2013

Best way to move data faster than the speed of a single disk from GPFS to HDFS?

We're trying to move data from GPFS to HDFS. Using HDFS put maxes out our write speed into HDFS at the speed of a single disk. Can we use distcp or some other tool to move data in parallel from GPFS to HDFS?

Cloudera Employee
Posts: 9
Registered: 08-15-2013

Re: Best way to move data faster than the speed of a single disk from GPFS to HDFS?

Unfortunately, there is no tool I am aware of that does that out of the box.

 

You can always do the following in a script. It copies all files in a given local directory to an HDFS directory, running several puts in parallel:

 

./copyParallel.sh /tmp/woody/test /user/woody/test

 

This copies all files from the local /tmp/woody/test to /user/woody/test in HDFS, creating /user/woody/test first if it does not exist.

 

#!/bin/bash
# Usage: ./copyParallel.sh <local source dir> <HDFS target dir>
SOURCEDIR="$1"
TARGETDIR="$2"
MAX_PARALLEL=4

# Split the file list into MAX_PARALLEL roughly equal batches,
# e.g. 10 files with MAX_PARALLEL=4 gives a batch size of 3.
nroffiles=$(ls "$SOURCEDIR" | wc -w)
setsize=$(( nroffiles / MAX_PARALLEL + 1 ))

hadoop fs -mkdir -p "$TARGETDIR"

# Process substitution (not a pipe) keeps the loop in the current shell,
# so the final "wait" really blocks on the backgrounded puts. $workset is
# deliberately unquoted so each filename becomes its own argument.
while read -r workset; do
  hadoop fs -put $workset "$TARGETDIR" &
done < <(ls -1 "$SOURCEDIR"/* | xargs -n "$setsize")
wait
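
 

If your GPFS filesystem is mounted on the client machine, a run might look like the sketch below; /gpfs/woody/data is a hypothetical mount point, so substitute your own paths. Comparing hadoop fs -du -s against local du is a quick sanity check that the byte totals match after the copy.

 

./copyParallel.sh /gpfs/woody/data /user/woody/data

hadoop fs -ls /user/woody/data
hadoop fs -du -s /user/woody/data    # total bytes now in HDFS
du -sb /gpfs/woody/data              # total bytes on GPFS (GNU du)

 

Raising MAX_PARALLEL in the script starts more puts at once; each put is its own HDFS write stream, which is what gets you past single-disk write speed, up to the limit of the client's network link.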