Created 03-18-2014 10:41 AM
We're trying to move data from GPFS to HDFS. Using HDFS put, our write speed into HDFS maxes out at the speed of a single disk. Can we use distcp or some other tool to move data from GPFS to HDFS in parallel?
Created 03-20-2014 05:26 AM
Unfortunately, there are no tools that I am aware of that do that out of the box.
You can always do the following in a script:
This will copy all the files in a given local directory to an HDFS directory. For example:
./copyParallel.sh /tmp/woody/test /user/woody/test
This copies all files from the local directory /tmp/woody/test to /user/woody/test in HDFS, creating /user/woody/test if it does not exist.
#!/bin/bash
# copyParallel.sh <local source dir> <HDFS target dir>
SOURCEDIR="$1"
TARGETDIR="$2"
MAX_PARALLEL=4

# Split the file list into at most MAX_PARALLEL roughly equal batches.
nroffiles=$(ls -1 "$SOURCEDIR" | wc -l)
setsize=$(( nroffiles / MAX_PARALLEL + 1 ))

hadoop fs -mkdir -p "$TARGETDIR"

# One background hadoop fs -put per batch. Process substitution keeps the loop
# in the current shell so that the final wait actually waits for the background
# jobs. $workset is left unquoted on purpose so each filename in the batch
# becomes its own argument (filenames with spaces are not supported).
while read -r workset; do
    hadoop fs -put $workset "$TARGETDIR" &
done < <(ls -1 "$SOURCEDIR"/* | xargs -n "$setsize")
wait
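Once all the puts have finished, a quick sanity check is to compare the file counts on both sides. This is just a sketch using the example paths from above:

# local file count
ls -1 /tmp/woody/test | wc -l
# HDFS file count (lines starting with "-" in the listing are regular files)
hadoop fs -ls /user/woody/test | grep -c '^-'

If the two numbers match, every file made it across; if not, re-run the script for the missing files.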