Created 08-22-2017 01:16 PM
For testing purposes I want to create a very large number of empty directories in HDFS, let's say 1 million.
What I tried was to use `hdfs dfs -mkdir` to create 8K directories per call and repeat that in a for loop:
for i in {1..125}
do
  dirs=""
  for j in {1..8000}; do
    dirs="$dirs /user/d$i.$j"
  done
  echo "$dirs"
  hdfs dfs -mkdir $dirs
done
Apparently it takes hours to create 1M folders this way.
My question is, what would be the fastest way to create 1M empty folders?
Created 08-23-2017 09:05 AM
I think the Java API should be the fastest.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDirs {

    // Shared HDFS client handle used by all worker threads.
    private static FileSystem fs;

    static class DirectoryThread extends Thread {
        private final int from;
        private final int count;
        private static final String basePath = "/user/d";

        public DirectoryThread(int from, int count) {
            this.from = from;
            this.count = count;
        }

        @Override
        public void run() {
            // Each thread creates its own contiguous range of directories.
            for (int i = from; i < from + count; i++) {
                Path path = new Path(basePath + i);
                try {
                    fs.mkdirs(path);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String hdfsUri = "hdfs://<namenode-host>:8020"; // replace with your NameNode URI
        Configuration conf = new Configuration();
        fs = FileSystem.get(URI.create(hdfsUri), conf);

        long startTime = System.currentTimeMillis();
        int threadCount = 8;
        Thread[] threads = new Thread[threadCount];
        int total = 1000000;
        int countPerThread = total / threadCount;
        for (int j = 0; j < threadCount; j++) {
            Thread thread = new DirectoryThread(j * countPerThread, countPerThread);
            thread.start();
            threads[j] = thread;
        }
        for (Thread thread : threads) {
            thread.join();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Total: " + (endTime - startTime) + " milliseconds");
    }
}
Obviously, use as many threads as you can. But still, this takes 1-2 minutes, so I wonder how @bkosaraju could complete it "in a few seconds" with this code.
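For reference, a rough sketch of how the class above could be compiled and run (assuming it is saved as CreateDirs.java and a Hadoop client is installed so that "hadoop classpath" resolves the required jars):
# compile against the Hadoop client jars, then run against the cluster
javac -cp "$(hadoop classpath)" CreateDirs.java
java -cp "$(hadoop classpath):." CreateDirs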
Created 08-22-2017 03:03 PM
Hi @pbarna,
you may use the Pig/Grunt shell or the Hive CLI and pass all the directories in one shot, which is much quicker.
Created 08-22-2017 04:00 PM
Thanks for your response, @bkosaraju. Can you give me an example of any of the options you mentioned?
Created 08-23-2017 03:37 AM
I did a simple test and was able to complete it in a few seconds with your code,
and it is wise to split it into multiple passes.
#!/bin/bash
# Generate a file of "dfs -mkdir" commands, each with up to 8000 paths,
# then run the whole file in a single Hive session.
tgetfl=/tmp/hvdir_$(date +%s)
for i in {1..125}
do
  dirs=""
  for j in {1..8000}; do
    dirs="$dirs /dirtst/d$i.$j"
  done
  #echo "$dirs"
  echo dfs -mkdir $dirs
done > $tgetfl
date
hive -f $tgetfl
date
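For completeness, the Pig/Grunt option mentioned above should work the same way, since Grunt's fs command hands the paths to the Hadoop FsShell. This is only a sketch along the same lines (the /dirtst/p prefix and the temp file name are arbitrary, and it assumes pig is on the PATH):
#!/bin/bash
# Sketch of the Pig variant: generate a file of Grunt "fs -mkdir" commands
# with many paths per invocation and run it in a single Pig session.
pigfl=/tmp/pgdir_$(date +%s)
for i in {1..125}
do
  dirs=""
  for j in {1..8000}; do
    dirs="$dirs /dirtst/p$i.$j"
  done
  echo "fs -mkdir $dirs"
done > $pigfl
date
pig -f $pigfl
date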