What is the fastest way to create a large number of empty directories in HDFS?

Rising Star

For testing purposes, I want to create a very large number of empty directories in HDFS, say 1 million.

I tried using `hdfs dfs -mkdir` to create 8K directories per call and repeating this in a for loop.

for i in {1..125}
do
   dirs=""
   for j in {1..8000}; do
     dirs="$dirs /user/d$i.$j"
   done
   echo "$dirs"
   hdfs dfs -mkdir $dirs
done

Apparently it takes hours to create 1M folders this way.

My question is, what would be the fastest way to create 1M empty folders?

1 ACCEPTED SOLUTION

Expert Contributor
@pbarna

I think the Java API should be the fastest.

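import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The wrapper class name is arbitrary; the original post showed only fragments.
public class CreateDirectories {

  private static final String BASE_PATH = "/user/d";

  // Creates the directories BASE_PATH + from through BASE_PATH + (from + count - 1).
  static class DirectoryThread extends Thread {

    private final FileSystem fs;
    private final int from;
    private final int count;

    DirectoryThread(FileSystem fs, int from, int count) {
      this.fs = fs;
      this.from = from;
      this.count = count;
    }

    @Override
    public void run() {
      for (int i = from; i < from + count; i++) {
        Path path = new Path(BASE_PATH + i);
        try {
          fs.mkdirs(path);
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) throws IOException, InterruptedException {
    // hdfsUri and conf were not defined in the original snippet. Reading the URI
    // from the first argument and using a default Configuration are assumptions.
    String hdfsUri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf);

    long startTime = System.currentTimeMillis();
    int threadCount = 8;
    Thread[] threads = new Thread[threadCount];
    int total = 1000000;
    // Integer division: this covers every index only if total is divisible by threadCount.
    int countPerThread = total / threadCount;
    for (int j = 0; j < threadCount; j++) {
      Thread thread = new DirectoryThread(fs, j * countPerThread, countPerThread);
      thread.start();
      threads[j] = thread;
    }
    // Wait for all workers before stopping the clock.
    for (Thread thread : threads) {
      thread.join();
    }
    long endTime = System.currentTimeMillis();

    System.out.println("Total: " + (endTime - startTime) + " milliseconds");
  }
}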

Obviously, use as many threads as you can. Even so, this takes 1-2 minutes, so I wonder how @bkosaraju could "complete in few seconds with your code".
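
If you want to try different thread counts, the even split above silently drops directories whenever the total is not divisible by the number of threads. Here is a rough sketch of a remainder-aware split on a fixed thread pool. It is not tied to Hadoop: the per-index action is passed in as an `IntConsumer`, so it runs without a cluster, and in real use the action would presumably wrap the `fs.mkdirs(...)` call. The names `ParallelRange`, `split`, and `forEachIndex` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

// Hypothetical helper, not part of any Hadoop API.
final class ParallelRange {

  // Returns {from, count} per worker; the first (total % workers) workers get one extra index.
  static int[][] split(int total, int workers) {
    int[][] ranges = new int[workers][2];
    int base = total / workers;
    int extra = total % workers;
    int from = 0;
    for (int w = 0; w < workers; w++) {
      int count = base + (w < extra ? 1 : 0);
      ranges[w][0] = from;
      ranges[w][1] = count;
      from += count;
    }
    return ranges;
  }

  // Runs action.accept(i) once for every i in [0, total) on a pool of `workers` threads.
  static void forEachIndex(int total, int workers, IntConsumer action) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (int[] r : split(total, workers)) {
        final int from = r[0];
        final int count = r[1];
        futures.add(pool.submit(() -> {
          for (int i = from; i < from + count; i++) {
            action.accept(i);
          }
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // wait for the worker and surface any exception it threw
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("a worker failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    AtomicInteger visited = new AtomicInteger();
    // Stand-in for the real work (e.g. an mkdirs call per index); it only counts calls.
    forEachIndex(1_000_003, 8, i -> visited.incrementAndGet());
    System.out.println("indices visited: " + visited.get());
  }
}
```

Unlike the `e.printStackTrace()` approach, waiting on the futures makes a failed worker stop the run with an exception, which may or may not be what you want for a throwaway test.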

REPLIES

Super Collaborator

Hi @pbarna,

You can use Pig (the Grunt shell) or the Hive CLI and pass all the directories in one shot, which is much quicker.

Rising Star

Thanks for your response, @bkosaraju. Can you give me an example of any of the options you mentioned?

Super Collaborator

I ran a simple test and was able to complete it in a few seconds with your code, and it is wise to split the work into multiple passes.

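#!/bin/bash
# Write 125 Hive CLI "dfs -mkdir" commands (8,000 directories each) to a file,
# then run them all in a single Hive session.
tgetfl=/tmp/hvdir_$(date +%s)
for i in {1..125}
do
   dirs=""
   for j in {1..8000}; do
     dirs="$dirs /dirtst/d$i.$j"
   done
   #echo "$dirs"
   # Each command in a Hive script must end with a semicolon.
   echo "dfs -mkdir $dirs;"
done > "$tgetfl"
date
hive -f "$tgetfl"
date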
