What is the fastest way to create a large number of empty directories in HDFS?

Rising Star

For testing purposes I want to create a very large number of empty directories in HDFS, let's say 1 million.

What I tried was to call `hdfs dfs -mkdir` with 8,000 directories at a time and repeat this in a for loop:

for i in {1..125}
do
   dirs=""
   for j in {1..8000}; do
     dirs="$dirs /user/d$i.$j"
   done
   echo "$dirs"
   hdfs dfs -mkdir $dirs
done

Creating 1M directories this way takes hours.

My question is: what is the fastest way to create 1M empty directories?

1 ACCEPTED SOLUTION

Expert Contributor
@pbarna

I think the Java API should be the fastest.

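import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDirectories {

  // One FileSystem handle shared by all worker threads.
  private static FileSystem fs;

  static class DirectoryThread extends Thread {

    private final int from;
    private final int count;
    private static final String basePath = "/user/d";

    public DirectoryThread(int from, int count) {
      this.from = from;
      this.count = count;
    }

    // Creates /user/d<from> through /user/d<from + count - 1>.
    @Override
    public void run() {
      for (int i = from; i < from + count; i++) {
        Path path = new Path(basePath + i);
        try {
          fs.mkdirs(path);
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // hdfsUri and conf were left undefined in the snippet; taking the URI
    // from the first command-line argument is an assumption.
    String hdfsUri = args[0];
    Configuration conf = new Configuration();
    fs = FileSystem.get(URI.create(hdfsUri), conf);

    long startTime = System.currentTimeMillis();
    int threadCount = 8;
    Thread[] threads = new Thread[threadCount];
    int total = 1000000;
    int countPerThread = total / threadCount;
    // Give each thread its own contiguous block of directory numbers.
    for (int j = 0; j < threadCount; j++) {
      Thread thread = new DirectoryThread(j * countPerThread, countPerThread);
      thread.start();
      threads[j] = thread;
    }
    // Wait for all threads to finish before stopping the clock.
    for (Thread thread : threads) {
      thread.join();
    }
    long endTime = System.currentTimeMillis();

    System.out.println("Total: " + (endTime - startTime) + " milliseconds");
  }
}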

Obviously, use as many threads as you can. Even so, this takes 1-2 minutes, so I wonder how @bkosaraju was able to complete it in a few seconds.
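
One detail in the threaded approach is worth a note: `total / threadCount` is integer division, so if the total is not a multiple of the thread count the leftover directories are never created (1,000,000 and 8 happen to divide evenly). Below is a minimal sketch of a variant that uses an `ExecutorService` and spreads the remainder across the workers. The class name `ParallelMkdirs` and the `DirCreator` interface are hypothetical, introduced only so the example runs without a cluster; it uses the local file system as a stand-in, and against HDFS you would pass something like `p -> fs.mkdirs(new Path(p))` instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelMkdirs {

  /** Hypothetical hook for the real mkdir call, e.g. p -> fs.mkdirs(new Path(p)). */
  public interface DirCreator {
    void create(String path) throws IOException;
  }

  /** Splits [0, total) into contiguous {from, count} ranges so that no index is lost. */
  public static List<int[]> split(int total, int parts) {
    List<int[]> ranges = new ArrayList<>();
    int base = total / parts;
    int remainder = total % parts;
    int from = 0;
    for (int k = 0; k < parts; k++) {
      // The first `remainder` ranges each take one extra element.
      int count = base + (k < remainder ? 1 : 0);
      ranges.add(new int[] {from, count});
      from += count;
    }
    return ranges;
  }

  /** Creates basePath + i for every i in [0, total); returns the number of successful calls. */
  public static int createAll(String basePath, int total, int threads, DirCreator creator)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    AtomicInteger created = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    try {
      for (int[] range : split(total, threads)) {
        futures.add(pool.submit(() -> {
          for (int i = range[0]; i < range[0] + range[1]; i++) {
            try {
              creator.create(basePath + i);
              created.incrementAndGet();
            } catch (IOException e) {
              e.printStackTrace();
            }
          }
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // wait for every worker to finish
      }
    } finally {
      pool.shutdown();
    }
    return created.get();
  }

  public static void main(String[] args) throws Exception {
    // Local temp directory used as a stand-in for HDFS.
    String base = Files.createTempDirectory("mkdirs").resolve("d").toString();
    int n = createAll(base, 1003, 8, p -> Files.createDirectories(Paths.get(p)));
    System.out.println("Created " + n + " directories");
  }
}
```

Whether a pool is actually faster than plain threads here is not something this sketch shows; the only behavioural difference from the code above is that the remainder is handled.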


4 REPLIES

Super Collaborator

Hi @pbarna,

You can use Pig, the Grunt shell, or the Hive CLI and pass all the directories in one shot, which is much quicker.

Rising Star

Thanks for your response, @bkosaraju. Can you give me an example of any of the options you mentioned?

Super Collaborator

I ran a simple test and was able to complete it in a few seconds with your code.

It is also wise to split the work into multiple passes.

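#!/bin/bash
# Write the dfs commands to a temporary file, then run them all in one Hive session.
# -mkdir without -p assumes the parent directory /dirtst already exists.
tgetfl=/tmp/hvdir_$(date +%s)
for i in {1..125}
do
   dirs=""
   for j in {1..8000}; do
     dirs="$dirs /dirtst/d$i.$j"
   done
   #echo "$dirs"
   # One command per batch of 8000 paths; Hive CLI commands end with a semicolon.
   echo "dfs -mkdir $dirs;"
done > "$tgetfl"
date   # start time
hive -f "$tgetfl"
date   # end time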
