What is the fastest way to create a large number of empty directories in HDFS?

Contributor

For testing purposes, I want to create a very large number of empty directories in HDFS, say 1 million.

I tried using `hdfs dfs -mkdir` to create 8,000 directories per call, and repeated this in a for loop:

for i in {1..125}
do
   dirs=""
   for j in {1..8000}; do
     dirs="$dirs /user/d$i.$j"
   done
   echo "$dirs"
   hdfs dfs -mkdir $dirs
done

Apparently it takes hours to create 1M folders this way.

My question is, what would be the fastest way to create 1M empty folders?

1 ACCEPTED SOLUTION

Expert Contributor
@pbarna

I think the Java API should be the fastest.

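import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDirectories {

  private static final String BASE_PATH = "/user/d";

  // One FileSystem client shared by all worker threads.
  private static FileSystem fs;

  // Creates BASE_PATH + i for every i in [from, from + count).
  static class DirectoryThread extends Thread {

    private final int from;
    private final int count;

    DirectoryThread(int from, int count) {
      this.from = from;
      this.count = count;
    }

    @Override
    public void run() {
      for (int i = from; i < from + count; i++) {
        Path path = new Path(BASE_PATH + i);
        try {
          fs.mkdirs(path);
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Assumption: the HDFS URI is passed as the first argument.
    String hdfsUri = args[0];
    Configuration conf = new Configuration();
    fs = FileSystem.get(URI.create(hdfsUri), conf);

    long startTime = System.currentTimeMillis();
    int threadCount = 8;
    Thread[] threads = new Thread[threadCount];
    int total = 1000000;
    // Integer division: total must be a multiple of threadCount.
    int countPerThread = total / threadCount;
    for (int j = 0; j < threadCount; j++) {
      Thread thread = new DirectoryThread(j * countPerThread, countPerThread);
      thread.start();
      threads[j] = thread;
    }
    // Wait for all workers before stopping the clock.
    for (Thread thread : threads) {
      thread.join();
    }
    long endTime = System.currentTimeMillis();

    System.out.println("Total: " + (endTime - startTime) + " milliseconds");
  }
}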

Use as many threads as you can. Even so, this takes 1-2 minutes, so I wonder how @bkosaraju managed to finish in a few seconds.
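The per-thread split above relies on the total being an exact multiple of the thread count. As a rough sketch of one way to handle an uneven total, the variant below spreads the remainder over the first few workers and uses a fixed thread pool. The names `ParallelMkdirs`, `DirCreator`, `partition`, and `createAll` are made up for this illustration; `DirCreator` is a stand-in so the example runs without a cluster, and with Hadoop on the classpath it would presumably wrap `fs.mkdirs(new Path(path))`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelMkdirs {

  /** Hypothetical callback; a real one might call fs.mkdirs(new Path(path)). */
  public interface DirCreator {
    void create(String path) throws Exception;
  }

  /**
   * Splits [0, total) into `parts` contiguous ranges, returned as {from, count}.
   * The first (total % parts) ranges each get one extra element.
   */
  public static int[][] partition(int total, int parts) {
    int[][] ranges = new int[parts][2];
    int base = total / parts;
    int extra = total % parts;
    int from = 0;
    for (int p = 0; p < parts; p++) {
      int count = base + (p < extra ? 1 : 0);
      ranges[p][0] = from;
      ranges[p][1] = count;
      from += count;
    }
    return ranges;
  }

  /** Calls creator for basePath + i, i in [0, total); returns the failure count. */
  public static int createAll(String basePath, int total, int threadCount,
                              DirCreator creator) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);
    AtomicInteger failures = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    try {
      for (int[] range : partition(total, threadCount)) {
        final int from = range[0];
        final int count = range[1];
        futures.add(pool.submit(() -> {
          for (int i = from; i < from + count; i++) {
            try {
              creator.create(basePath + i);
            } catch (Exception e) {
              failures.incrementAndGet(); // count the failure and keep going
            }
          }
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // wait for every worker to finish
      }
    } finally {
      pool.shutdown();
    }
    return failures.get();
  }

  public static void main(String[] args) throws Exception {
    // Stand-in creator that only counts calls; no HDFS is contacted here.
    AtomicInteger visited = new AtomicInteger();
    int failed = createAll("/user/d", 1000003, 8, path -> visited.incrementAndGet());
    System.out.println(visited.get() + " paths visited, " + failed + " failures");
  }
}
```

Counting failures instead of only printing stack traces makes it easier to tell whether all of the directories were actually created.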


4 REPLIES

Super Collaborator

Hi @pbarna,

You can use Pig (the Grunt shell) or the Hive CLI and pass all the directories in one shot, which is much quicker.

Contributor

Thanks for your response, @bkosaraju. Can you give me an example of any of the options you mentioned?

Super Collaborator

I ran a simple test and was able to complete it in a few seconds with your code.

It is also wise to split the work into multiple passes.

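#!/bin/bash
# Write 125 Hive "dfs -mkdir" commands (8,000 paths each) to a temp file,
# then run them all in a single Hive CLI session.
tgetfl=/tmp/hvdir_$(date +%s)
for i in {1..125}
do
   dirs=""
   for j in {1..8000}; do
     dirs="$dirs /dirtst/d$i.$j"
   done
   # Hive CLI commands are terminated with a semicolon.
   echo "dfs -mkdir $dirs;"
done > "$tgetfl"
date
hive -f "$tgetfl"
date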
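To make the shape of the generated command file explicit, here is a small Java sketch that builds the same batched lines as the shell loop above. It assumes each Hive CLI command should end with a semicolon; the class name `HiveMkdirScript` and the method `buildCommands` are hypothetical names for this illustration, and the output path is taken from the first argument.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class HiveMkdirScript {

  /**
   * Builds one "dfs -mkdir ...;" line per batch, using the same
   * parent/d<i>.<j> naming as the shell loop.
   */
  public static List<String> buildCommands(String parent, int batches, int perBatch) {
    List<String> lines = new ArrayList<>();
    for (int i = 1; i <= batches; i++) {
      StringBuilder sb = new StringBuilder("dfs -mkdir");
      for (int j = 1; j <= perBatch; j++) {
        sb.append(' ').append(parent).append("/d").append(i).append('.').append(j);
      }
      sb.append(';'); // assumed Hive CLI command terminator
      lines.add(sb.toString());
    }
    return lines;
  }

  public static void main(String[] args) throws Exception {
    // 125 batches of 8,000 paths, written one command per line.
    Files.write(Paths.get(args[0]), buildCommands("/dirtst", 125, 8000));
  }
}
```

The resulting file could then be passed to `hive -f`, as in the script above.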

