Created 08-22-2017 01:16 PM
For testing purposes I want to create a very large number of empty directories in HDFS, let's say 1 million.
What I tried was to use `hdfs dfs -mkdir` to create 8K directories per call and repeat this in a for loop:
for i in {1..125}
do
   dirs=""
   for j in {1..8000}; do
     dirs="$dirs /user/d$i.$j"
   done
   echo "$dirs"
   hdfs dfs -mkdir $dirs
done
Apparently it takes hours to create 1M folders this way.
My question is, what would be the fastest way to create 1M empty folders?
Created 08-23-2017 09:05 AM
I think the Java API should be the fastest.
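import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDirectories {
  private static final String BASE_PATH = "/user/d";

  // Creates the directories BASE_PATH + from .. BASE_PATH + (from + count - 1).
  static class DirectoryThread extends Thread {
    private final FileSystem fs;
    private final int from;
    private final int count;

    DirectoryThread(FileSystem fs, int from, int count) {
      this.fs = fs;
      this.from = from;
      this.count = count;
    }

    @Override
    public void run() {
      for (int i = from; i < from + count; i++) {
        Path path = new Path(BASE_PATH + i);
        try {
          fs.mkdirs(path);
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Assumption: the HDFS URI is passed as the first command-line argument.
    String hdfsUri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(hdfsUri), conf);

    long startTime = System.currentTimeMillis();
    int threadCount = 8;
    Thread[] threads = new Thread[threadCount];
    int total = 1000000;
    int countPerThread = total / threadCount; // 1,000,000 divides evenly by 8
    for (int j = 0; j < threadCount; j++) {
      Thread thread = new DirectoryThread(fs, j * countPerThread, countPerThread);
      thread.start();
      threads[j] = thread;
    }
    // Wait for all workers before stopping the clock.
    for (Thread thread : threads) {
      thread.join();
    }
    long endTime = System.currentTimeMillis();
    System.out.println("Total: " + (endTime - startTime) + " milliseconds");
  }
}
Obviously, use as many threads as you can. Still, this takes 1-2 minutes; I wonder how @bkosaraju could "complete in few seconds with your code".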
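If the total does not divide evenly by the thread count, the simple `total / threadCount` split drops the remainder. Below is a minimal sketch of one way to handle that with a fixed thread pool. The names `ParallelMkdirs`, `DirCreator`, `ranges` and `createAll` are made up for illustration; `DirCreator` is a stand-in so the sketch runs without a cluster, and against HDFS it would presumably wrap a call such as `fs.mkdirs(new Path(path))`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelMkdirs {
  /** Hypothetical hook; with HDFS this would wrap fs.mkdirs(new Path(path)). */
  interface DirCreator {
    void create(String path) throws Exception;
  }

  /**
   * Splits [0, total) into threadCount contiguous ranges {from, to}.
   * The first (total % threadCount) ranges get one extra item.
   */
  static int[][] ranges(int total, int threadCount) {
    int[][] out = new int[threadCount][2];
    int base = total / threadCount;
    int extra = total % threadCount;
    int start = 0;
    for (int t = 0; t < threadCount; t++) {
      int size = base + (t < extra ? 1 : 0);
      out[t][0] = start;
      out[t][1] = start + size;
      start += size;
    }
    return out;
  }

  /** Creates basePath + i for every i in [0, total), spread over a fixed pool. */
  static void createAll(String basePath, int total, int threadCount, DirCreator creator)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (int[] r : ranges(total, threadCount)) {
        final int from = r[0];
        final int to = r[1];
        // Returning null makes this a Callable, so checked exceptions are allowed.
        pending.add(pool.submit(() -> {
          for (int i = from; i < to; i++) {
            creator.create(basePath + i);
          }
          return null;
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // rethrows a worker failure instead of silently printing it
      }
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // In-memory stand-in for the file system, only to show the sketch running.
    Set<String> created = ConcurrentHashMap.newKeySet();
    createAll("/user/d", 1000, 8, created::add);
    System.out.println("Created " + created.size() + " paths");
  }
}
```

Unlike the `printStackTrace()` approach, this version fails the whole run on the first error, which may or may not be what you want for a test data load.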
Created 08-22-2017 03:03 PM
Hi @pbarna,
You can use Pig (the Grunt shell) or the Hive CLI and pass all the directories in one shot, which is much quicker.
Created 08-22-2017 04:00 PM
Thanks for your response, @bkosaraju, can you give me an example of any of these options you mentioned?
Created 08-23-2017 03:37 AM
I ran a simple test and was able to complete it in a few seconds with your code.
It is wise to split the work into multiple passes.
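#!/bin/bash
# File that will hold the generated Hive CLI commands
tgetfl=/tmp/hvdir_$(date +%s)
for i in {1..125}
do
   dirs=""
   for j in {1..8000}; do
     dirs="$dirs /dirtst/d$i.$j"
   done
   # One "dfs -mkdir" command per 8000 directories; Hive CLI commands end with ";"
   echo "dfs -mkdir $dirs;"
done > "$tgetfl"
date
hive -f "$tgetfl"
date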
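For anyone who wants to tune the number of passes and the batch size, here is a minimal sketch in Java that generates the same kind of command file. The class name `HiveMkdirScript`, the method `buildCommands` and the temp-file naming are made up for illustration, and it assumes each Hive CLI command is terminated with `;`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class HiveMkdirScript {
  /** Builds one "dfs -mkdir ...;" line per batch, with paths named prefix + batch + "." + index. */
  static List<String> buildCommands(String prefix, int batches, int perBatch) {
    List<String> lines = new ArrayList<>(batches);
    for (int i = 1; i <= batches; i++) {
      StringBuilder sb = new StringBuilder("dfs -mkdir");
      for (int j = 1; j <= perBatch; j++) {
        sb.append(' ').append(prefix).append(i).append('.').append(j);
      }
      lines.add(sb.append(';').toString());
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    // 125 batches of 8000 paths, matching the numbers used in this thread.
    Path script = Files.createTempFile("hvdir_", ".hql");
    Files.write(script, buildCommands("/dirtst/d", 125, 8000), StandardCharsets.UTF_8);
    System.out.println("Wrote " + script + "; run it with: hive -f " + script);
  }
}
```

Changing the two numeric arguments trades fewer, longer commands against more, shorter ones; which split is fastest on a given cluster is something to measure rather than assume.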