Created 08-22-2017 01:16 PM
For testing purposes I want to create a very large number of empty directories in HDFS, let's say 1 million.
What I tried is to use `hdfs dfs -mkdir` to create 8K directories at a time and repeat this in a for loop:
```bash
for i in {1..125}
do
  dirs=""
  for j in {1..8000}; do
    dirs="$dirs /user/d$i.$j"
  done
  echo "$dirs"
  hdfs dfs -mkdir $dirs
done
```
Apparently it takes hours to create 1M folders this way.
My question is, what would be the fastest way to create 1M empty folders?
Created 08-22-2017 03:03 PM
Hi @pbarna,
you may use the Pig Grunt shell or the Hive CLI and pass all the directories in one shot, which is much quicker.
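For example, something along these lines (the directory names below are just placeholders) creates several directories from a single client session, so you only pay the JVM start-up cost once:

```bash
# Rough sketch only: the Hive CLI accepts HDFS shell commands, so one
# session can create many directories in a single call.
hive -e "dfs -mkdir -p /dirtst/d1 /dirtst/d2 /dirtst/d3;"
# The equivalent from the Pig Grunt shell would be: fs -mkdir /dirtst/d1 /dirtst/d2
```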
Created 08-22-2017 04:00 PM
Thanks for your response, @bkosaraju, can you give me an example of any of these options you mentioned?
Created 08-23-2017 03:37 AM
I have done a simple test and was able to complete it in a few seconds with your code, and it is wise to split the work into multiple passes.
```bash
#!/bin/bash
tgetfl=/tmp/hvdir_$(date +%s)
for i in {1..125}
do
  dirs=""
  for j in {1..8000}; do
    dirs="$dirs /dirtst/d$i.$j"
  done
  #echo "$dirs"
  echo "dfs -mkdir $dirs;"
done > $tgetfl
date
hive -f $tgetfl
date
```
Created 08-23-2017 09:05 AM
I think the Java API should be the fastest.
```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateDirectories {

    // Shared HDFS client used by all worker threads.
    private static FileSystem fs;

    static class DirectoryThread extends Thread {
        private final int from;
        private final int count;
        private static final String basePath = "/user/d";

        public DirectoryThread(int from, int count) {
            this.from = from;
            this.count = count;
        }

        @Override
        public void run() {
            // Each thread creates its own contiguous range of directories.
            for (int i = from; i < from + count; i++) {
                Path path = new Path(basePath + i);
                try {
                    fs.mkdirs(path);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String hdfsUri = "hdfs://namenode:8020"; // placeholder, adjust to your cluster
        fs = FileSystem.get(URI.create(hdfsUri), conf);

        long startTime = System.currentTimeMillis();
        int threadCount = 8;
        Thread[] threads = new Thread[threadCount];
        int total = 1000000;
        int countPerThread = total / threadCount;
        for (int j = 0; j < threadCount; j++) {
            Thread thread = new DirectoryThread(j * countPerThread, countPerThread);
            thread.start();
            threads[j] = thread;
        }
        for (Thread thread : threads) {
            thread.join();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Total: " + (endTime - startTime) + " milliseconds");
    }
}
```
Obviously, use as many threads as you can. But still, this takes 1-2 minutes; I wonder how @bkosaraju could "complete it in a few seconds with your code".
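For comparison, parallelism can also be driven from a plain shell without any Java. The sketch below (untested here; batch size and process count are picked arbitrarily) batches the paths with xargs and runs several `hdfs dfs -mkdir` processes at once. Each batch still pays a JVM start-up, so it is unlikely to match the Java client, but it parallelizes the original loop with no extra code:

```bash
# Rough sketch: generate 1M paths, batch them 1000 per call, and run
# up to 8 "hdfs dfs -mkdir" processes in parallel.
seq 1 1000000 | sed 's|^|/user/d|' | xargs -n 1000 -P 8 hdfs dfs -mkdir -p
```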