
Fast loading one huge file

Rising Star

Hi dear experts!

 

I'm wondering how it's possible to accelerate loading one huge file into HDFS.

Let's say I have a 1 TB file on a Linux FS and I need to load it as fast as possible.

 

Could someone give me any ideas or recommendations?

 

thanks!

5 REPLIES

Re: Fast loading one huge file

Master Guru
Note that given the singular source, the maximum speed will be capped by the network speed, or by the disk speed if the file is already on a host that runs a DataNode.

You can:

Write with a replication factor of 1 (hadoop fs -Ddfs.replication=1 -put local/path hdfs/path). This eliminates the additional network usage arising from the DN pipeline communication in a multi-replica write. It has the added bonus of copying only locally if you run a DN on the same host as the source file and that DN has at least as much free space as the file.

For a possible (though untested) slight increase in performance: parallelise and chunk the writes into multiple precise parts of the preferred block size (256 MB), then merge them instantly with DistributedFileSystem.concat(…). These writes can also independently be done with replication set to 1, to keep DN pipeline communication from slowing your rate down.
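If it helps, here is a minimal Java sketch of the first suggestion via the FileSystem API, roughly the programmatic equivalent of the -put command above. It assumes fs.defaultFS points at your cluster; the local and HDFS paths are placeholders, not anything from this thread.

// Minimal sketch (not from the reply above): single-stream copy into HDFS
// with replication set to 1. Paths below are made-up placeholders.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SingleReplicaPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "1");             // write only one replica
    FileSystem fs = FileSystem.get(conf);

    Path dst = new Path("/stage/hugefile");       // placeholder HDFS path
    try (InputStream in = new BufferedInputStream(
             new FileInputStream("/data/hugefile")); // placeholder local path
         FSDataOutputStream out = fs.create(dst, true)) {
      IOUtils.copyBytes(in, out, 128 * 1024);     // plain streaming copy
    }

    // Optionally restore normal redundancy afterwards; re-replication then
    // happens in the background, off the write path:
    // fs.setReplication(dst, (short) 3);
  }
}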

Re: Fast loading one huge file

Rising Star

Thank you so much for your reply!

One more question:

When you said:

 

"For (perhaps, untested) a slight increase in performance: Parallelise and chunk the writes into multiple precise parts of the preferred block size (256 MB)"

 

did you mean cutting the file on the source side (like the split command in Linux), or something else?

 

thanks!

Re: Fast loading one huge file

Master Guru
In the simplest sense, yes. But implementation-wise I meant opening multiple concurrent input streams on the same file, each with a different seek offset, and reading a finite length from each of those points.
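To make that concrete, here is a rough Java sketch of what such a chunked, parallel upload could look like: each worker opens its own stream on the local file, seeks to its offset, writes one block-sized part to HDFS with replication 1, and DistributedFileSystem.concat() then stitches the parts together without copying data. The paths, chunk size and thread count are assumptions for illustration; chunking on the preferred block size, as suggested above, also keeps concat() happy on HDFS releases that require full blocks in all parts but the last.

// Rough sketch only (my illustration, not from the thread). Assumes
// fs.defaultFS points at HDFS; /data/hugefile and /stage are placeholders.
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ParallelChunkPut {
  // One HDFS block (256 MB); each part is a multiple of the block size so
  // that concat() accepts it.
  static final long CHUNK = 256L * 1024 * 1024;

  public static void main(String[] args) throws Exception {
    final String local = "/data/hugefile";        // placeholder source path
    final Path dir = new Path("/stage");          // placeholder target dir
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "1");             // single-replica parts
    conf.setLong("dfs.blocksize", CHUNK);
    final DistributedFileSystem dfs =
        (DistributedFileSystem) new Path("/").getFileSystem(conf);

    final long total = new java.io.File(local).length();
    final int parts = (int) ((total + CHUNK - 1) / CHUNK);

    ExecutorService pool = Executors.newFixedThreadPool(4); // assumed 4 writers
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < parts; i++) {
      final int part = i;
      futures.add(pool.submit((Callable<Void>) () -> {
        long offset = part * CHUNK;
        long remaining = Math.min(CHUNK, total - offset);
        byte[] buf = new byte[128 * 1024];
        // Independent stream on the same local file, seeked to this part's
        // offset, copied into its own HDFS part file.
        try (RandomAccessFile in = new RandomAccessFile(local, "r");
             FSDataOutputStream out =
                 dfs.create(new Path(dir, "hugefile.part" + part), true)) {
          in.seek(offset);
          while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) break;
            out.write(buf, 0, n);
            remaining -= n;
          }
        }
        return null;
      }));
    }
    for (Future<?> f : futures) f.get();          // wait for all writers
    pool.shutdown();

    // Stitch parts 1..n onto part 0 without moving data, then rename.
    Path first = new Path(dir, "hugefile.part0");
    Path[] rest = new Path[parts - 1];
    for (int i = 1; i < parts; i++) rest[i - 1] = new Path(dir, "hugefile.part" + i);
    if (rest.length > 0) dfs.concat(first, rest);
    dfs.rename(first, new Path(dir, "hugefile"));
  }
}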

Re: Fast loading one huge file

Rising Star

ok, got it :)

unfortunately, the standard Hadoop client doesn't do this, you have to write Java, right?

 

thanks for your reply!

Re: Fast loading one huge file

Rising Star

And one more question:

I've tried to:

1) mount the source on every node of the Hadoop cluster (over NFS)

2) run the distcp command with the -m option (specifying a few mappers), like: hadoop distcp file:///stage hdfs://stage/

 

I saw only one mapper, so it seems that Hadoop is not able to create multiple splits when it works with file:///. Maybe there is some workaround or other trick with hadoop distcp?

 

thanks!