To improve access performance for a dataset, I would like to replicate the file's blocks to all DataNodes. It is a dimension dataset. One way would be to set the replication factor to a number higher than the number of DataNodes, but I would like to know whether there is a better way to do this.
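The approach you describe is a good way to do this. Note that HDFS places at most one replica of a block on any given DataNode, so a replication factor equal to the number of DataNodes is enough; anything higher just leaves the blocks reported as under-replicated. As an alternative, you could add the file to the application's distributed cache, which causes each NodeManager that runs one of the application's containers to download the file and keep a local copy while those containers execute. This isn't a good idea for very large files.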
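
For illustration, here is a minimal sketch of both options using the Hadoop Java API. It assumes a MapReduce job and the Hadoop client libraries on the classpath; the class name, job name, and command-line arguments are hypothetical, and the job's mapper, reducer, and input/output setup are left out.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DimensionFileSetup {
    public static void main(String[] args) throws Exception {
        // Hypothetical inputs: the HDFS path of the dimension file and
        // the desired replication factor (e.g. the number of DataNodes).
        Path dimFile = new Path(args[0]);
        short replication = Short.parseShort(args[1]);

        Configuration conf = new Configuration();
        FileSystem fs = dimFile.getFileSystem(conf);

        // Option 1: raise the replication factor of the existing file.
        // The call only records the new target; the NameNode creates the
        // extra replicas in the background.
        boolean accepted = fs.setReplication(dimFile, replication);
        System.out.println("Replication change accepted: " + accepted);

        // Option 2: register the file in the distributed cache so that
        // nodes running this job's tasks localize a copy of it.
        Job job = Job.getInstance(conf, "job-with-dimension-lookup");
        job.addCacheFile(fs.makeQualified(dimFile).toUri());
        // Mapper, reducer, and input/output configuration omitted.
    }
}
```

Inside a task, the registered files can be listed with `context.getCacheFiles()`. From the command line, `hdfs dfs -setrep -w <replication> <path>` changes the replication factor of an existing file and waits until the target is reached.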