Hue offers the ability to compress files in HDFS as follows:
1. Select one or more HDFS files in the Hue File Browser.
2. On the Actions menu select the Compress option.
Is there any documentation about how to configure the cluster to support this?
So far it appears the following is required:
1. The oozie user must have execute permission on the HDFS directory tree
2. 'zip' shell command must be available on all HDFS data nodes?
3. Sufficient space in local (not HDFS) /tmp on all data nodes to hold the resulting compressed file.
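Requirements 2 and 3 can be spot-checked on a node with something like the sketch below. It is only an illustration based on the observations above, not how Hue or Oozie checks anything: the function name `preflight` and the `required_bytes` parameter are hypothetical, and it only inspects the node it is run on.

```python
import shutil


def preflight(tmp_dir="/tmp", required_bytes=0):
    """Report whether this node has `zip` on PATH and enough free space in tmp_dir."""
    zip_path = shutil.which("zip")  # None when zip is not on PATH
    free_bytes = shutil.disk_usage(tmp_dir).free  # free bytes on tmp_dir's partition
    return {
        "zip_found": zip_path is not None,
        "free_bytes": free_bytes,
        "space_ok": free_bytes >= required_bytes,
    }


if __name__ == "__main__":
    # Example: is there room for a 10 GiB archive in local /tmp?
    print(preflight("/tmp", 10 * 1024**3))
```

It would need to be run on every node that could host the job, since there is no way to know in advance which node is picked.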
Using local /tmp renders this feature unusable for large HDFS files. Can the local temp directory for this be changed?
We have resource caging applied to our YARN queues, so when we try to compress a file, the Oozie job created in the backend fails with the error "cannot submit job in root.default queue".
Is there a way to specify the queue when running the compression?
If not, what workaround or fix should I use so that the Oozie job created by Hue's compression feature goes to the correct YARN queue, one that the user has permission to submit jobs to?
We are on CDH 5.16.1
I've not found a way to specify the resource queue for the compression oozie job. We created a root.user queue with small max resources to catch cases where the queue cannot be specified. Not ideal but works around the problem.
To update my original message:
3. I have not found a way to move the local tmp directory used by the Oozie job.
Thanks for the update.
We have not kept any such queue, as we have many incoming new users who don't know that they have to specify a queue name when submitting a job, or how to do that. So they end up getting the error 'cannot submit application in root.default queue' for their normal jobs.
Thanks for sharing your idea, but if we create such a queue (like your root.user) to receive jobs when no queue is specified, it will not respect the model we have. I can check again with my team whether this can be done.
Moreover, it would be good to know what max resources you have set for your root.user queue, and what the largest dataset size (in MB or GB) is that you have seen compressed with this value.
This will give us an idea of how to set our max resources if we opt for your approach.
Yes, we had the same dilemma when creating a fall-back queue as it doesn’t respect our model either!
We observed the compress files in HDFS Oozie job to be allocated 1 container with 2GiB of memory and 1 VCore in YARN. We use a 5 VCore and 10GiB resource queue and the largest amount of data we’ve compressed is 100GiB. The YARN resource allocation doesn’t seem to change based on the amount of data being compressed and therefore I think the YARN queue will not be limiting.
As discussed earlier in the thread, the architecture of the compress files in HDFS feature doesn’t appear to be very scalable:
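As a rough sizing aid, the figures above can be turned into a small calculation. This is a sketch under the assumption that each compress job really takes exactly one container of the observed size and that the queue cap is the only limit. The function name is made up for illustration.

```python
def max_concurrent_jobs(queue_mem_gib, queue_vcores, job_mem_gib, job_vcores):
    """How many identical single-container jobs fit in the queue at once."""
    by_memory = queue_mem_gib // job_mem_gib
    by_vcores = queue_vcores // job_vcores
    # Whichever resource runs out first is the limit.
    return min(by_memory, by_vcores)


# Observed values: 10 GiB / 5 VCore queue, 2 GiB / 1 VCore per compress job.
print(max_concurrent_jobs(10, 5, 2, 1))  # 5
```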
1. All the data being compressed is first localized (copied) to a YARN Node Manager’s local cache (one directory is chosen from yarn.nodemanager.local-dirs). This requires enough local disk space on the partition where the directory resides.
2. The zip shell command is run locally on the same YARN node and uses a single CPU core; zip's default compression is quite slow.
3. Enough space is required in local /tmp to hold a copy of the completed zip file before it is copied up to HDFS.
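The local disk needed by steps 1 and 3 can be estimated with a sketch like this one. It rests on the observations above and on one extra assumption: `zip_ratio` (compressed size divided by original size) is a guess that depends entirely on the data, and the function name is hypothetical.

```python
def local_disk_footprint(input_bytes, zip_ratio=1.0):
    """Estimate local (non-HDFS) disk needed on the node running the job.

    zip_ratio is an assumed compressed/original size ratio; 1.0 is the
    worst case of data that does not compress at all.
    """
    cache_bytes = input_bytes                 # step 1: localized copy in the NodeManager local dir
    tmp_bytes = int(input_bytes * zip_ratio)  # step 3: finished zip held in local /tmp
    return {
        "nm_local_dir": cache_bytes,
        "tmp": tmp_bytes,
        "total": cache_bytes + tmp_bytes,
    }


GIB = 1024**3
# Example: 100 GiB of input, assuming it compresses to half its size.
print(local_disk_footprint(100 * GIB, zip_ratio=0.5))
```

If /tmp and the chosen local dir sit on the same partition, the "total" figure is what that partition has to hold.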
Without any documentation on the compress files in HDFS feature, this is just my opinion based on observations in our environment and reverse engineering.