We have a somewhat unusual requirement that we're experimenting with. We have a Parquet table (~320 GB) of frequency pixel values stored in arrays; the data comes from a three-dimensional spectral-line image cube. A legacy source-finding algorithm written in C is used to analyse the original data source files, but it is constrained by the physical memory available on a single machine. We're running a CDH Express cluster: 3 management nodes (8 cores, 32 GB RAM each), 10 worker nodes (16 cores, 64 GB RAM each), and a gateway node running JupyterHub (8 cores, 32 GB RAM). We've rebuilt the original C program as a shared object and distributed it across the cluster, then wrapped the shared object in a partitioner class so we can run it on multiple partitions across multiple executors. We have this up and running.
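To make the setup concrete, here is a simplified sketch of the pattern we use to call the native library from each partition. The real shared object and its entry point are our own (placeholders here); `libm`'s `sqrt` stands in for the native call so the snippet is self-contained.

```python
import ctypes
import ctypes.util

# Load the shared object once per process. In our job this is the
# source-finding .so we distributed to every node; libm stands in here.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

def run_native(values):
    # Placeholder for something like lib.find_sources(buffer, n_pixels);
    # we pass the partition's pixel values through the C entry point.
    return [libm.sqrt(v) for v in values]
```

Inside Spark this is invoked along the lines of `rdd.mapPartitions(run_native)`, so each executor loads the library and processes its partitions independently.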
The problem we're experiencing is that we optimally need an input pixel array of at least about 15 GB, but we hit a wall when creating arrays of around 7.3 GB, and we're unsure why.
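For scale, here is the back-of-envelope arithmetic for the target array, assuming float32 pixels (our actual dtype may differ). Notably, a 15 GB float32 array needs more elements than a single JVM array can hold (indices are signed 32-bit ints), which may be relevant to where collect-based assembly breaks down:

```python
import numpy as np

# Elements needed for a 15 GiB array of float32 pixels (assumption: 4-byte dtype).
n_elements = 15 * 1024**3 // np.dtype(np.float32).itemsize

# Max length of a single JVM array is 2**31 - 1 elements.
print(n_elements)              # 4026531840
print(n_elements > 2**31 - 1)  # True
```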
We're aware that collecting distributed data into a single array isn't normally recommended; however, it seems that running multiple instances of the C shared object in parallel across the cluster could be more efficient than running 300 GB extractions serially on a single machine.
Thanks in advance for any thoughts and assistance.