I have some experience with Hortonworks installation and maintenance, and with Talend-based Sqoop ingestion. These days I'm a PostgreSQL/Greenplum developer (core framework) at the same company, where we are also working on a Hadoop-based approach. The real problem, as we see it, is that we want to use HDFS/Hive as an MPP RDBMS.
I've read your post about the Balancer daemon (the 100x performance improvement), where you mention "Block Pinning", but I haven't spent much time investigating it further.
Is there any plan (I know it would be a huge change) to improve the block placement procedure (the constraint-satisfaction algorithm) with a "distributed by" mechanism like the one available in Greenplum (on top of the existing partitioning)?
As you know, when we run joins in Spark notebooks, the network traffic is very significant. I've read that block splitting is handled by the storage file format writers.
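For reference, the closest existing analogue to Greenplum's `DISTRIBUTED BY` on the Hive side is table bucketing (`CLUSTERED BY`), which hashes rows by a key into a fixed number of bucket files and can enable bucket-map joins that reduce shuffle; it does not, however, pin those buckets to specific DataNodes the way Greenplum co-locates data on segments. A rough comparison, using a made-up `orders` table and `customer_id` key purely for illustration:

```sql
-- Greenplum: rows are hash-distributed across segments by customer_id,
-- so joins on customer_id between co-distributed tables avoid motion/redistribution.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      NUMERIC(10,2)
) DISTRIBUTED BY (customer_id);

-- Hive: rows are hash-bucketed by customer_id into 32 files per partition;
-- bucket placement on DataNodes is still up to the HDFS block placement policy.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2)
) CLUSTERED BY (customer_id) INTO 32 BUCKETS
  STORED AS ORC;
```

The gap you're pointing at is exactly the last comment: Hive controls which *file* a row lands in, but not which *node* the file's blocks land on.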