My end goal is to run a few hive queries, get 1 csv file (with headers) per query, compress all those files together in one zip (not gzip or bzip, unfortunately, needs to open natively under windows) and hopefully get the zip back in hdfs.
My current solution (CTAS) ends up creating one directory per table, with possibly multiple files under it (depending on number of reducers and presence/absence of UNION). I can easily generate as well a header file per table with only one line in it.
Now how to put all that together?
The only option I could find implies to do all the processing locally (hdfs dfs -getmerge followed by a an actual zip command). This adds a lot of overhead and could technically fill up the local disk.
So my questions are:
is there a way to concatenate files inside hdfs without getting them locally?
is there a way to compress a bunch of files together (not individually) in zip, inside hdfs?
You can coalesce using Spark, or MergeContent in NiFi to "compact" processes without needing to go to -getmerge.
You should ideally avoid zip files on HDFS. They are not a common format in HDFS since they are not splittable. Reading a large ZIP file will therefore be only processable by a single mapper. Querying multiple part files of uncompressed CSV will be faster. If you need these files compressed in HDFS for archival while also able to query via Hive and other engines, use a different, compressed, binary format like Snappy w/ ORC.
If you just want a CSV, use Beeline's output format argument, and write the results to a file, which you can then ZIP.
The zip file is the output of the process, not to be read in hdfs anymore - it will just end up being downloaded and sent to a user. In this context using zip makes sense, as I am only looking at *compressing* multiple csv together, not reading them afterwards.
Using beeline with formatted output is what I do currently, but I end up downloading locally multiple gigs, compress and re-upload. This is a waste and could actually fill my local disks up.
Using coalesce in Spark is the best option I found, but the compression step is still not easy.
@Jordan Moore Not really relevant to the question but no this is not the point. The use case here is data export, where some clients have their own BI tools, processes and so on. They just need the data, csv in a zip file. Other clients do not have this in place and have a different access to this data.