Reply
New Contributor
Posts: 4
Registered: 06-19-2017

Any good methods for compacting small files in Spark?

I'm running into issues with lots of small Avro and Parquet files being created and stored in HDFS, and I need a way to compact them through Spark and its native libraries.

 

I've seen that the standard methods for this seem to be coalesce, or using Impala to insert into a new table and then insert back, but are there any better methods that have come onto the scene, or anything more Spark-centric?

Cloudera Employee
Posts: 447
Registered: 08-11-2014

Re: Any good methods for compacting small files in Spark?

It should be pretty trivial to read the data in format X using Spark into a DataFrame or Dataset, repartition it to a smaller number of partitions, and write it back out in format X using Spark. The round trip ought not to change the data, though that's worth verifying. It should, however, always result in fewer and therefore larger files.

New Contributor
Posts: 4
Registered: 06-19-2017

Re: Any good methods for compacting small files in Spark?

I should have mentioned it in the first post, but I need to maintain existing partitions as they are, so I need to compact files within partitions.

Cloudera Employee
Posts: 447
Registered: 08-11-2014

Re: Any good methods for compacting small files in Spark?

If you mean partitions in the sense of Parquet/Avro partitioning by some key, those should be possible to preserve this way. In the general case of things like text files, a file is already a partition.

New Contributor
Posts: 4
Registered: 06-19-2017

Re: Any good methods for compacting small files in Spark?

Luckily I am only dealing with Parquet and Avro, not text. And yes, I was referring to the key partitions in the files.

 

Sorry for going off topic, but I'm still quite new to Spark and the Hadoop ecosystem in general, so I'm still trying to get a feel for everything. To clarify: are the partitions of RDDs/DataFrames different from the key-based partitions of the files? I had always thought they were the same.

Cloudera Employee
Posts: 447
Registered: 08-11-2014

Re: Any good methods for compacting small files in Spark?

Spark deals with arbitrary data, so its notion of partitions is not tied to data that contains a key. However, it's almost surely true that one key-based partition of the data in, say, Parquet will map to one (or more) partitions of a DataFrame that holds just the data with that key.

Explorer
Posts: 24
Registered: 06-13-2017

Re: Any good methods for compacting small files in Spark?

The best way to deal with small files is to not have to deal with them at all. You might want to explore using Kudu or HBase as your storage engine instead of HDFS (Parquet).