Community Articles

subratadas · ‎08-30-2021

File processing + autoscaling seems to have an antinomic relationship. It doesn't have to be..really. Autoscaling often drives inefficient behaviors..why? The concomitant response... "It Autoscales". Autoscaling still requires sound design principles. Without that, It won't autoscale well.

Autoscaling within a distributed framework requires proper data dichotomization along with processing/service layer decoupling

Anti-Pattern

Taking an example of where I've seen the most amount of heartburn.

Large files (i.e. zip/tar/etc) land in s3/Storage area. The knee-jerk reaction is to feed it into a distributed processing engine that autoscales. The "autoscaling" part appropriately. It may. Most likely you're flipping a coin and lighting a candle, hoping that all works out. What if the file sizes are heterogeneous and the variances between them could be quite significant? What about error handling? Does it have to be all or nothing (meaning all files need to be processing or all fail)?

What if autoscaling is driven through smaller processing units (groups)?

Here we have taken the same payloads, but defrag them into smaller units. Each unit requires a heterogeneous compute footprint driven resource consumption efficiencies.

Technical Details

For example, a payload (myExamplePayload.zip) containers 1000 JSON files. Instead of throwing this entire payload at a single compute cluster (requiring the maximum number of resources possible, aka top-line compute profile)...defrag the payload.

As the payload arrives in s3, CDF-X's (Cloudera Data Flow Experience) s3 processor listens for any files. CDF-X will pull the files, decompress, and write all the files back to s3. For example:

s3://My-S3-Bucket/decompressed/file1.json,  3://My-S3-Bucket/decompressed/file2.json, ...

In parallel CDF-X generates CDE (Cloudera Data Engineering, Spark) a job spec, a JSON payload. The job spec includes file locations. Each job would be provided with roughly ~100 file names+location. Since CDF-X knows about the file size, it can also hint to CDE how much compute would be required to process these files. This step is not necessary as we are already defragged the unit of work to something manageable, and therefore CDE autoscaling should kick in and perform well. Once the job spec is created, CDF-X calls CDE over the rest sending over the job spec. CDE accepts the job spec and arguments (file locations) and runs the micro workloads. Each workload has its own heterogeneous compute profile and auto-scales independently.

Wrapping Up

Defragmentation of large singleton payloads enables autoscaling to run more efficiently. Autoscale is an incredibly powerful capability often misused by not applying sound design principles. Leveraging these simple patterns allow for ease of operations, cost control, and manageability.

Cloudera Community

Community Articles

Autoscale File Processing - A Disciplined Approach

Apache Spark

Cloudera Data Engineering (CDE)

Cloudera DataFlow (CDF)

Anti-Pattern

Technical Details

Wrapping Up