
[Image: 2021-08-30_12-12-59.jpg]

File processing and autoscaling seem to have an antinomic relationship. They don't have to... really. Autoscaling often drives inefficient behaviors. Why? The concomitant response: "It autoscales." Autoscaling still requires sound design principles; without them, it won't autoscale well.

 

Autoscaling within a distributed framework requires proper data partitioning along with processing/service-layer decoupling.

Anti-Pattern

Let's take the example where I've seen the most heartburn.

[Image: 2021-08-30_12-41-07.jpg]

Large files (i.e. zip/tar/etc.) land in an S3/storage area. The knee-jerk reaction is to feed them into a distributed processing engine and trust that the "autoscaling" part behaves appropriately. It may. More likely, you're flipping a coin and lighting a candle, hoping it all works out. What if the file sizes are heterogeneous and the variances between them are significant? What about error handling? Does it have to be all or nothing (meaning all files are processed successfully or all fail)?

What if autoscaling is driven through smaller processing units (groups)?

 

[Image: 2021-08-30_13-46-54.jpg]

Here we take the same payloads but defragment them into smaller units. Each unit gets its own (heterogeneous) compute footprint, driving resource-consumption efficiencies.

Technical Details

For example, a payload (myExamplePayload.zip) contains 1,000 JSON files. Instead of throwing the entire payload at a single compute cluster (requiring the maximum number of resources possible, aka the top-line compute profile), defrag the payload.
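To make the "defrag" idea concrete, here is a minimal sketch of splitting a payload's file listing into smaller units of work. The file names and the group size of 100 are purely illustrative.

```python
# Minimal sketch: split a large payload's file listing into smaller units of work.
# File names and group size are illustrative.

def chunk(items, size):
    """Yield successive fixed-size groups from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. the 1,000 JSON files contained in myExamplePayload.zip
all_files = [f"file{i}.json" for i in range(1, 1001)]

# ~100 files per unit of work -> 10 independent micro workloads
work_units = list(chunk(all_files, 100))
print(len(work_units))     # 10
print(len(work_units[0]))  # 100
```

Each of those groups can now be scheduled, sized, retried, and failed independently, which is what lets the downstream autoscaling behave.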

[Image: 2021-08-30_13-55-37.jpg]

As the payload arrives in S3, CDF-X's (Cloudera Data Flow Experience) S3 processor listens for new files. CDF-X pulls the file, decompresses it, and writes the individual files back to S3. For example:

 

s3://My-S3-Bucket/decompressed/file1.json, s3://My-S3-Bucket/decompressed/file2.json, ...
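For illustration, here is a rough sketch of that decompress-and-write-back step done directly with boto3. In the actual pattern, CDF-X's S3 processors handle this; the bucket and prefixes below are hypothetical.

```python
# Illustrative only: decompress a zip payload from S3 and write each member
# back as its own object. In the pattern above, CDF-X (NiFi) does this work.
import io
import zipfile

import boto3

s3 = boto3.client("s3")
BUCKET = "My-S3-Bucket"  # hypothetical bucket


def explode_payload(payload_key: str, out_prefix: str = "decompressed/") -> None:
    """Download a zip payload from S3 and write each member back as an individual object."""
    obj = s3.get_object(Bucket=BUCKET, Key=payload_key)
    with zipfile.ZipFile(io.BytesIO(obj["Body"].read())) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            s3.put_object(Bucket=BUCKET, Key=f"{out_prefix}{name}", Body=zf.read(name))


# explode_payload("landing/myExamplePayload.zip")  # hypothetical landing key
```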

 

[Image: 2021-08-30_15-07-11.jpg]

In parallel, CDF-X generates a CDE (Cloudera Data Engineering, i.e. Spark) job spec as a JSON payload. The job spec includes the file locations; each job is handed roughly ~100 file names and locations. Since CDF-X knows the file sizes, it can also hint to CDE how much compute will be required to process them. That hint isn't strictly necessary: the unit of work has already been defragged into something manageable, so CDE autoscaling should kick in and perform well. Once the job spec is created, CDF-X calls CDE over REST, sending the job spec. CDE accepts the job spec and arguments (file locations) and runs the micro workloads. Each workload has its own heterogeneous compute profile and autoscales independently.
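As a sketch of what that could look like, here is a hypothetical job-spec builder and REST submission. The endpoint path, authentication, spec schema, and resource fields are placeholders, not the exact CDE Jobs API contract; check the CDE documentation for the real field names.

```python
# Hypothetical sketch: build a per-group job spec and submit it to CDE over REST.
# Endpoint, auth, and spec schema are placeholders -- not the literal CDE API.
import json

import requests

CDE_API_BASE = "https://<your-cde-virtual-cluster>/api/v1"  # placeholder URL
TOKEN = "<access-token>"                                     # placeholder token


def submit_group(group_id: int, file_keys: list, est_size_bytes: int) -> dict:
    """POST one micro-workload spec; the file list rides along as a job argument."""
    spec = {
        "name": f"process-group-{group_id}",
        "type": "spark",
        "spark": {
            "file": "s3://My-S3-Bucket/jobs/process_files.py",  # hypothetical Spark app
            "args": [json.dumps(file_keys)],
            # Optional sizing hint derived from the known file sizes; with the
            # work already defragged, CDE autoscaling can also right-size on its own.
            "executorMemory": "4g" if est_size_bytes > 1_000_000_000 else "2g",
        },
    }
    resp = requests.post(
        f"{CDE_API_BASE}/jobs",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=spec,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

On the Spark side, each micro workload would simply parse its file-list argument and read those objects directly (e.g. spark.read.json(file_keys)) rather than re-scanning the bucket, so every job touches only its ~100 files.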

Wrapping Up

Defragmenting large singleton payloads enables autoscaling to run more efficiently. Autoscaling is an incredibly powerful capability that is often misused when sound design principles are not applied. Leveraging these simple patterns allows for ease of operations, cost control, and manageability.
