Support Questions

MarlinGomez · 12-08-2025

Hey everyone, I’m currently preparing for the CCA175 (Cloudera Data Engineer) exam and focusing heavily on real, hands-on scenario challenges to strengthen my understanding. So far, I’ve practiced with various ingestion and transformation pipelines, but I’m stuck on one scenario that feels very close to what the actual exam might present. Midway through my study plan, I started using Certs Matrix, which has helped me evaluate different approaches to solving Spark and Hadoop workflow problems under pressure. The scenario I’m trying to clarify is this: if you receive streaming data in inconsistent formats and need to cleanse, transform, and store it efficiently in HDFS, which approach would be most exam-accurate: using Spark Structured Streaming with schema evolution, designing separate ETL pipelines for each input format, or relying on a unified schema-on-read strategy? I’d really appreciate insights from anyone who has taken CCA175 or handled similar real-world pipelines. Your guidance would help me refine my preparation.

RAGHUY · 02-08-2026

@MarlinGomez For that CCA175 streaming scenario (inconsistent formats, cleansed and transformed into HDFS), I'd go with Spark Structured Streaming plus schema evolution as the most exam-realistic pick.
It handles real-time ingestion efficiently via micro-batches, lets you parse semi-structured payloads (especially JSON/Avro) against a schema that tolerates missing or extra fields, and lets you apply transformations such as filter/map before writing Parquet to HDFS.
Separate ETL pipelines per format add too much complexity and overhead for exam constraints, and pure schema-on-read skips proactive cleansing.
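Quick start: read from a Kafka source, transform, then writeStream to Parquet on HDFS. Note that .option("mergeSchema", "true") is a Parquet read option (Delta Lake also accepts it on writes), so apply it when reading the output back rather than on a plain Parquet writeStream. This nails the "perform ETL on data using Spark API" objective perfectly. Good luck with your prep.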
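
Here is a rough PySpark sketch of that flow, in case it helps to see it end to end. Treat it as an illustration under assumptions, not a reference solution: the event fields in EVENT_SCHEMA_DDL, the helper names, the broker address, the topic, and the HDFS paths are all hypothetical, the payload is assumed to be JSON, and the Kafka source needs the spark-sql-kafka connector on the classpath.

```python
# Hypothetical target schema for the cleaned events (assumed field names).
EVENT_SCHEMA_DDL = "id STRING, event_time TIMESTAMP, amount DOUBLE"


def kafka_source_options(bootstrap_servers, topic):
    """Build the options passed to the Kafka source (values are caller-supplied)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_cleansing_stream(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming query: Kafka -> parse JSON -> cleanse -> Parquet files."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import functions as F

    # Kafka delivers key/value as binary columns; the payload is in `value`.
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )

    # Parse the payload against the explicit schema. Fields missing from a
    # record come back as null, and fields not in the schema are dropped.
    parsed = raw.select(
        F.from_json(F.col("value").cast("string"), EVENT_SCHEMA_DDL).alias("e")
    ).select("e.*")

    # Cleansing step: drop records without the required fields, tidy the id.
    cleaned = (
        parsed.filter(F.col("id").isNotNull() & F.col("amount").isNotNull())
        .withColumn("id", F.trim(F.col("id")))
    )

    # The file sink is append-only and needs a checkpoint location.
    return (
        cleaned.writeStream.format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

You would call it with something like start_cleansing_stream(spark, "broker:9092", "events", "hdfs:///data/events", "hdfs:///checkpoints/events") (made-up values). If the schema grows later and older and newer Parquet files end up with different columns, spark.read.option("mergeSchema", "true").parquet(output_path) reconciles them at read time.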

Cloudera Community

Need Help Clarifying a Real CCA175 Scenario