CCP Data Engineer (DE575) questions

Hello everyone!

I have some questions (in red) about the CCP Data Engineer (DE575) Exam (reference


    • Import and export data between an external RDBMS and your cluster, including the ability to import specific subsets, change the delimiter and file format of imported data during ingest, and alter the data access pattern or privileges.
      Can someone elaborate on the aspects related to data access patterns and privileges? If I understand correctly, Sqoop is the right tool for ingesting data from an RDBMS, but it's not clear to me what we are supposed to do to manage access patterns or privileges with Sqoop.
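
To make the ingest part of that requirement concrete, here is a minimal sketch of a Sqoop import that selects a subset of rows and columns and changes the delimiter or file format. It only assembles the command line in Python and does not run it. The JDBC URL, user, table, column names and target directory are hypothetical placeholders; the Sqoop flags themselves (`--where`, `--columns`, `--fields-terminated-by`, `--as-parquetfile`, `--target-dir`, `--num-mappers`, `-P`) are standard `sqoop import` options. It does not address the access pattern or privileges part of the question.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, columns, where, target_dir,
                       delimiter="|", as_parquet=False, mappers=4):
    """Return the argument list for a `sqoop import` of a subset of one table."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                               # prompt for the password at run time
        "--table", table,
        "--columns", ",".join(columns),     # import only these columns
        "--where", where,                   # import only the matching rows
        "--target-dir", target_dir,         # HDFS directory for the output
        "--num-mappers", str(mappers),
    ]
    if as_parquet:
        # Change the file format on ingest (the default is delimited text).
        args.append("--as-parquetfile")
    else:
        # Change the field delimiter of the text output.
        args += ["--fields-terminated-by", delimiter]
    return args


# Hypothetical connection details, for illustration only.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://dbhost/retail_db",
    username="retail_user",
    table="orders",
    columns=["order_id", "order_date", "order_status"],
    where="order_status = 'COMPLETE'",
    target_dir="/user/cloudera/orders_complete",
)
print(" ".join(shlex.quote(a) for a in cmd))
```

Passing `as_parquet=True` swaps the delimiter option for `--as-parquetfile`, which is one way to read "change the file format of imported data during ingest".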

    • Ingest real-time and near-real time (NRT) streaming data into HDFS, including the ability to distribute to multiple data sources and convert data on ingest from one format to another.
      Is it sufficient to know only Flume, or do we also need to know Kafka?
    • Tune data for optimal query performance.
      Can someone provide an example of tuning data to improve query performance?
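
One possible reading of "tune data", offered as an assumption and not as the exam's official meaning, is to rewrite a table into a columnar format, partition it on a column that queries commonly filter on, and gather statistics. The sketch below only generates the HiveQL statements as strings; the table and column names are hypothetical.

```python
def tuning_statements(src, dst, columns, partition_col):
    """Return HiveQL that rewrites `src` as a partitioned Parquet table `dst`.

    columns:       list of (name, type) pairs for the non-partition columns
    partition_col: (name, type) pair for the partition column
    """
    col_defs = ", ".join(f"{name} {typ}" for name, typ in columns)
    col_names = ", ".join(name for name, _ in columns)
    part_name, part_type = partition_col
    return [
        # Allow every partition value to be derived from the data.
        "SET hive.exec.dynamic.partition.mode=nonstrict",
        # Columnar storage plus partitioning on a frequently filtered column.
        f"CREATE TABLE {dst} ({col_defs}) "
        f"PARTITIONED BY ({part_name} {part_type}) STORED AS PARQUET",
        # The partition column must come last in the SELECT list.
        f"INSERT OVERWRITE TABLE {dst} PARTITION ({part_name}) "
        f"SELECT {col_names}, {part_name} FROM {src}",
        # Statistics help the query planner.
        f"ANALYZE TABLE {dst} PARTITION ({part_name}) COMPUTE STATISTICS",
    ]


# Hypothetical table layout, for illustration only.
stmts = tuning_statements(
    src="orders_raw",
    dst="orders_by_month",
    columns=[("order_id", "INT"), ("amount", "DOUBLE")],
    partition_col=("order_month", "STRING"),
)
for s in stmts:
    print(s + ";")
```

A query that filters on `order_month` would then read only the matching partition directories instead of scanning the whole table.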

    • Is it mandatory to know Pig? Spark seems to be much more versatile and should cover every transformation task that you can do with Pig. Is it OK to skip Pig and concentrate on mastering Spark?
    • Is it sufficient to know Hive, or will Impala also be specifically required in the exercises? Although Hive and Impala seem very similar in terms of language and features, certain features are available in Hive but not in Impala.
    • Create and execute a linear workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom actions, etc.
    • Create and execute a branching workflow with actions that include Hadoop jobs, Hive jobs, Pig jobs, custom action, etc.
    • Orchestrate a workflow to execute regularly at predefined times, including workflows that have data dependencies
    • If I understand correctly, Oozie should be the right tool for the job. Can someone suggest or provide useful resources to prepare for this part?
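
As a rough picture of what a "linear workflow" could look like in Oozie, here is a minimal sketch that generates a workflow definition in which each action runs after the previous one succeeds. The action names and script file names are hypothetical, and the schema versions (`uri:oozie:workflow:0.5`, `uri:oozie:hive-action:0.5`) are assumptions that should be checked against the Oozie version on the cluster.

```python
import xml.etree.ElementTree as ET


def action_xml(name, kind, script, next_node):
    """Render one Hive or Pig action that moves to `next_node` on success."""
    # The Hive action lives in its own namespace; Pig uses the workflow one.
    xmlns = ' xmlns="uri:oozie:hive-action:0.5"' if kind == "hive" else ""
    return (
        f'  <action name="{name}">\n'
        f'    <{kind}{xmlns}>\n'
        '      <job-tracker>${jobTracker}</job-tracker>\n'
        '      <name-node>${nameNode}</name-node>\n'
        f'      <script>{script}</script>\n'
        f'    </{kind}>\n'
        f'    <ok to="{next_node}"/>\n'
        '    <error to="fail"/>\n'
        '  </action>\n'
    )


def linear_workflow(app_name, steps):
    """steps: list of (action_name, 'hive' or 'pig', script_file) in run order."""
    names = [step[0] for step in steps] + ["end"]
    body = "".join(
        action_xml(name, kind, script, names[i + 1])
        for i, (name, kind, script) in enumerate(steps)
    )
    return (
        f'<workflow-app name="{app_name}" xmlns="uri:oozie:workflow:0.5">\n'
        f'  <start to="{names[0]}"/>\n'
        f'{body}'
        '  <kill name="fail">\n'
        '    <message>Failed: ${wf:errorMessage(wf:lastErrorNode())}</message>\n'
        '  </kill>\n'
        '  <end name="end"/>\n'
        '</workflow-app>\n'
    )


# Hypothetical two-step pipeline: a Pig cleanup followed by a Hive aggregation.
wf = linear_workflow("daily-etl", [
    ("clean", "pig", "clean.pig"),
    ("aggregate", "hive", "aggregate.hql"),
])
ET.fromstring(wf)  # raises ParseError if the generated XML is not well-formed
print(wf)
```

A branching workflow would add `decision` or `fork`/`join` nodes between the actions, and running on a schedule or on data availability would need a separate coordinator application; neither is shown here.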
    1. Each user is given their own CDH cluster (currently 5.10.1) pre-loaded with Spark, Impala, Crunch, Hive, Pig, Sqoop, Kafka, Flume, Kite, Hue, Oozie, DataFu, and many others [...].
      Is it OK to prepare for the exam on the CDH QuickStart 5.12 VM (although 5.10.1 is provided during the exam)?