First off, I'm a big fan of this new CCP: Data Engineer exam approach. I know Cloudera is working things out right now, and I have a few questions that perhaps someone can provide some input on:
You can set up a single node running Hadoop in pseudo-distributed mode using VMs or Amazon EC2 instances with the Cloudera software and the other tools installed on them.
Then you can come up with your own projects to practice on.
Once you have set up the tools, try solving the hypothetical problems that are called out here.
They are not really hard to come up with.
The Data Ingestion portion covers Flume, HDFS shell commands, and Sqoop.
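For instance, a quick ingestion drill might look like the sketch below. It just shells out to the real CLIs from Python; the database host, credentials, table, and file names are all made-up placeholders for whatever your practice setup uses.

```python
import subprocess

# Stage a local file into HDFS -- basic HDFS shell practice.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/practice/raw"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "events.log", "/user/practice/raw/"], check=True)

# Pull a relational table into HDFS with Sqoop (connection details are invented).
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/practice_db",   # hypothetical database
    "--username", "practice",
    "--password-file", "/user/practice/.sqoop_pw",    # keeps the password off the command line
    "--table", "orders",                              # hypothetical table
    "--target-dir", "/user/practice/orders",
    "-m", "1",                                        # one mapper is plenty for a small table
], check=True)
```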
The Transform, Stage, and Store section covers Pig, Hive, MapReduce, and Spark skills.
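To give one concrete example of the transform/stage/store side, a tiny PySpark job could read the raw data staged above, aggregate it, and store the result as Parquet; the paths and column names here are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("practice-transform").getOrCreate()

# Read the raw data staged earlier (path and columns are hypothetical).
orders = (spark.read.csv("/user/practice/orders", inferSchema=True)
          .toDF("order_id", "customer_id", "amount"))

# A simple transformation: total spend per customer.
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

# Store the result in a columnar format for the next stage.
totals.write.mode("overwrite").parquet("/user/practice/customer_totals")

spark.stop()
```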
For the Data Analysis part, you need to know how to create tables in Hive that use a SerDe and other custom settings.
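As an illustration, here is one way to practice that using the OpenCSVSerde that ships with Hive, run through a Hive-enabled SparkSession. The table name and location are placeholders, and you could equally run the same DDL from the Hive shell.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("practice-serde")
         .enableHiveSupport().getOrCreate())

# An external table whose rows are parsed by a SerDe instead of the default
# delimiters. Note that OpenCSVSerde reads every column as a string.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_orders (
        order_id STRING, customer_id STRING, amount STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
    STORED AS TEXTFILE
    LOCATION '/user/practice/orders'
""")

spark.sql("SELECT customer_id, amount FROM raw_orders LIMIT 10").show()
spark.stop()
```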
The Workflow portion covers the skills you need from Apache Oozie.
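A full workflow definition is too long to paste here, but submitting one you have written is a one-liner against the Oozie CLI. In this sketch the server URL and properties file are placeholders for your own practice cluster.

```python
import subprocess

# Submit and start a workflow. job.properties points at the workflow.xml
# you have uploaded to HDFS; both are your own practice artifacts.
subprocess.run([
    "oozie", "job",
    "-oozie", "http://localhost:11000/oozie",  # placeholder Oozie server URL
    "-config", "job.properties",
    "-run",
], check=True)
```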
Will the data transformation part have to be done in Pig, Hive, and Spark (all three), or is it our choice?
I mean, do we have the flexibility to choose tools, or is a specific tool mandated by the exam?
All the skills you need to be prepared to use in the hands-on exam are on the CCP Data Engineer page. The page lists the exam delivery and cluster details, the documentation available, and even a sample exam question to give you a feel for the exam.