About rich1

rich1 · ‎03-21-2016

An easier solution would be to read every row, and then add a check to see if it's a header row by checking one of the columns that you know only appears in a header row.
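As a rough illustration of the check described above, here is a minimal plain-Java sketch. It assumes comma-separated rows and that the literal value `id` in the first column appears only in header rows. The class name `HeaderRowFilter`, the marker column, and the marker value are all hypothetical and would need to match your actual data.

```java
import java.util.ArrayList;
import java.util.List;

final class HeaderRowFilter {
    // Assumption: column 0 contains the text "id" only in header rows.
    static final int MARKER_COLUMN = 0;
    static final String MARKER_VALUE = "id";

    // Returns true when the marker column holds the header-only value.
    static boolean isHeader(String row) {
        String[] cols = row.split(",", -1);
        return cols.length > MARKER_COLUMN
                && MARKER_VALUE.equals(cols[MARKER_COLUMN].trim());
    }

    // Reads every row and keeps only the ones that are not header rows.
    static List<String> dataRows(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            if (!isHeader(row)) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("id,name", "1,alice", "id,name", "2,bob");
        System.out.println(dataRows(rows));
    }
}
```

The same test can be moved into a mapper or a Pig FILTER; the point is only that every row is read and header rows are recognized by a value that never occurs in data rows.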

rich1 · ‎03-21-2016

I'm not sure why you are getting that exception, but I do know that you do not need to get the path of any input files on the exam. What are you trying to do? Showing more of your code might help me to provide more insight.

rich1 · ‎03-18-2016

There is some good information in this thread, but I worry that the discussion about the MultiStorage class in the piggybank will give the impression that it is needed on the HDPCD exam. The MultiStorage class is not a part of the exam objectives. For the exam, you need to know how to use the PARALLEL clause, which, if used at the right point in a Pig script, determines the number of output files. So to summarize: the HDPCD exam does not require the use of MultiStorage, but may require the use of PARALLEL.

rich1 · ‎03-16-2016

Sorting in MR applies to two areas:

Sort output by keys: this is done "naturally" in the sense that the keys are sorted as they come into the reducer. The compareTo method in the key class determines this natural sorting.

Secondary sort: both the keys and values are sorted. That involves writing a grouping comparator class and then registering that class with the MR Job using the setGroupingComparatorClass method.

The exam objective you listed above is referring to both. The first one is fairly straightforward: you implement the compareTo method in your key class. The secondary sort involves a bit more work. There is a nice blog here that has an example of how to implement a secondary sort: https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
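To make the two ideas concrete, here is a small plain-Java sketch that imitates the ordering logic outside of Hadoop. It is not Hadoop code: `SecondarySortSketch`, `CompositeKey`, `symbol`, and `timestamp` are hypothetical names, and in a real job the key would implement WritableComparable and the grouping comparator would be registered on the Job. The sketch only shows that compareTo orders by natural key and then by value, while the grouping comparator looks at the natural key alone.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class SecondarySortSketch {
    // Hypothetical composite key: the natural key plus the value to sort on.
    static final class CompositeKey implements Comparable<CompositeKey> {
        final String symbol;
        final long timestamp;

        CompositeKey(String symbol, long timestamp) {
            this.symbol = symbol;
            this.timestamp = timestamp;
        }

        // Full sort order: natural key first, then the value part.
        @Override
        public int compareTo(CompositeKey other) {
            int c = symbol.compareTo(other.symbol);
            return c != 0 ? c : Long.compare(timestamp, other.timestamp);
        }

        @Override
        public String toString() {
            return symbol + "@" + timestamp;
        }
    }

    // Grouping comparator: keys with the same natural key share one group.
    static final Comparator<CompositeKey> GROUPING =
            Comparator.comparing((CompositeKey k) -> k.symbol);

    // Sorts with compareTo, then splits the sorted run wherever GROUPING changes.
    static List<List<CompositeKey>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        List<List<CompositeKey>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUPING.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = List.of(
                new CompositeKey("b", 2), new CompositeKey("a", 3),
                new CompositeKey("a", 1), new CompositeKey("b", 1));
        System.out.println(sortAndGroup(keys));
    }
}
```

Each inner list plays the role of one reduce call: the entries arrive grouped by natural key and already ordered by the value part.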

rich1 · ‎03-16-2016

@Sanjay Ungarala - thank you for pointing this out. We decided a while back to remove this objective from the HDPCD:Java exam, but the website was not changed accordingly. I just updated the website and removed that objective. LocalResource is used heavily in YARN, but it is beyond the scope of our Java exam.

rich1 · ‎03-15-2016

You are given access to docs.hortonworks.com on the real exam, and that is the only website you can access. The practice exam is meant to give you an idea of what the real exam tasks are like and also to become familiar with the environment of the real exam. It is not enough to pass the exam though. Like I said, to be fully prepared, you should be able to perform all of the tasks listed in the exam objectives on our website.

rich1 · ‎03-12-2016

@Mahesh Deshmukh I'm not sure what your need is, but null values should be filtered first. The general rule of thumb in Pig is to "filter early and often" to minimize the amount of data that gets shuffled and sorted, so filter before the foreach. For example:

    a = "some Pig relation"
    b = filter a by $1 is not null; -- filter out tuples where the $1 field is null
    c = foreach b generate ...      -- no need to worry about $1 being null

The term "empty" typically refers to bags, and in particular you can use the IsEmpty function to check if a bag is empty. You normally do this after a GROUP command, checking the bag of grouped tuples (named a here, after the grouped relation):

    a = "some Pig relation"
    b = group a by $3;
    c = filter b by not IsEmpty(a);

What are you trying to accomplish?

rich1 · ‎03-11-2016

Sure - the following simple script uses 3 reducers on the last operation, so there will be 3 output files:

    a = load 'something';
    b = order a by $1 parallel 3;
    store b into 'somewhere';

PARALLEL is not an option on STORE, but it is an option on a lot of other Pig operations.

rich1 · ‎03-11-2016

In case someone is searching for this with regard to the Hortonworks Certified Developer exam, the question was also asked here: https://community.hortonworks.com/questions/22439/where-do-i-get-references-for-piggybank.html Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the purposes of the certification exam, using PARALLEL is all you need to accomplish this task, and it is much simpler than trying to register the piggybank and use a special output class.

rich1 · ‎03-11-2016

Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the intent of the practice exam and the real exam, using PARALLEL is all you need to accomplish this task.

Member Since	‎09-23-2015 07:25 PM
Last Visited	‎07-24-2019 05:34 PM
Posts	151
Kudos received	109

Cloudera Community

Re: HDPCD exam concern about Flume question.

Re: which version of spark is available HDPCD:SPAR...

Re: is JAVA knowldege is mandatory for HDPCD exam ...

Re: examslocal:Getting error while login...

Re: HDPCD Spark environment

Re: HDP Certified Java Developer practice exam

Re: HDP Certified Java Developer practice exam

Re: Store output file as 3 files using pig

Re: Sort the output of a MapReduce job

Re: Using a LocalResource instance to distribute f...

Re: HDPCA - Preparation

Re: how to handle null or empty scenarios in forea...

Re: Where do I get references for PiggyBank.

Re: Store output file as 3 files using pig

Re: Where do I get references for PiggyBank.