Member since 09-23-2015
151 Posts
110 Kudos Received
50 Solutions
My Accepted Solutions
Views | Posted |
---|---|
1353 | 08-12-2016 04:05 AM |
2640 | 08-07-2016 03:58 AM |
1164 | 07-27-2016 06:24 PM |
1704 | 07-20-2016 03:14 PM |
1229 | 07-18-2016 12:54 PM |
03-21-2016
12:21 AM
An easier solution would be to read every row, and then add a check to see if it's a header row by checking one of the columns that you know only appears in a header row.
03-21-2016
12:03 AM
I'm not sure why you are getting that exception, but I do know that you do not need to get the path of any input files on the exam. What are you trying to do? Showing more of your code might help me to provide more insight.
03-18-2016
04:05 AM
1 Kudo
There is some good information in this thread, but I worry that the discussion about the MultiStorage class in the piggybank will make it seem like that class is needed on the HDPCD exam. The MultiStorage class is not a part of the exam objectives. For the exam, you need to know how to use the PARALLEL clause, which, if used at the right point in a Pig script, determines the number of output files. So to summarize: the HDPCD exam does not require the use of MultiStorage, but may require the use of PARALLEL.
03-16-2016
02:42 PM
3 Kudos
Sorting in MR applies to two areas:
1. Sort output by keys: this is done "naturally" in the sense that the keys are sorted as they come into the reducer. The compareTo method in the key class determines this natural sorting.
2. Secondary sort: both the keys and values are sorted. That involves writing a group comparator class and then registering that class with the MR Job using the setGroupingComparatorClass method.
The exam objective you listed above is referring to both. The first one is fairly straightforward: you implement the compareTo method in your key class. The secondary sort involves a bit more work. There is a nice blog here that has an example of how to implement a secondary sort: https://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/
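To make the two kinds of sorting concrete, here is a minimal sketch in plain Java that only simulates the idea; it does not use any Hadoop classes. The names SecondarySortSketch, CompositeKey, GROUPING, symbol and price are hypothetical and chosen purely for illustration. compareTo plays the role of the key's natural sort (natural key first, then the value), and GROUPING plays the role of the comparator you would register with setGroupingComparatorClass, looking only at the natural key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {

    // Hypothetical composite key: the natural key plus the value to sort by.
    static final class CompositeKey implements Comparable<CompositeKey> {
        final String symbol;
        final int price;

        CompositeKey(String symbol, int price) {
            this.symbol = symbol;
            this.price = price;
        }

        // Sort order: by symbol first, then by price within a symbol.
        @Override
        public int compareTo(CompositeKey other) {
            int bySymbol = symbol.compareTo(other.symbol);
            return bySymbol != 0 ? bySymbol : Integer.compare(price, other.price);
        }
    }

    // Stand-in for a grouping comparator: only the natural key is compared.
    static final Comparator<CompositeKey> GROUPING =
            (a, b) -> a.symbol.compareTo(b.symbol);

    // Sort the keys, then walk them once and start a new group whenever
    // the grouping comparator reports a different natural key.
    static Map<String, List<Integer>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        List<Integer> current = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUPING.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key.symbol, current);
            }
            current.add(key.price);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>();
        keys.add(new CompositeKey("IBM", 30));
        keys.add(new CompositeKey("AAPL", 20));
        keys.add(new CompositeKey("IBM", 10));
        keys.add(new CompositeKey("AAPL", 5));
        // Prints {AAPL=[5, 20], IBM=[10, 30]}
        System.out.println(sortAndGroup(keys));
    }
}
```

Each group sees its values already in sorted order, which is the effect a secondary sort is after. In a real MR job the composite key would be a Writable key class, and you would typically also need a partitioner that partitions on the natural key so that all records for one natural key reach the same reducer.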
03-16-2016
12:07 PM
1 Kudo
@Sanjay Ungarala - thank you for pointing this out. We decided a while back to remove this objective from the HDPCD:Java exam, but the website was not changed accordingly. I just updated the website and removed that objective.
LocalResource is used heavily in YARN, but it is beyond the scope of our Java exam.
03-15-2016
11:45 AM
1 Kudo
You are given access to docs.hortonworks.com on the real exam, and that is the only website you can access. The practice exam is meant to give you an idea of what the real exam tasks are like and to help you become familiar with the environment of the real exam. Completing it is not enough to pass the exam, though. As I said, to be fully prepared, you should be able to perform all of the tasks listed in the exam objectives on our website.
03-12-2016
02:52 PM
1 Kudo
@Mahesh Deshmukh I'm not sure what your need is, but null values should be filtered first. The general rule of thumb in Pig is to "filter early and often" to minimize the amount of data that gets shuffled and sorted, so filter before the foreach. For example:
    -- a is some Pig relation
    b = filter a by $1 is not null; -- filter out tuples where the $1 field is null
    c = foreach b generate ... -- no need to worry about $1 being null
The term "empty" typically refers to bags, and in particular you can use the IsEmpty function to check if a bag is empty. You normally do this after a GROUP command:
    -- a is some Pig relation
    b = group a by $3;
    c = filter b by not IsEmpty(a); -- a is the bag of grouped tuples; group is the key, not a bag
What are you trying to accomplish?
03-11-2016
07:56 PM
3 Kudos
Sure - the following simple script uses 3 reducers on the last operation, so there will be 3 output files:
    a = load 'something';
    b = order a by $1 parallel 3;
    store b into 'somewhere';
PARALLEL is not an option on STORE, but it is an option on a lot of other Pig operations.
03-11-2016
12:13 PM
3 Kudos
In case someone is searching for this with regard to the Hortonworks Certified Developer exam, the question was also asked here: https://community.hortonworks.com/questions/22439/where-do-i-get-references-for-piggybank.html
Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the purposes of the certification exam, using PARALLEL is all you need to accomplish this task, and it is much simpler than trying to register the piggybank and use a special output class.
03-11-2016
12:09 PM
2 Kudos
Many of the Pig operators have a PARALLEL option for specifying the number of reducers, which also determines the number of output files. For the purposes of the practice exam and the real exam, using PARALLEL is all you need to accomplish this task.