Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 7170 | 06-03-2019 09:31 PM
 | 1744 | 05-22-2019 02:38 AM
 | 2194 | 05-22-2019 02:21 AM
 | 1382 | 05-04-2019 08:17 PM
 | 1684 | 04-14-2019 12:06 AM
10-06-2016 04:50 PM
The exams and certifications described at http://hortonworks.com/training/certification/ are still current. A decision has been made to postpone the launch of the updated certification program to a future date. As part of that eventual announcement, more time will be allowed for those currently preparing for the existing certification exams to complete them. So... yes, you can continue to take the current exams. Good luck!
10-06-2016 04:46 PM
RDD's saveAsTextFile does not give us that option (DataFrames have "save modes" for things like append/overwrite/ignore). You'll have to handle this yourself, either beforehand (maybe delete or rename the existing data) or afterwards (write the RDD to a different directory and then swap it in).
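The "swap it out" idea can be sketched in plain Python, with local directories standing in for the HDFS paths an RDD would write to. All names here are made up for illustration; with a real RDD, `write_fn` would be a call to `saveAsTextFile` on the scratch path:

```python
import os
import shutil
import tempfile

def overwrite_output(write_fn, final_dir):
    """Write to a scratch directory first, then swap it into place.

    write_fn is whatever produces the output (for an RDD it would be
    rdd.saveAsTextFile on the scratch path); here it is any callable
    that fills a directory it is given.
    """
    scratch_dir = tempfile.mkdtemp(prefix="rdd-output-")
    write_fn(scratch_dir)                      # write the new data first
    if os.path.exists(final_dir):              # only then drop the old copy
        shutil.rmtree(final_dir)
    shutil.move(scratch_dir, final_dir)        # swap the new data in

# Hypothetical writer: produces two "part" files like a 2-partition RDD would.
def fake_save(path):
    for i in range(2):
        with open(os.path.join(path, f"part-{i:05d}"), "w") as f:
            f.write("some records\n")

overwrite_output(fake_save, "/tmp/demo-output")
overwrite_output(fake_save, "/tmp/demo-output")  # second run replaces, not appends
```

The upside of writing first and swapping second is that the old data stays readable until the new data is fully written.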
10-06-2016 04:29 PM
2 Kudos
It is clear you've done some research on YARN already, so I'll try to respond briefly to each point to see if my responses are what you need to proceed.

1. In general, the RM picks a worker node that can run the AM container, so yes, it's fair to visualize it as a round-robin approach.
2. The RM has a subtask that is responsible for watching the AMs and, if one of them dies, can restart its container on another node. The AM itself needs to be written in such a way that it is restartable, but generally speaking the ones we all use (MR, Tez, Spark, etc.) are. That said, "the client" may or may not be affected, but the "job" itself can run to completion despite an AM failure.
3. It is up to the AM to request from the RM what container sizes it needs (and how many). So, yes, theoretically they could be of different sizes, but they will often be the same size. Spark is a good example: we usually ask for N containers, but want them all to have the same number of cores and amount of memory.
4. I believe the RM is going to ensure all jars are shipped to the NodeManager (NM) instances on the worker nodes where your containers will run. I also believe you have some options for pre-placing your jars on HDFS, but it's been a while since I toyed with that, and it was with MapReduce.

Hope this helps your understanding some. Good luck!
10-06-2016 11:19 AM
Just get rid of "exec" as it thinks it is the script you are trying to run.
10-05-2016 02:35 PM
1 Kudo
I put together a quick tutorial at https://martin.atlassian.net/wiki/x/C4BRAQ if it might help.
09-22-2016 04:30 PM
Doh! @Sriharsha Chintalapani answered the question in the comments section of another answer, telling me "if you have parallelism lower than the topic partitions each executor of kafka spout will get multiple partitions to read from". Good stuff.
09-22-2016 04:27 PM
Obviously it is on me to test it out 🙂 BUT... any initial thought on what happens when you have a smaller number of spout instances than the number of partitions for the Kafka topic? Clearly, either the spout instances double (or triple, or more) up on which partitions they take care of, or we just don't consume the messages on the partitions that don't have a spout instance.
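The first possibility is how partition assignment generally works for Kafka-style consumers: every partition gets an owner, and some consumers simply own more than one. A minimal round-robin sketch in plain Python (names made up for illustration; the real spout's assignment logic may differ in detail):

```python
def assign_partitions(num_partitions, num_spouts):
    """Spread Kafka partitions across spout instances round-robin.

    Returns a list where entry i is the list of partition ids
    owned by spout instance i.
    """
    assignment = [[] for _ in range(num_spouts)]
    for partition in range(num_partitions):
        assignment[partition % num_spouts].append(partition)
    return assignment

# 8 partitions, 3 spout instances: every partition is still consumed;
# some spouts just "double up" on partitions.
print(assign_partitions(8, 3))
# → [[0, 3, 6], [1, 4, 7], [2, 5]]
```

The key property is that no partition goes unowned, which matches the behavior described above: lower parallelism just means each spout executor reads from multiple partitions.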
08-28-2016 03:37 PM
1 Kudo
It sounds like things are working as expected. Please consider a few things that may not be clear to you with regard to the number of underlying files.

First, when you do a subsequent insert (or load) into a (non-bucketed) table with existing data, you will NOT merge the contents together into a single file. You can test this out by loading the same simple file of 10 or so rows multiple times. You'll see that on the 2nd and 3rd insert/load you will have an identical 2nd and then 3rd file in the underlying Hive table's HDFS folder.

Second, for a new bucketed table that you add data to, there is not really a guarantee that the number of files will align to the number of buckets. With bucket hashing occurring on the CLUSTERED BY field, it is possible to have fewer files if the data doesn't align well. To see that in practice, create a table with 32 buckets and load a file with only 10 records into it. At most, you'll have 10 files (and possibly fewer). Additionally, if the data being added ends up with enough rows in a particular bucket to exceed the file block size, you'll actually get more than one file for that bucket.

So... what is happening on subsequent inserts/loads is that you are just creating new files that align to how the new data is bucketed, and they sit alongside the files that are already there. At this point, Hive can still benefit from bucketing by lining up more than one file per bucket against the joining table's bucketed data (yes, it may have multiple files, too, for that same bucket). If you want as few files as possible (just one for each bucket, if all of a particular bucket's data fits within the block size), then you're right; you'll need to load the contents of this table into another table -- possibly using an ITAS or CTAS strategy.
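The "10 records into 32 buckets" point can be illustrated outside of Hive. This is a toy sketch in plain Python (Hive's actual bucket hash function is different; md5 here is just a deterministic stand-in to show why a handful of rows can map to even fewer buckets):

```python
import hashlib

NUM_BUCKETS = 32

def bucket_for(key, num_buckets=NUM_BUCKETS):
    # Stand-in for Hive's bucket hashing on the CLUSTERED BY column.
    # Hive uses its own hash; md5 just gives us a deterministic demo.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_buckets

# 10 rows clustered by a hypothetical customer id: at most 10 of the
# 32 possible buckets receive data, so at most 10 files are written --
# possibly fewer if two ids happen to hash to the same bucket.
rows = [f"customer-{i}" for i in range(10)]
buckets_hit = {bucket_for(r) for r in rows}
print(len(buckets_hit))   # ≤ 10, out of 32 possible buckets
```

Each distinct bucket that receives data becomes (at least) one file in the table's HDFS folder, which is why 10 rows can never produce 32 files.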
08-09-2016 09:38 PM
You're more than welcome @Ashish Vishnoi. If it was helpful, and it is appropriate, I'd sure appreciate you marking my response as "Best Answer" to help me build up my points. 😉
08-08-2016 03:23 PM
2 Kudos
Actually, JOINs in Pig work about the same as they do in Hive. I wrote up a quick blog post at https://martin.atlassian.net/wiki/x/AgCfB based on your question and just made up some data, since you didn't provide any. I'm sure you did this, but the easiest thing to do is simplify your join: get it working on two relations first, then add the third, and eventually the fourth if all is working. If "e" never gets populated, my hunch is that it is a data issue, not a Pig issue. As for saving into a partitioned Hive table, my blog post shows an example of that working, and it points back to https://community.hortonworks.com/questions/2562/appending-to-existing-partition-with-pig.html to address the fact that Pig cannot write to an existing partition (and the strategies for working around that).