Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59

My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 7076 | 06-03-2019 09:31 PM |
| | 1712 | 05-22-2019 02:38 AM |
| | 2153 | 05-22-2019 02:21 AM |
| | 1341 | 05-04-2019 08:17 PM |
| | 1649 | 04-14-2019 12:06 AM |
03-07-2017
06:27 AM
Pig is definitely an option, but a couple of points. If you only do this once a month and have all the daily files (say the 1st - 31st of the month), then understand that Pig doesn't do simple control-loop logic (as identified in the presentation that my answer above points you to), and you'd have to wrap it with some controlling script or something. But... on the other hand... if you get a new daily file each day, then Pig is going to be your best friend, since the previous day's finalized file becomes the new "origs" file from above and you just do the delta processing one file at a time; a rough sketch of such a wrapper is below. I'm sure there's more to it than I'm imagining, but the general pattern I quickly described is HIGHLY leveraged by many Pig users out there. Good luck & thanks for accepting my answer!
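To make that controlling-script idea a bit more concrete, here's a rough, untested sketch. All of the paths, file names, and the ORIGS/DELTA/OUT parameters are made up for illustration, and it assumes the merge script from my other answer has been parameterized.

#!/bin/bash
# hypothetical wrapper around a parameterized Pig merge script (merge_delta.pig)
for day in 2014-09-17 2014-09-18 2014-09-19; do
  # merge this day's delta file into the current "origs" dataset
  pig -param ORIGS=/data/current \
      -param DELTA=/data/incoming/delta_${day}.csv \
      -param OUT=/data/merged_${day} \
      -f merge_delta.pig
  # the finalized output now becomes the "origs" input for the next pass
  hdfs dfs -rm -r -skipTrash /data/current
  hdfs dfs -mv /data/merged_${day} /data/current
done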
03-07-2017
04:53 AM
Hey @Kibrom Gebrehiwot, just wondering if my answer below was able to help you out, and if so, I'd sure appreciate it if you marked it "Best Answer" by clicking the "Accept" link at the bottom of it. 😉 If it isn't helpful, please add a comment to the answer and let me know what concerns you may still have. Thanks!
03-07-2017
04:12 AM
1 Kudo
I have a newly created HDP 2.5.3 cluster with Kerberos enabled, and I'm having trouble getting a simple Storm topology submitted. I do NOT have Ranger installed. I'm following the validation instructions at the bottom of http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_command-line-installation/content/validate_installation_storm.html to run the included simple WordCount topology, which comes down to the following command.

storm jar /usr/hdp/current/storm-client/contrib/storm-starter/storm-starter-topologies-*.jar org.apache.storm.starter.WordCountTopology wordcount

I tried this two different ways with two different results.

** FIRST ATTEMPT ** (the authentication problem!!)

I created a Kerberos ticket for one of my users, student2, as shown below.

[student2@ip-172-30-0-42 target]$ klist
Ticket cache: FILE:/tmp/krb5cc_432201241
Default principal: student2@LAB.HORTONWORKS.NET
Valid starting Expires Service principal
03/07/2017 02:57:33 03/07/2017 12:57:33 krbtgt/LAB.HORTONWORKS.NET@LAB.HORTONWORKS.NET
renew until 03/14/2017 02:57:29

Then I ran the earlier topology submission command and got the following excerpt (full output at student2.txt).

976 [main] INFO o.a.s.s.a.AuthUtils - Got AutoCreds []
1001 [main] WARN o.a.s.s.a.k.ClientCallbackHandler - Could not login: the client is being asked for a password, but the client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)' and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this client using the command 'kinit <princ>' (where <princ> is the name of the client's Kerberos principal). If the latter, do 'kinit -k -t <keytab> <princ>' (where <princ> is the name of the Kerberos principal, and <keytab> is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host's clock is in sync with this host's clock.
1002 [main] ERROR o.a.s.s.a.k.KerberosSaslTransportPlugin - Server failed to login in principal:javax.security.auth.login.LoginException: No password provided
javax.security.auth.login.LoginException: No password provided
at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:919) ~[?:1.8.0_121]
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:760) ~[?:1.8.0_121]

To me... this looks like student2's Kerberos ticket is not making the journey, and thus the authentication exception is being thrown. QUESTION: Is there anything special I need to be doing in order to have the ticket be leveraged at submission time?

** SECOND ATTEMPT ** (the authorization problem!!)

I then figured I'd run the command again, but this time with a valid ticket for the storm user, thinking its God-like powers should prevail.

[root@ip-172-30-0-42 simplestorm]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: storm-telus_training@LAB.HORTONWORKS.NET
Valid starting Expires Service principal
03/07/2017 03:37:16 03/07/2017 13:37:16 krbtgt/LAB.HORTONWORKS.NET@LAB.HORTONWORKS.NET
renew until 03/14/2017 03:37:16

I submitted the WordCount topology again and this time got this excerpt (full output at storm.txt).

2269 [main] INFO o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-cac76801-cea6-4c4e-9420-44d69bd7cb9b.jar
2278 [main] INFO o.a.s.m.n.Login - successfully logged in.
2302 [main] INFO o.a.s.m.n.Login - successfully logged in.
2310 [main] INFO o.a.s.StormSubmitter - Submitting topology wordcount in distributed mode with conf {"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5661685876145720659:-8904469779744658388","topology.workers":3,"topology.debug":true}
Exception in thread "main" java.lang.RuntimeException: AuthorizationException(msg:wordcount-2-1488857970-stormconf.ser does not appear to be a valid blob key)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:255)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:310)

To me... it looks like I got hung up on an authorization problem this time (which probably answers my earlier question about whether anything special is needed for the Kerberos ticket to be passed along), although I'm not sure what that "does not appear to be a valid blob key" message is saying. QUESTION: What settings do I need to check in Ambari to tell Storm to allow all secured users to submit a topology? << reminder: I do NOT have Ranger installed. Any assistance, even a hint, would be greatly appreciated!!
03-05-2017
12:50 AM
1 Kudo
As a quick follow-up to @Sunile Manjee's perfect answer, check out https://martin.atlassian.net/wiki/x/0zToBQ for how to grant DBA-level privileges to these new namespaces. Heck, here it is from the shell's perspective.

grant 'the_username', 'RWXC', '@my_ns'
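If it helps, here's the same thing end to end from a terminal. This is a minimal, untested sketch -- the namespace and username are just the placeholders from above, and it assumes you're on a node with the hbase client installed.

# create the namespace and give the user full rights on it
hbase shell <<'EOF'
create_namespace 'my_ns'
grant 'the_username', 'RWXC', '@my_ns'
EOF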
03-04-2017
05:04 PM
2 Kudos
Great question. My solution below is trimmed out of the presentation described at https://martin.atlassian.net/wiki/x/GYBzAg, which covers a much bigger topic. So, let's assume you have this original file with an ID, a date created, and three "payload" attributes.

[root@sandbox ~]# hdfs dfs -cat origs.csv
11,2014-09-17,base,base,base
12,2014-09-17,base,base,base
13,2014-09-17,base,base,base
14,2014-09-18,base,base,base
15,2014-09-18,base,base,base
16,2014-09-18,base,base,base
17,2014-09-19,base,base,base
18,2014-09-19,base,base,base
19,2014-09-19,base,base,base

Now, let's assume you have a delta file that has 4 new records (IDs 10, 20, 21, and 22) as well as more recent versions of 3 existing records (IDs 12, 14, and 16).

[root@sandbox ~]# hdfs dfs -cat delta.csv
10,2014-09-16,oops,was,missed
20,2014-09-20,base,base,base
21,2014-09-20,base,base,base
22,2014-09-20,base,base,base
12,2014-09-17,base,CHANGED,base
14,2014-09-18,base,CHANGED,base
16,2014-09-18,base,CHANGED,base

Then, in a Pig script, you could join these like this.

origs = LOAD '/user/maria_dev/hcc/86778/original.csv'
USING PigStorage(',') AS
( bogus_id:int, date_cr: chararray,
field1:chararray, field2:chararray, field3:chararray );
delta = LOAD '/user/maria_dev/hcc/86778/delta.csv'
USING PigStorage(',') AS
( bogus_id:int, date_cr: chararray,
field1:chararray, field2:chararray, field3:chararray );
joined = JOIN origs BY bogus_id FULL OUTER, delta BY bogus_id;
DESCRIBE joined;
DUMP joined;

And you get this output.

joined: {origs::bogus_id: int,origs::date_cr: chararray,origs::field1: chararray,origs::field2: chararray,origs::field3: chararray,delta::bogus_id: int,delta::date_cr: chararray,delta::field1: chararray,delta::field2: chararray,delta::field3: chararray}
(,,,,,10,2014-09-16,oops,was,missed)
(11,2014-09-17,base,base,base,,,,,)
(12,2014-09-17,base,base,base,12,2014-09-17,base,CHANGED,base)
(13,2014-09-17,base,base,base,,,,,)
(14,2014-09-18,base,base,base,14,2014-09-18,base,CHANGED,base)
(15,2014-09-18,base,base,base,,,,,)
(16,2014-09-18,base,base,base,16,2014-09-18,base,CHANGED,base)
(17,2014-09-19,base,base,base,,,,,)
(18,2014-09-19,base,base,base,,,,,)
(19,2014-09-19,base,base,base,,,,,)
(,,,,,20,2014-09-20,base,base,base)
(,,,,,21,2014-09-20,base,base,base)
(,,,,,22,2014-09-20,base,base,base)

You'll see above that when the delta record's fields are present (the ones on the right side), those are the values that should be carried forward, since they represent either new (4) or modified (3) records; when they are missing (the other 6 rows), the original values should simply roll forward.

merged = FOREACH joined GENERATE
((delta::bogus_id is not null) ? delta::bogus_id: origs::bogus_id) as bogus_id,
((delta::date_cr is not null) ? delta::date_cr: origs::date_cr) as date_cr,
((delta::field1 is not null) ? delta::field1: origs::field1) as field1,
((delta::field2 is not null) ? delta::field2: origs::field2) as field2,
((delta::field3 is not null) ? delta::field3: origs::field3) as field3;
DESCRIBE merged;
DUMP merged;

As you can see from the combined output, we have the necessary 13 rows in the new dataset.

merged: {bogus_id: int,date_cr: chararray,field1: chararray,field2: chararray,field3: chararray}
(10,2014-09-16,oops,was,missed)
(11,2014-09-17,base,base,base)
(12,2014-09-17,base,CHANGED,base)
(13,2014-09-17,base,base,base)
(14,2014-09-18,base,CHANGED,base)
(15,2014-09-18,base,base,base)
(16,2014-09-18,base,CHANGED,base)
(17,2014-09-19,base,base,base)
(18,2014-09-19,base,base,base)
(19,2014-09-19,base,base,base)
(20,2014-09-20,base,base,base)
(21,2014-09-20,base,base,base)
(22,2014-09-20,base,base,base)

Good luck and happy Hadooping!
03-03-2017
03:23 PM
2 Kudos
https://martin.atlassian.net/wiki/x/C4BRAQ shows an example (an old one that I did back with Hue) of loading the jar file into HDFS and then registering it, as in the example below.

REGISTER 'hdfs:///user/hue/shared/pig/udfs/exampleudf.jar';
DEFINE SIMPLEUPPER exampleudf.UPPER();
typing_line = LOAD '/user/hue/testData/typingText.txt' AS (row:chararray);
upper_typing_line = FOREACH typing_line GENERATE SIMPLEUPPER(row);
DUMP upper_typing_line;

Good luck & happy Hadooping!
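And just in case it isn't obvious, the jar has to actually be sitting in HDFS at the path that REGISTER points to. Something like the following would get it there -- the local jar location and script name are made up for illustration.

# copy the UDF jar up to the HDFS path used in the REGISTER statement above
hdfs dfs -mkdir -p /user/hue/shared/pig/udfs
hdfs dfs -put exampleudf.jar /user/hue/shared/pig/udfs/
# then run the script
pig -f upper_example.pig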
02-26-2017
06:18 PM
For Pig and Hive implementations, I'd suggest you create a UDF. If this is new territory for you, here are a couple of quick blog posts on creating simple UDFs for Pig and Hive: https://martin.atlassian.net/wiki/x/C4BRAQ and https://martin.atlassian.net/wiki/x/GoBRAQ. Good luck!
02-26-2017
06:13 PM
No current plans to do that. We have vendors that host the exams whom we have to pay, and we cannot turn that into an expense for ourselves; sorry. That said, we are considering bundling a voucher with our https://hortonworks.com/services/training/class/hadoop-essentials/ class offering, but again, that is only an internal discussion at this time. Best of luck when you take the test!!
02-20-2017
12:27 AM
I've got a working example at https://github.com/lestermartin/oss-transform-processing-comparison/tree/master/profiling#hive that shows column stats.
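If you just want the short version without clicking through, the general recipe looks like the following. This is only a hedged sketch: the table and column names (web_logs, status_code) are made up, and I'm assuming a Hive version recent enough to support column statistics.

# compute column-level statistics, then inspect them for one column
hive -e "ANALYZE TABLE web_logs COMPUTE STATISTICS FOR COLUMNS;"
hive -e "DESCRIBE FORMATTED web_logs status_code;"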
02-13-2017
02:36 PM
Good point. For my Sandbox testing, I decided to just use the steps provided in http://stackoverflow.com/questions/40550011/zeppelin-how-to-restart-sparkcontext-in-zeppelin to stop the SparkContext when I need to do something outside of Zeppelin. Not ideal, but it works well enough for the multi-framework prototyping I'm doing.