03-07-2017
06:27 AM
Pig is definitely an option, but a couple of points. If you only do this once a month and have all the daily files (say, the 1st through the 31st of the month), understand that Pig doesn't do simple control-loop logic (as identified in the presentation that my answer above points you to), so you'd have to wrap it with some controlling script. On the other hand, if you get a new daily file each day, then Pig is going to be your best friend: the previous day's finalized file becomes the new "origs" file from above, and you just do the delta processing one file at a time. I'm sure there's more to it than I'm imagining, but that general pattern is HIGHLY leveraged by many Pig users out there. Good luck & thanks for accepting my answer!
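For the once-a-month case, the "controlling script" could look something like the minimal Python sketch below. It assumes a hypothetical merge_delta.pig script that reads $ORIGS and $DELTA and stores its result at $MERGED; the script name, the parameter names, and the /data/daily directory layout are all made up for illustration. Only the `pig -param` and `-f` options are real. Each day's merged output is fed in as the next day's "origs".

```python
from datetime import date, timedelta


def daily_pig_commands(first_day, last_day,
                       script="merge_delta.pig",   # hypothetical script name
                       base_dir="/data/daily"):    # hypothetical HDFS layout
    """Build one `pig` command per day, chaining each day's output into the next."""
    commands = []
    origs = f"{base_dir}/origs"  # starting "origs" dataset (assumed location)
    day = first_day
    while day <= last_day:
        stamp = day.isoformat()
        merged = f"{base_dir}/merged/{stamp}"
        commands.append([
            "pig",
            "-param", f"ORIGS={origs}",
            "-param", f"DELTA={base_dir}/delta/{stamp}.csv",
            "-param", f"MERGED={merged}",
            "-f", script,
        ])
        origs = merged  # today's finalized file is tomorrow's "origs"
        day += timedelta(days=1)
    return commands


# Print the commands instead of running them; to execute one,
# pass it to subprocess.run(cmd, check=True).
for cmd in daily_pig_commands(date(2017, 3, 1), date(2017, 3, 31)):
    print(" ".join(cmd))
```

Printing first makes it easy to eyeball the chain of inputs and outputs before letting the loop launch 31 Pig jobs.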
03-07-2017
04:53 AM
Hey @Kibrom Gebrehiwot, just wondering if my answer below was able to help you out. If so, I'd sure appreciate it if you marked it "Best Answer" by clicking the "Accept" link at the bottom of it. 😉 If it isn't helpful, please add a comment to the answer and let me know what concerns you may still have. Thanks!
03-07-2017
04:12 AM
1 Kudo
I have a newly created HDP 2.5.3 cluster with Kerberos enabled, and I'm having trouble getting a simple Storm topology submitted. I do NOT have Ranger installed. I'm following the validation instructions at the bottom of http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_command-line-installation/content/validate_installation_storm.html to run the included simple WordCount topology, which reads as follows.

storm jar /usr/hdp/current/storm-client/contrib/storm-starter/storm-starter-topologies-*.jar org.apache.storm.starter.WordCountTopology wordcount

I tried this two different ways with two different results.

** FIRST ATTEMPT ** (the authentication problem!!)

I created a Kerberos ticket for one of my users, student2, as shown below.

[student2@ip-172-30-0-42 target]$ klist
Ticket cache: FILE:/tmp/krb5cc_432201241
Default principal: student2@LAB.HORTONWORKS.NET
Valid starting Expires Service principal
03/07/2017 02:57:33 03/07/2017 12:57:33 krbtgt/LAB.HORTONWORKS.NET@LAB.HORTONWORKS.NET
renew until 03/14/2017 02:57:29

Then I ran the earlier topology submission command and got the following excerpt (full output at student2.txt).

976 [main] INFO o.a.s.s.a.AuthUtils - Got AutoCreds []
1001 [main] WARN o.a.s.s.a.k.ClientCallbackHandler - Could not login: the client is being asked for a password, but the client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)' and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this client using the command 'kinit <princ>' (where <princ> is the name of the client's Kerberos principal). If the latter, do 'kinit -k -t <keytab> <princ>' (where <princ> is the name of the Kerberos principal, and <keytab> is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host's clock is in sync with this host's clock.
1002 [main] ERROR o.a.s.s.a.k.KerberosSaslTransportPlugin - Server failed to login in principal:javax.security.auth.login.LoginException: No password provided
javax.security.auth.login.LoginException: No password provided
at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:919) ~[?:1.8.0_121]
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:760) ~[?:1.8.0_121]

To me... this looks like student2's Kerberos ticket is not making the journey and thus the authentication exception is being thrown.

QUESTION: Is there anything special I need to do in order to have the ticket be leveraged at submission time?

** SECOND ATTEMPT ** (the authorization problem!!)

I then figured I'd try to run the command again, but this time with a valid ticket for the storm user, thinking that its God-like powers should prevail.

[root@ip-172-30-0-42 simplestorm]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: storm-telus_training@LAB.HORTONWORKS.NET
Valid starting Expires Service principal
03/07/2017 03:37:16 03/07/2017 13:37:16 krbtgt/LAB.HORTONWORKS.NET@LAB.HORTONWORKS.NET
renew until 03/14/2017 03:37:16

I submitted the WC topology again and this time got this excerpt (full output at storm.txt).

2269 [main] INFO o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-cac76801-cea6-4c4e-9420-44d69bd7cb9b.jar
2278 [main] INFO o.a.s.m.n.Login - successfully logged in.
2302 [main] INFO o.a.s.m.n.Login - successfully logged in.
2310 [main] INFO o.a.s.StormSubmitter - Submitting topology wordcount in distributed mode with conf {"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5661685876145720659:-8904469779744658388","topology.workers":3,"topology.debug":true}
Exception in thread "main" java.lang.RuntimeException: AuthorizationException(msg:wordcount-2-1488857970-stormconf.ser does not appear to be a valid blob key)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:255)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:310)

To me... it looks like I got hung up on an authorization problem this time (which probably answers my earlier question about whether anything special is needed for the Kerberos ticket to be passed along), although I'm not sure what that "does not appear to be a valid blob key" message is saying.

QUESTION: What settings do I need to check in Ambari that would tell Storm to allow all secured users to submit a topology? (Reminder: I do NOT have Ranger installed.)

Any assistance, even a hint, would be greatly appreciated!!
03-05-2017
12:50 AM
1 Kudo
As a quick follow-up to @Sunile Manjee's perfect answer, check out https://martin.atlassian.net/wiki/x/0zToBQ for how to grant DBA-level privileges to these new namespaces. Heck, here it is from the shell's perspective.

grant 'the_username', 'RWXC', '@my_ns'
03-04-2017
05:04 PM
2 Kudos
Great question. My solution below is trimmed out of the presentation described at https://martin.atlassian.net/wiki/x/GYBzAg on a much bigger topic. So, let's assume you have this original file that has an ID, a date created, and three "payload" attributes.

[root@sandbox ~]# hdfs dfs -cat origs.csv
11,2014-09-17,base,base,base
12,2014-09-17,base,base,base
13,2014-09-17,base,base,base
14,2014-09-18,base,base,base
15,2014-09-18,base,base,base
16,2014-09-18,base,base,base
17,2014-09-19,base,base,base
18,2014-09-19,base,base,base
19,2014-09-19,base,base,base

Now, let's assume you have a delta file that has 4 new records (IDs 10, 20, 21, and 22) as well as more recent versions of 3 other records (IDs 12, 14, and 16).

[root@sandbox ~]# hdfs dfs -cat delta.csv
10,2014-09-16,oops,was,missed
20,2014-09-20,base,base,base
21,2014-09-20,base,base,base
22,2014-09-20,base,base,base
12,2014-09-17,base,CHANGED,base
14,2014-09-18,base,CHANGED,base
16,2014-09-18,base,CHANGED,base

Then, in a Pig script you could join these like this.

origs = LOAD '/user/maria_dev/hcc/86778/original.csv'
USING PigStorage(',') AS
( bogus_id:int, date_cr: chararray,
field1:chararray, field2:chararray, field3:chararray );
delta = LOAD '/user/maria_dev/hcc/86778/delta.csv'
USING PigStorage(',') AS
( bogus_id:int, date_cr: chararray,
field1:chararray, field2:chararray, field3:chararray );
joined = JOIN origs BY bogus_id FULL OUTER, delta BY bogus_id;
DESCRIBE joined;
DUMP joined;

And get this output.

joined: {origs::bogus_id: int,origs::date_cr: chararray,origs::field1: chararray,origs::field2: chararray,origs::field3: chararray,delta::bogus_id: int,delta::date_cr: chararray,delta::field1: chararray,delta::field2: chararray,delta::field3: chararray}
(,,,,,10,2014-09-16,oops,was,missed)
(11,2014-09-17,base,base,base,,,,,)
(12,2014-09-17,base,base,base,12,2014-09-17,base,CHANGED,base)
(13,2014-09-17,base,base,base,,,,,)
(14,2014-09-18,base,base,base,14,2014-09-18,base,CHANGED,base)
(15,2014-09-18,base,base,base,,,,,)
(16,2014-09-18,base,base,base,16,2014-09-18,base,CHANGED,base)
(17,2014-09-19,base,base,base,,,,,)
(18,2014-09-19,base,base,base,,,,,)
(19,2014-09-19,base,base,base,,,,,)
(,,,,,20,2014-09-20,base,base,base)
(,,,,,21,2014-09-20,base,base,base)
(,,,,,22,2014-09-20,base,base,base)

You'll see above that if the delta record's fields are present (the ones on the right side), then they should be the ones carried forward, as they belong to either new (4) or modified (3) records. If they are missing (6), then the original values should just roll forward.

merged = FOREACH joined GENERATE
((delta::bogus_id is not null) ? delta::bogus_id: origs::bogus_id) as bogus_id,
((delta::date_cr is not null) ? delta::date_cr: origs::date_cr) as date_cr,
((delta::field1 is not null) ? delta::field1: origs::field1) as field1,
((delta::field2 is not null) ? delta::field2: origs::field2) as field2,
((delta::field3 is not null) ? delta::field3: origs::field3) as field3;
DESCRIBE merged;
DUMP merged;

As you can see from the combined output, we have the necessary 13 rows in the new dataset.

merged: {bogus_id: int,date_cr: chararray,field1: chararray,field2: chararray,field3: chararray}
(10,2014-09-16,oops,was,missed)
(11,2014-09-17,base,base,base)
(12,2014-09-17,base,CHANGED,base)
(13,2014-09-17,base,base,base)
(14,2014-09-18,base,CHANGED,base)
(15,2014-09-18,base,base,base)
(16,2014-09-18,base,CHANGED,base)
(17,2014-09-19,base,base,base)
(18,2014-09-19,base,base,base)
(19,2014-09-19,base,base,base)
(20,2014-09-20,base,base,base)
(21,2014-09-20,base,base,base)
(22,2014-09-20,base,base,base)

Good luck and happy Hadooping!
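As a cross-check of the pattern above, here is a small plain-Python model of the same full outer join followed by the "prefer the delta value, else keep the original" step. It is only a sketch for reasoning about the logic outside of Pig, not a replacement for the script: the data is copied from the two file listings above, and the function names are made up for illustration.

```python
# Rows copied from the origs.csv and delta.csv listings in the post.
ORIGS = """\
11,2014-09-17,base,base,base
12,2014-09-17,base,base,base
13,2014-09-17,base,base,base
14,2014-09-18,base,base,base
15,2014-09-18,base,base,base
16,2014-09-18,base,base,base
17,2014-09-19,base,base,base
18,2014-09-19,base,base,base
19,2014-09-19,base,base,base
"""

DELTA = """\
10,2014-09-16,oops,was,missed
20,2014-09-20,base,base,base
21,2014-09-20,base,base,base
22,2014-09-20,base,base,base
12,2014-09-17,base,CHANGED,base
14,2014-09-18,base,CHANGED,base
16,2014-09-18,base,CHANGED,base
"""


def parse(text):
    """Return {id: [id, date_cr, field1, field2, field3]} keyed by integer ID."""
    rows = {}
    for line in text.strip().splitlines():
        fields = line.split(",")
        rows[int(fields[0])] = fields
    return rows


def merge(origs, delta):
    """Full outer join on ID, then take each delta field unless it is missing."""
    missing = [None] * 5  # stands in for the nulls on the unmatched side
    merged = []
    for key in sorted(set(origs) | set(delta)):
        orig_row = origs.get(key, missing)
        delta_row = delta.get(key, missing)
        merged.append([d if d is not None else o
                       for d, o in zip(delta_row, orig_row)])
    return merged


merged_rows = merge(parse(ORIGS), parse(DELTA))
for row in merged_rows:
    print(",".join(row))
```

The per-field fallback mirrors the five conditional expressions in the FOREACH, so a row that exists only on one side comes through unchanged.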
03-03-2017
03:23 PM
2 Kudos
https://martin.atlassian.net/wiki/x/C4BRAQ shows an example (an old one that I did back with Hue) of loading the jar file into HDFS and then registering it, as in the example below.

REGISTER 'hdfs:///user/hue/shared/pig/udfs/exampleudf.jar';
DEFINE SIMPLEUPPER exampleudf.UPPER();
typing_line = LOAD '/user/hue/testData/typingText.txt' AS (row:chararray);
upper_typing_line = FOREACH typing_line GENERATE SIMPLEUPPER(row);
DUMP upper_typing_line;

Good luck & happy Hadooping!
02-26-2017
06:18 PM
For Pig and Hive implementations, I'd suggest you create a UDF. If this is new territory for you, here are some quick blog posts on creating (simple) UDFs for Pig and Hive: https://martin.atlassian.net/wiki/x/C4BRAQ and https://martin.atlassian.net/wiki/x/GoBRAQ. Good luck!
02-26-2017
06:13 PM
No current plans to do that. We have vendors that host the exams, whom we have to pay, and we cannot turn this into an expense for ourselves; sorry. That said, we are considering bundling in a voucher with our https://hortonworks.com/services/training/class/hadoop-essentials/ class offering, but again, that is only an internal discussion at this time. Best of luck when you take the test!!
02-20-2017
12:27 AM
I've got a working example at https://github.com/lestermartin/oss-transform-processing-comparison/tree/master/profiling#hive that shows column stats.
02-13-2017
02:36 PM
Good point. For my Sandbox testing, I decided to just use the steps provided in http://stackoverflow.com/questions/40550011/zeppelin-how-to-restart-sparkcontext-in-zeppelin to stop the SparkContext when I need to do something outside of Zeppelin. Not ideal, but it's working well enough for some multi-framework prototyping I'm doing.