Member since: 08-15-2016
Posts: 189
Kudos Received: 63
Solutions: 22
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5662 | 01-02-2018 09:11 AM |
 | 3001 | 12-04-2017 11:37 AM |
 | 2146 | 10-03-2017 11:52 AM |
 | 21565 | 09-20-2017 09:35 PM |
 | 1601 | 09-12-2017 06:50 PM |
01-26-2017
06:38 PM
@Joshua Petree Don't forget to mark the question as answered once it has been answered.
10-12-2017
08:18 PM
@gnovak @tuxnet Would resource sharing still work if ACLs are configured for separate tenant queues? If the ACLs differ between Q1 and Q2, will elasticity and preemption still be supported? Could you also please share the workload/application details you used for these experiments? I am trying to run similar experiments on elasticity and preemption with the Capacity Scheduler. I am using a simple Spark word-count application on a large file, but with that application I am not able to get a feel for resource sharing among queues. Thanks in advance.
01-19-2017
03:13 AM
@Jasper Trident's HDFS state does provide an exactly-once guarantee, and de-duplication is taken care of. If a batch is replayed by Trident (due to failures), the Trident state implementation automatically removes duplicates from the current file by copying the data up to the last completed batch to another file. Since this operation involves a lot of data copying, make sure the data files are rotated at reasonable sizes with FileSizeRotationPolicy and at reasonable intervals with TimedRotationPolicy, so that recovery can complete within topology.message.timeout.secs.
01-25-2017
12:02 PM
With the help of the remarks by @Aaron Dossett I found a solution to this. Since Storm does not mark the HDFS file it is currently writing to, and .addRotationAction is not robust enough in extreme cases, I turned to a low-level solution. HDFS can report the files on a path that are open for write:
hdfs fsck <storm_hdfs_state_output_path> -files -openforwrite
Alternatively, you can list only the NON-open files on a path:
hdfs fsck <storm_hdfs_state_output_path> -files
The output is quite verbose, but you can use sed or awk to extract the closed/completed files from it, for example as sketched below. (The Java HDFS API has similar hooks; this is just the CLI-level solution.)
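A minimal sketch of such a filter, building on the second command above (the path is the same placeholder; the grep pattern assumes the usual fsck per-file line format of '<path> <size> bytes, ...' and may need tweaking for your Hadoop version):
$ hdfs fsck <storm_hdfs_state_output_path> -files 2>/dev/null | grep ' bytes, ' | awk '{print $1}'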
01-03-2017
10:58 PM
@Ted Yu Thanks. Ah yeah, it seems there is always a reason not to upgrade. Maybe this is another reason to contemplate the upgrade.
12-29-2016
04:42 PM
3 Kudos
Running a Hadoop client on Mac OS X and connecting to a Kerberized cluster poses some extra challenges.
I suggest using brew, the Mac package manager, to conveniently install the Hadoop package:
$ brew search hadoop
$ brew install hadoop
This installs the latest Apache Hadoop distribution (2.7.3 at the time of writing). Minor version differences relative to your HDP version will not matter.
You can test the installation by running a quick 'hdfs dfs -ls /'. Without further configuration, a local single-node 'cluster' is assumed.
We now have to point the client to the real HDP cluster. To do so, copy the full contents of the config files below from any HDP node:
Source:
/etc/hadoop/{hdp-version}/0/hadoop-env.sh
/etc/hadoop/{hdp-version}/0/core-site.xml
/etc/hadoop/{hdp-version}/0/hdfs-site.xml
/etc/hadoop/{hdp-version}/0/yarn-site.xml
Target:
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/hadoop-env.sh
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/core-site.xml
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/hdfs-site.xml
/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/yarn-site.xml
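One way to copy them over is sketched below; it assumes SSH access to a cluster node (hdp-node1 is a placeholder) and that you substitute your actual HDP version for {hdp-version}:
$ for f in hadoop-env.sh core-site.xml hdfs-site.xml yarn-site.xml; do scp "hdp-node1:/etc/hadoop/{hdp-version}/0/$f" /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/; done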
If we now try to access the Kerberized cluster, we get an error like the one below:
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:737)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 28 more
Sure, we need to kinit first so we do:
$ kinit test@A.EXAMPLE.COM
test@A.EXAMPLE.COM's password:
$ hdfs dfs -ls /
We still get the same error, so what is going on?
It makes sense to add the extra option -Dsun.security.krb5.debug=true to hadoop-env.sh now, to enable Kerberos debug log output:
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}"
Now the debug output provides some clues:
$ hdfs dfs -ls /
Java config name: null
Native config name: /Library/Preferences/edu.mit.Kerberos
Loaded from native config
16/12/29 17:02:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>KinitOptions cache name is /tmp/krb5cc_502
>> Acquire default native Credentials
default etypes for default_tkt_enctypes: 23 16.
>>> Found no TGT's in LSA
By default the HDFS client looks for Kerberos tickets at /tmp/krb5cc_502, where '502' is the uid of the relevant user. The other thing to look at is 'Native config name: /Library/Preferences/edu.mit.Kerberos'; this is where your local Kerberos configuration is sourced from. Another valid config source is '/etc/krb5.conf', depending on your local installation. You can mirror this local config from the /etc/krb5.conf file on any HDP node.
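A quick way to do that mirroring (a sketch; hdp-node1 is again a placeholder for any cluster node, and writing to /etc requires sudo):
$ scp hdp-node1:/etc/krb5.conf /tmp/krb5.conf
$ sudo cp /tmp/krb5.conf /etc/krb5.conf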
Now if we look at the default ticket cache on Mac OS X, it seems to point to another location:
$ klist
Credentials cache: API:XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXXX
Principal: test@A.EXAMPLE.COM
Issued Expires Principal
Dec 29 17:02:45 2016 Dec 30 03:02:45 2016 krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM
The pointer 'API:XXXXXX-XXXXX-XXXX-XXXXX' signals Mac OS X's memory-based credential cache for Kerberos. On a *nix distro it would typically say something like 'Ticket cache: FILE:/tmp/krb5cc_502'. The location of the ticket cache can be set with the environment variable KRB5CCNAME (FILE: / DIR: / API: / KCM: / MEMORY:), but that is beyond the scope of this article. This is why the HDFS client could not find any ticket.
Since the HDFS client looks for the ticket cache at '/tmp/krb5cc_502', we can simply make Mac OS X cache a validated Kerberos ticket there, like this:
$ kinit -c FILE:/tmp/krb5cc_502 test@A.EXAMPLE.COM
test@A.EXAMPLE.COM's password:
Or likewise with a keytab:
$ kinit -c FILE:/tmp/krb5cc_502 -kt ~/Downloads/smokeuser.headless.keytab ambari-qa-socgen_shadow@MIT.KDC.COM
Check the ticket cache the same way:
$ klist -c /tmp/krb5cc_502
Credentials cache: FILE:/tmp/krb5cc_502
Principal: test@A.EXAMPLE.COM
Issued Expires Principal
Dec 29 17:31:29 2016 Dec 30 03:31:29 2016 krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM
If you try to list HDFS again now, it should look something like this:
$ hdfs dfs -ls /user
Java config name: null
Native config name: /Library/Preferences/edu.mit.Kerberos
Loaded from native config
16/12/29 17:34:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>KinitOptions cache name is /tmp/krb5cc_502
>>>DEBUG <CCacheInputStream> client principal is test@A.EXAMPLE.COM
>>>DEBUG <CCacheInputStream> server principal is krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM
>>>DEBUG <CCacheInputStream> key type: 18
>>>DEBUG <CCacheInputStream> auth time: Thu Dec 29 17:31:29 CET 2016
>>>DEBUG <CCacheInputStream> start time: Thu Dec 29 17:31:29 CET 2016
>>>DEBUG <CCacheInputStream> end time: Fri Dec 30 03:31:29 CET 2016
>>>DEBUG <CCacheInputStream> renew_till time: Thu Jan 05 17:31:27 CET 2017
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL; PRE_AUTH;
>>>DEBUG <CCacheInputStream> client principal is test@A.EXAMPLE.COM
>>>DEBUG <CCacheInputStream> server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM@MIT.KDC.COM
>>>DEBUG <CCacheInputStream> key type: 0
>>>DEBUG <CCacheInputStream> auth time: Thu Dec 29 17:31:21 CET 2016
>>>DEBUG <CCacheInputStream> start time: null
>>>DEBUG <CCacheInputStream> end time: Thu Dec 29 17:31:21 CET 2016
>>>DEBUG <CCacheInputStream> renew_till time: null
>>> CCacheInputStream: readFlags()
>>> KrbCreds found the default ticket granting ticket in credential cache.
>>> Obtained TGT from LSA: Credentials:
client=test@A.EXAMPLE.COM
server=krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM
authTime=20161229163129Z
startTime=20161229163129Z
endTime=20161230023129Z
renewTill=20170105163127Z
flags=FORWARDABLE;RENEWABLE;INITIAL;PRE-AUTHENT
EType (skey)=18
(tkt key)=18
16/12/29 17:34:30 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Found ticket for test@A.EXAMPLE.COM to go to krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM expiring on Fri Dec 30 03:31:29 CET 2016
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for test@A.EXAMPLE.COM to go to krbtgt/A.EXAMPLE.COM@A.EXAMPLE.COM expiring on Fri Dec 30 03:31:29 CET 2016
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: main loop: [0] tempService=krbtgt/MIT.KDC.COM@A.EXAMPLE.COM
default etypes for default_tgs_enctypes: 23 16.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.Aes256CtsHmacSha1EType
>>> KdcAccessibility: reset
......
....S H O R T E N E D..
......
Found 4 items
drwxrwx--- - ambari-qa hdfs 0 2016-12-19 21:56 /user/ambari-qa
drwxr-xr-x - centos centos 0 2016-11-30 12:07 /user/centos
drwx------ - hdfs hdfs 0 2016-11-29 12:38 /user/hdfs
drwxrwxrwx - j.knulst hdfs 0 2016-12-29 13:40 /user/j.knulst
So directing your Kerberos tickets on Mac OS X to the expected ticket cache with the '-c' switch helps a lot.
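As an alternative to passing '-c' on every kinit, the KRB5CCNAME variable mentioned above can point all Kerberos tools at the same file-based cache. A minimal sketch, using the example principal and uid from this article; it assumes the cache stays FILE: based and that the variable is exported in every shell that runs the Hadoop client:
$ export KRB5CCNAME=FILE:/tmp/krb5cc_502
$ kinit test@A.EXAMPLE.COM
$ hdfs dfs -ls /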
12-30-2016
08:27 PM
Well, eventually I was able to solve all this. I did multiple things, so I don't know exactly which one solved it:
- Installed Mac_OS_X_10.4_10.6_Kerberos_Extras.dmg
- Upgraded Firefox from 49.x to 50.1.0
- Reset the value of 'network.negotiate-auth.trusted-uris' in Firefox about:config to '.field.hortonworks.com'
- Mapped all cluster nodes' short and long FQDNs in the local /etc/hosts, like: 1xx.2x.x3x.220 sg-hdp24-mst6b sg-hdp24-mst6b.field.hortonworks.com
The local Kerberos config at /etc/krb5.conf has to contain both REALMS:
[libdefaults]
default_realm = MIT.KDC.COM
[domain_realm]
.field.hortonworks.com = MIT.KDC.COM
field.hortonworks.com = MIT.KDC.COM
[realms]
FIELD.HORTONWORKS.COM = {
admin_server = xxxx.field.hortonworks.com
kdc = ad01.field.hortonworks.com
}
MIT.KDC.COM = {
admin_server = sg-hdp24-mst6b.field.hortonworks.com
kdc = sg-hdp24-mst6b.field.hortonworks.com
}
Both curl and WebHDFS calls from Firefox work now. After such a successful call, the local cache looks like this:
$ klist
Credentials cache: API:C1AAF010-41BB-4705-B4FB-239BC06DCF8E
Principal: jk@FIELD.HORTONWORKS.COM
Issued Expires Principal
Dec 30 20:34:42 2016 Dec 31 06:34:42 2016 krbtgt/FIELD.HORTONWORKS.COM@FIELD.HORTONWORKS.COM
Dec 30 20:34:49 2016 Dec 31 06:34:42 2016 krbtgt/MIT.KDC.COM@FIELD.HORTONWORKS.COM
Dec 30 20:34:49 2016 Dec 31 06:34:42 2016 HTTP/sg-hdp24-mst7@MIT.KDC.COM
So now the cross-realm trust (MIT --> AD) on the cluster is fully functional. One peculiar thing is that in Firefox the SPNEGO auth now works just as well for the destination 'http://sg-hdp24-mst7:50070/webhdfs/v1/?op=LISTSTATUS' as it does for 'http://sg-hdp24-mst7.field.hortonworks.com:50070/webhdfs/v1/?op=LISTSTATUS'. So somehow Firefox figured out it needed to use Kerberos to authenticate even without the domain indicator ('network.negotiate-auth.trusted-uris').
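For reference, the WebHDFS call with curl looks like the sketch below; it assumes your curl build has SPNEGO/GSS-API support and that you already hold a valid Kerberos ticket from kinit:
$ curl -i --negotiate -u : "http://sg-hdp24-mst7.field.hortonworks.com:50070/webhdfs/v1/?op=LISTSTATUS"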
02-09-2017
08:33 AM
FYI: "Multiple Forest" AD is supported, but "Cross Forest" AD is not. If you have "Cross Forest" AD, Ranger may be able to get users from the right branch but not groups, or vice versa.
11-25-2016
01:54 PM
8 Kudos
@Manoj Dhake Hi, Atlas and Falcon serve very different purposes, but there are some areas where they touch, and maybe that is where your confusion comes from.
Atlas:
- Really an 'atlas' to almost all of the metadata that is around in HDP: the Hive metastore, the Falcon repo, Kafka topics, HBase tables, etc. This single view on metadata enables some powerful search capabilities on top of it, including full-text search (based on Solr).
- Since Atlas has this comprehensive view on metadata, it can also provide insight into lineage; by combining Hive DDLs it can tell which table was the source for another table.
- Another core feature is that you can assign tags to any metadata entity in Atlas. So you can say that column B in Hive table Y holds sensitive data by assigning a 'PII' tag to it. An HDFS folder or an HBase column family can also be assigned a 'PII' tag. From there you can create tag-based policies in Ranger to manage access to anything tagged 'PII' in Atlas.
Falcon:
- More like a scheduling and execution engine for HDP components like Hive, Spark, HDFS distcp and Sqoop, used to move data around and/or process data along the way. In a way, Falcon is a much improved Oozie.
- Metadata of Falcon dataflows is actually sent to Atlas through Kafka topics, so Atlas knows about Falcon metadata too, and Atlas can include Falcon processes and their resulting metadata objects (tables, HDFS folders, flows) in its lineage graphs.
I know that in the docs both tools claim the term 'data governance', but I feel Atlas is more about that than Falcon is. It is not that clear what data governance actually is. With Atlas you can really apply governance by collecting all metadata, querying it and tagging it, while Falcon can execute processes that revolve around that by moving data from one place to another (and yes, Falcon moving a dataset from an analysis cluster to an archiving cluster is also about data governance/management).
Hope that helps.
10-19-2016
10:34 PM
Looks like this is not supported in the UI. You can do it through the REST API or the command line. Refer to https://falcon.apache.org/restapi/ExtensionSubmitAndSchedule.html, or update existing mirror jobs using https://falcon.apache.org/restapi/ExtensionUpdate.html. Add "removeDeletedFiles=true" as a POST parameter. Thanks!