About AceWinner

AceWinner · ‎02-15-2018

Found the problem a couple of minutes after posting this. After reviewing the installation logs: the RPM package did not install successfully, but the install was reported as successful:

    [...]
    Running Transaction
      Installing : ambari-server-2.6.1.0-143.x86_64 1/1
    /var/tmp/rpm-tmp.SHIahM: line 27: //var/lib/ambari-server/install-helper.sh: Permission denied
    [...]

==> Turns out the /var mount point on this very specific machine (configuration management issue) was mounted noexec. Fixed the mount, reinstalled the RPM and I can proceed now.
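For anyone hitting the same symptom, a quick way to confirm it is to look at the mount options for /var. Below is a minimal sketch, assuming a Linux host where /proc/mounts is readable. `has_noexec` is a hypothetical helper name, not something shipped with Ambari.

```shell
#!/bin/sh
# has_noexec MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: exits 0 if MOUNTPOINT is listed with the "noexec"
# option, 1 otherwise. MOUNTS_FILE defaults to /proc/mounts (assumption:
# Linux; fields are device, mount point, fs type, options, ...).
has_noexec() {
    mounts_file="${2:-/proc/mounts}"
    awk -v mp="$1" '
        $2 == mp { opts = $4 }   # keep the last entry for this mount point
        END { exit ((("," opts ",") ~ /,noexec,/) ? 0 : 1) }
    ' "$mounts_file"
}
```

Usage would be something like `has_noexec /var && echo "/var is mounted noexec"`. If that is the case, `mount -o remount,exec /var` (as root) should lift the restriction until the next remount; the permanent fix belongs wherever the mount options are managed (/etc/fstab or the configuration management tool).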

AceWinner · ‎02-15-2018

Getting a weird issue: "ambari-server setup" fails instantly when running it for the first time, with this error:

    line 84: line: unbound variable

Context: I'm trying to install a new HDP 2.6 cluster onto physical hardware (for a POC). Running CentOS 6.9, fresh OS install. Following this doc: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.1.0/bk_ambari-installation/content/set_up_the_ambari_server.html

I added the Ambari repo, installed the Ambari package (ambari-server-2.6.1.0-143.x86_64), then followed the rest of the documentation, but I can't find the issue. It's not the first time I spool up such a cluster, but it is the first time with this specific version. Any ideas what could be the issue?

Line 84 is a simple version calculation:

    numversion=$(( 10 * $majversion + $minversion))

Not sure where this is coming from.
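The "unbound variable" wording is what bash prints when a script running with `set -u` (nounset) expands a variable that was never assigned, so something feeding that calculation came back empty. A defensive version of this kind of calculation could look like the sketch below. The function is hypothetical (it is not the code shipped in the Ambari scripts) and assumes a version string of the form MAJOR.MINOR.

```shell
#!/bin/sh
set -u

# Hypothetical helper: turn "MAJOR.MINOR[.x]" into 10 * MAJOR + MINOR,
# failing with a readable message instead of an "unbound variable" abort.
numversion() {
    version="${1:-}"
    case "$version" in
        *.*) ;;
        *) echo "cannot parse version: '$version'" >&2; return 1 ;;
    esac
    majversion="${version%%.*}"      # text before the first dot
    rest="${version#*.}"             # text after the first dot
    minversion="${rest%%.*}"         # up to the next dot, if any
    for part in "$majversion" "$minversion"; do
        case "$part" in
            ''|*[!0-9]*) echo "cannot parse version: '$version'" >&2; return 1 ;;
        esac
    done
    echo $(( 10 * majversion + minversion ))
}
```

With this, `numversion 6.9` prints 69, while an empty or malformed input returns a non-zero status with a message on stderr.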

AceWinner · ‎08-10-2017

Was about to delete my question (if that's even possible), but I managed to isolate the issue after triple-checking everything, so I figured I'd post it here. Turns out the 3 accounts in AD did not have the SPN (servicePrincipalName) set correctly. Changed them to "HTTP/myhost.mydomain.org" and everything works great now. Lesson learned: don't eyeball the correctness of properties; copy-paste them into an editor and check them there.
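The "don't eyeball it" lesson can be scripted. Here is a small sketch, assuming you have copied the servicePrincipalName value out of AD into a shell variable. `spn_matches` is a hypothetical helper; it compares the value byte for byte against "HTTP/<host>" and dumps both strings character by character on a mismatch, so stray spaces or look-alike characters become visible.

```shell
#!/bin/sh
# spn_matches HOST ACTUAL_SPN
# Hypothetical helper: succeeds only if ACTUAL_SPN is exactly "HTTP/HOST".
spn_matches() {
    expected="HTTP/$1"
    actual="$2"
    if [ "$expected" = "$actual" ]; then
        return 0
    fi
    # Show every character, including whitespace, of both values.
    echo "expected: $(printf '%s' "$expected" | od -An -c)" >&2
    echo "actual:   $(printf '%s' "$actual" | od -An -c)" >&2
    return 1
}
```

As a complementary check from a Linux client with a valid ticket, `kvno HTTP/myhost.mydomain.org@DOMAIN.ORG` asks the KDC for a service ticket for that principal, which should fail while the SPN is not registered.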

AceWinner · ‎08-10-2017

Hi,

First, a couple of things that work:

- NiFi cluster running on 3 nodes (running Apache upstream v1.1.2) on CentOS 6
- Login identity provider is Kerberos and works A1 using the username & password fields in the UI or through the API.
- Once I get an API token using username & password, I can query the API without any issues.
- KDC is Active Directory
- Service is using nifi.kerberos.service.principal as "serviceaccount@DOMAIN.ORG" with the corresponding keytab.

Now I'm trying a little POC where I want a script to use the API to interact with some of my flows. Because I don't want to store a username & password for the script to use, I wanted to set up SPNEGO and just use a plain "kinit" with a keytab and fetch the API access token using something like:

    curl --negotiate -X POST -v -u : https://myhost.mydomain.org:8989/nifi-api/access/kerberos

Steps I've taken:

1- Created 3 new accounts in AD with the logon name (principal) named "HTTP/myhost.mydomain.org", one for each of my machines.
2- Created a keytab for each of those machines by using ktutil. Tested this with a "kinit HTTP/myhost.mydomain.org@DOMAIN.ORG" and they work.
3- Set up the 3 SPNEGO properties in nifi.properties:

    nifi.kerberos.spnego.principal=HTTP/myhost.mydomain.org@DOMAIN.ORG
    nifi.kerberos.spnego.keytab.location=[the_location_of_the_key_tab]
    nifi.kerberos.spnego.authentication.expiration=12 hours

After a service restart, I try to run the curl command mentioned above and I get the following error:

    curl --negotiate -X POST -v -u : https://myhost.mydomain.org:8989/nifi-api/access/kerberos
    * About to connect() to myhost.mydomain.org port 8989 (#0)
    * Trying [ip address]... connected
    * Connected to myhost.mydomain.org ([ip address]) port 8989 (#0)
    * Initializing NSS with certpath: sql:/etc/pki/nssdb
    * CAfile: [redacted] CApath: none
    * NSS: client certificate not found (nickname not specified)
    * SSL connection using [redacted]
    * Server certificate:
    * [redacted]
    > POST /nifi-api/access/kerberos HTTP/1.1
    > User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.21 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2
    > Host: myhost.mydomain.org:8989
    > Accept: */*
    >
    < HTTP/1.1 401 Unauthorized
    < Date: Thu, 10 Aug 2017 17:10:41 GMT
    * gss_init_sec_context() failed: : Server not found in Kerberos database
    WWW-Authenticate: Negotiate
    < Content-Type: text/plain
    < Content-Length: 0
    < Server: Jetty(9.3.9.v20160517)
    <
    * Connection #0 to host myhost.mydomain.org left intact
    * Closing connection #0

I was wondering what's missing. DNS and reverse DNS are set up properly and everything else is working fine (like HUE, also using SPNEGO with the same method). Any idea?
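For reference, the intended script flow (kinit from a keytab, then a SPNEGO-authenticated POST to fetch the token) can be sketched as below. This is only an outline under assumptions: the keytab path and client principal passed to `fetch_token` are placeholders, and the function names are hypothetical. The host, port and endpoint are the ones from the post.

```shell
#!/bin/sh
# Host and port as used in the post above.
NIFI_HOST="myhost.mydomain.org"
NIFI_PORT="8989"

# Build the Kerberos token endpoint URL for a given host and port.
token_url() {
    echo "https://$1:$2/nifi-api/access/kerberos"
}

# fetch_token KEYTAB PRINCIPAL
# Hypothetical helper: obtains a ticket from the keytab, then lets curl
# negotiate SPNEGO ("-u :" means "use the ticket cache, no password").
# Prints whatever the endpoint returns (expected: the access token).
fetch_token() {
    keytab="$1"
    principal="$2"
    kinit -kt "$keytab" "$principal" || return 1
    curl --silent --fail --negotiate -u : -X POST \
        "$(token_url "$NIFI_HOST" "$NIFI_PORT")"
}
```

A caller would then do something like `token=$(fetch_token /path/to/script.keytab scriptuser@DOMAIN.ORG)` (both arguments are made-up examples) and pass the token on later API calls in an `Authorization: Bearer` header, the same way as with a token obtained through username & password.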

AceWinner · ‎11-24-2016

@Matt I'm not 100% sure swapping was the problem here. I modified all of the flows to avoid getting big queues and bumped the swap threshold to 40000, and we're still experiencing disk growth plus the "unknown file" messages on restart. I did notice something weird: some of the flows have their "error" or "failure" relationship routed back to themselves instead of being auto-terminated. I'm not sure whether this is good practice, or whether it could contribute to the problem.

AceWinner · ‎11-22-2016

We were definitely swapping. We had a bunch of queues in excess of 40-50K. In all cases, the culprit was a merge processor trying to build buckets that were too big and waiting too long. I've modified the flows and stacked 2 merge processors one behind the other (the first one has a max of 1000 items, the 2nd one does the actual merging to our specific size). I'll monitor the situation and see if the problem occurs again. I'm down to 7-8K flow files in total (from 450K+).

AceWinner · ‎11-21-2016

Here's the actual error message. We get tons of them (more than 100k) during the restart...

    2016-11-21 20:41:43,056 INFO [main] o.a.n.c.repository.FileSystemRepository Found unknown file [nifipath]/content_repository/39/1479172392813-1092647 (5845 bytes) in File System Repository; removing file
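To put a number on "tons of them", the messages can be counted and their sizes summed straight from the log. A minimal sketch, assuming the log lines have exactly the format shown above; `unknown_file_stats` is a hypothetical helper and takes the log file path(s) as arguments.

```shell
#!/bin/sh
# unknown_file_stats LOGFILE...
# Hypothetical helper: counts "Found unknown file" messages and sums the
# "(N bytes)" figure each one reports.
unknown_file_stats() {
    awk '
        /Found unknown file/ {
            count++
            if (match($0, /\([0-9]+ bytes\)/)) {
                # Skip the opening parenthesis; "+ 0" keeps the leading number.
                bytes += substr($0, RSTART + 1, RLENGTH - 2) + 0
            }
        }
        END { printf "%d files, %.0f bytes\n", count, bytes }
    ' "$@"
}
```

Run against the application log after a restart, this gives a rough idea of how much orphaned content the repository was holding.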

AceWinner · ‎11-21-2016

Thanks @Matt,

Clearing the queues does not seem to help. I'm restarting one of the nodes right now; I'll be able to share the exact message when it boots and discovers the files that should not be there. It sounds a lot like we're hitting the bug. Is there a timeline for the release of 1.1.0? Reading the mailing lists, it seems to be really close to RC.

Thanks
Phil

AceWinner · ‎11-21-2016

Hello,

First time posting here, so sorry if this is in the wrong section / wrong format.

First, some background: we started a POC using NiFi 1.0.0. We're using a 3-node cluster with limited resources (this is a POC...). Each node has 16 cores, 32 GB of RAM and 2 volumes: a RAID 1 volume for the OS and a RAID 10 volume on 2.5in spindles. I know this is not a recommended setup, but the content repo, the provenance repo, the flow files, basically everything, is on the same RAID 10 array. The disks are heavily used right now. Content repo archiving is disabled.

Now here's the thing: every 2-3 days, the disk fills up. Right now, the UI reports that we have 450,000 flow files (3.21 GB) in queue. I would expect to have roughly the same amount of data in the nifi/content_repository folder, but that's not the case: on one of the nodes, the content_repository folder is 73 GB. I can't tell how big it is on the 2 other nodes since the "du -h" operation is still running after 10 minutes, but using "df" I estimate around 700-800 GB on each.

When we restart one of the nodes, it can take hours while the process cleans the content repo and spams the log with a bunch of "unknown file" messages.

Any ideas / suggestions? This is running on CentOS 6.

Thanks

Here's the relevant config section:

    nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
    nifi.content.repository.directory.default=./content_repository
    nifi.content.repository.archive.max.retention.period=1 hours
    nifi.content.repository.archive.max.usage.percentage=1%
    nifi.content.repository.archive.enabled=false
    nifi.content.repository.always.sync=false
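The comparison described above (queued size reported by the UI versus what is actually on disk) comes down to two standard commands. A minimal sketch with two hypothetical wrapper functions; the directory argument is whatever nifi.content.repository.directory.default resolves to on the node.

```shell
#!/bin/sh
# dir_size_kb DIR
# Size of the directory tree in KB. Walks every file, so it can run for a
# long time on a repository holding hundreds of GB.
dir_size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# fs_used_kb DIR
# Used space, in KB, on the filesystem that holds DIR. Returns immediately,
# but includes everything else stored on that filesystem.
fs_used_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $3 }'
}
```

Since everything shares one array here, the `df` figure is only an upper bound for the content repository; the `du` figure is the one to compare with the queued size shown in the UI.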

AceWinner · ‎10-21-2013

Found the problem: turns out that the agent uses the localhost connection to do its stuff. I added a simple rule:

    -A INPUT -s 127.0.0.1/32 -m conntrack --ctstate NEW -j ACCEPT

And it fixed the problem.
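A quick way to check whether a saved ruleset already lets loopback traffic in is to grep it for an ACCEPT rule on 127.0.0.1 or on the lo interface. The sketch below assumes the rules are in iptables-save format (for example /etc/sysconfig/iptables on RHEL/CentOS 6, which is an assumption about the setup); `has_loopback_accept` is a hypothetical helper.

```shell
#!/bin/sh
# has_loopback_accept RULES_FILE
# Hypothetical helper: succeeds if RULES_FILE contains an INPUT rule that
# accepts traffic from 127.0.0.1 or arriving on the lo interface.
has_loopback_accept() {
    grep -Eq -- '^-A INPUT .*(-s 127\.0\.0\.1(/32)? |-i lo ).*-j ACCEPT' "$1"
}
```

This only tells you the rule exists; it still has to sit above any catch-all REJECT or DROP rule in the INPUT chain to have an effect.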

Last Visited	‎10-21-2013 11:27 AM

Member Since	‎10-18-2013 02:49 PM
Posts	11

Cloudera Community
