03-21-2019 01:09 PM - last edited on 03-21-2019 02:22 PM by cjervis
I'm working with Trifacta on updating an integration with Cloudera Navigator, and am running into a couple of strange issues that I was hoping you all could help me with.
We have two test environments for CDH 6: a local test environment running the proof-of-concept install of Cloudera Enterprise 6.0.1, and a cluster running a fully distributed install of Cloudera Enterprise 6.0.1. Both environments are running the same build of 6.0.1, namely "(#610811 built by jenkins on 20181002-0044 git: c1c9ad537961941820867e39b0a76feb7653f9be)".
In both cases, we are trying to publish three very simple entities as the result of an execution of Trifacta's batch processing engine: two operation entities and one operation-execution entity. Each operation entity has two dataflow relationships (one source, one target) to files or directories on HDFS; it uses EndPointProxy objects to refer to those files, since Navigator will create those entities itself when it crawls. The operation-execution entity has an instance-of relationship to one of the operation entities. The output HDFS entity for the first operation entity is the same as the input HDFS entity for the second, which should result in a nice lineage graph linking the two operations to one another and to their respective inputs and outputs, e.g.
input_file --(dataflow relationship)--> operation A --(dataflow relationship)--> intermediate_file --(dataflow relationship)--> operation B --(dataflow relationship)--> HDFS Output
operation A --(instance-of relationship)--> operation-execution C
(Forgive my mediocre ASCII-graph drawing skills, but hopefully you get the idea).
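To make the expected shape concrete, here is a minimal Python sketch of the dataflow chain in the diagram above. It has nothing to do with the Navigator SDK itself; the node labels are just the placeholder names from the diagram (not the identities in the real payloads), and `downstream` is a hypothetical helper written for this illustration. It shows that, because the intermediate file is shared, walking forward from the input should reach both operations and the final output.

```python
from collections import defaultdict

# Placeholder labels copied from the diagram; these are NOT real Navigator identities.
DATAFLOW_EDGES = [
    ("input_file", "operation A"),
    ("operation A", "intermediate_file"),
    ("intermediate_file", "operation B"),
    ("operation B", "HDFS Output"),
]


def downstream(edges, start):
    """Return every node reachable from `start` by following dataflow edges."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    seen = {start}
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                order.append(child)
                stack.append(child)
    return order


if __name__ == "__main__":
    # Everything that should appear downstream of the input in the lineage view.
    print(downstream(DATAFLOW_EDGES, "input_file"))
```

If any one of the four dataflow relationships is missing, the walk stops early, which is what a broken lineage graph would look like.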
These entities are created via version 2.2 of the Navigator Java SDK. The three entities are created via two calls to the Navigator SDK, which in turn issues two HTTP POST requests to Navigator itself (specifically, the endpoint "/api/v13/metadata/plugin"). I've captured the payloads to both those requests here: https://gist.github.com/alexras/071ff8a4552d27b5f98c60deb7e292b7 . Every time we submit requests to Navigator, those requests succeed.
In the local test environment, the new entities are visible in the Navigator UI almost immediately, while the relationships take on the order of tens of minutes to become completely visible. This is consistent with what we expect to see, given our (admittedly limited) understanding of Navigator's internals and its asynchronous processing of updates.
In the clustered environment, however, things behave a little more strangely. Entities show up fairly reliably after only a couple of minutes of delay. However, despite waiting several hours for relationships to completely persist, we almost always end up with entities that are missing their input relationships, output relationships, or both.
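One way to state the symptom precisely is to compare the dataflow relationships we expect against whatever relationships are actually visible, and report which side is missing for each operation. The sketch below is plain Python and does not call Navigator; `missing_relationships` is a hypothetical helper, the labels are the placeholder names from the diagram, and the `observed_edges` list is made-up sample input for illustration, not captured from either environment.

```python
def missing_relationships(expected, observed):
    """Map each operation to the sides ("input", "output") missing from `observed`.

    `expected` maps an operation label to its (source, target) pair.
    `observed` is an iterable of (from, to) dataflow edges that are visible.
    Operations with both relationships present are left out of the result.
    """
    observed = set(observed)
    report = {}
    for operation, (source, target) in expected.items():
        missing = []
        if (source, operation) not in observed:
            missing.append("input")
        if (operation, target) not in observed:
            missing.append("output")
        if missing:
            report[operation] = missing
    return report


# Placeholder labels from the diagram; NOT real Navigator identities.
EXPECTED = {
    "operation A": ("input_file", "intermediate_file"),
    "operation B": ("intermediate_file", "HDFS Output"),
}

if __name__ == "__main__":
    # Made-up example: only one of the four expected edges is visible.
    observed_edges = [("input_file", "operation A")]
    print(missing_relationships(EXPECTED, observed_edges))
```

Recording this kind of report per run, alongside the time elapsed since the POST requests, would make it easier to tell whether the missing relationships are random or follow a pattern.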
I've tried configuring the SDK to use v9 and v13 of the API. The fact that this works in the proof-of-concept version but not in the clustered install initially made me think that it was a problem with the cluster itself, but I can find nothing in the logs that immediately indicates what might be wrong, or even what might be different in terms of how the two installations of Navigator are behaving. I've tried turning autocommit on, but that seems to have had no effect. I've found the part of the debug UI where I can modify log levels for different classes (and increased the log level on org.apache.cxf.interceptor to capture the payloads in the gist above), but have no idea which other classes I should bump the logging level for in order to further debug this.
I'm a bit at a loss as to how to proceed here. Any help you could give me would be greatly appreciated.