Member since
β07-28-2021
1
Post
2
Kudos Received
0
Solutions
β07-29-2021
01:22 AM
2 Kudos
In our organization we have noticed that the native Power BI connector for Impala that is provided by Microsoft is not able to communicate with a Cloudera CDP cluster that is "behind" an Apache Knox reverse proxy (using Apache Knox means that among other things the transport mode 'HTTP' is needed for ODBC connections). While the different ways to connect via ODBC to Python, R, or other common end-user tools are extensively documented here on this site, we have so far not found any documentation of a way to set up Power BI with Direct Query capabilities in a cluster with Impala behind Apache Knox.
Using a custom connector enables Direct Query with Impala via Apache Knox
In order to achieve connectivity with Impala with Direct Query capabilities in Power BI (computations are sent to the cluster via ODBC), we created a fork of the Impala connector that Microsoft provides for Power BI and inserted the changes to the connection string that are needed in the setup I sketched above. Since the Cloudera Impala ODBC driver does not provide HTTP connectivity with Kerberos authentication, we use the HiveServer2 protocol and communicate with Impala via the Hive ODBC driver (needs to be installed on the machine running Power BI). I admit this might sound strange but so far it has worked quite well for us (see also here documentation of this setup: Configuring Client Access to Impala).
The custom connector we built for Power BI was first created as an open-source project in our free time using Visual Studio Community Edition and the Power Query SDK following this tutorial, and we released it to Github. The Knox topologies in your cluster need to be named in a certain (standard) way for the current build to work, but one could fix this in a future release. You can also fork our repo and build connectors with re-named topologies quite easily. Some more information can be found in the README on Github. Below we provide a small usage example.
Example Usage
Obtain the custom connector via Github (i.e. the built custom connector with the ending .mez) and place it in the folder [Documents]\Power BI Desktop\Custom Connectors and install the Hive ODBC driver (example, Cloudera ODBC Driver for Apache Hive)
Open Power BI Desktop and allow custom connectors (Connector extensibility in Power BI)
Restart Power BI Desktop
Select the new connector for Impala and set it up with the parameters of your cluster...
a) Select the beta version of the custom connector (listed next to the standard Impala connector)
b) Insert the hostname and the port number of your Knox Gateway and always (!) select Direct Query in order to use cluster resources for computations
c) Select the authentication method available via Knox (with Kerberos set up in a Knox Topology, use "Windows", otherwise use "Standard")
d) Select the tables you want to access using the Direct Query capabilities (the table shows a snippet of the EUREX Public Dataset obtained via Deutsche BΓΆrse Public Dataset.π
Summary
Now you are ready to create dashboards and other apps in Power BI Desktop using data and compute capabilities of Impala in a cluster setting with an Apache Knox Reverse Proxy for communication with Cloudera CDP.
Disclaimer: This article is contributed by an external user. The steps may not be verified by Cloudera and may not be applicable for all use cases and may be very specific to a particular distribution. Please follow with caution and at your own risk. If needed, raise a support case to get the confirmation.
... View more