Innovation Blog

[ANNOUNCE] Updates to the dbt adapters for Apache Hive, Apache Impala, Apache Spark (Livy), and Spark CDE

Cloudera Employee


This week’s release includes:

Adapters & docker images:

For CML/CDSW deployment

  • public.ecr.aws/d7w2o6p0/dbt-cml:1.2.0 (with Jupyter interface):

    Name                    Version
    dbt-core                1.2.0
    dbt-impala              1.2.0
    dbt-hive                1.2.0
    dbt-spark-cde           1.2.0
    dbt-spark-livy          1.2.0
    CML Base image          ml-runtime-jupyterlab-python3.9-standard:2022.04.1-b6
    .py scripts (Utility)   n/a

  • public.ecr.aws/d7w2o6p0/dbt-cdsw:1.2.0 (with workbench editor):

    Name                    Version
    dbt-core                1.2.0
    dbt-impala              1.2.0
    dbt-hive                1.2.0
    dbt-spark-cde           1.2.0
    dbt-spark-livy          1.2.0
    CML Base image          ml-runtime-workbench-python3.9-standard:2022.04.1-b6
    .py scripts (Utility)   n/a

 

Note: Both the CML and CDSW docker images work with CDSW, though the latter supports only the workbench editor when setting up jobs.

Supported infrastructure:

  • All adapters we have released support CDP Public Cloud LDAP with Knox
  • Our Impala and Hive adapters support CDP Private Cloud with Kerberos; we are testing our Spark adapters for the same setup
  • Both the Impala and Hive adapters support a local server without authentication

Deployment Options:

Form Factor | | On-Prem | Cloud | Cloud | Cloud | Cloud
 | | PvC Data Services | CDPOne | CDP PaaS Data Services | CDP PaaS Data Services | CDP PaaS Datahub
dbt SDLC requirements | | CDSW | CML | CML | CDE | CML
Tested artifacts | dbt core and adapters | Custom runtime | Custom runtime | Custom runtime | PyPI packages | Custom runtime
Authoring/testing | dbt develop | CDSW workbench session | CML Jupyter session | CML Jupyter session | VSCode/Other IDE | CML Jupyter session
Orchestration | dbt run | CDSW job | CML jobs | CML jobs | Compile dbt models to Airflow DAG OR run dbt run as custom bash operator | CML jobs
Collaboration | dbt docs serve | CDSW App | CML App | CML App | Flask server OR S3 | CML App
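
The orchestration options above come down to invoking dbt run from a job script or a bash operator. A minimal Python sketch of such a wrapper follows. It assumes the dbt executable is on the PATH of the job environment; the helper names build_dbt_command and run_dbt are hypothetical, and the directory arguments are placeholders rather than paths from the bundled images.

```python
import shutil
import subprocess
from typing import List, Optional


def build_dbt_command(project_dir: str, profiles_dir: str,
                      select: Optional[str] = None) -> List[str]:
    """Assemble the argument list for `dbt run` (helper name is illustrative)."""
    cmd = ["dbt", "run", "--project-dir", project_dir, "--profiles-dir", profiles_dir]
    if select:
        # Restrict the run to a subset of models.
        cmd += ["--select", select]
    return cmd


def run_dbt(project_dir: str, profiles_dir: str,
            select: Optional[str] = None) -> int:
    """Run dbt and return its exit code; raise if dbt is not installed."""
    if shutil.which("dbt") is None:
        raise FileNotFoundError("dbt executable not found on PATH")
    completed = subprocess.run(build_dbt_command(project_dir, profiles_dir, select))
    return completed.returncode
```

A job script could call run_dbt with its own project and profiles directories and fail the job when the returned exit code is non-zero.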

Past adapter releases:

dbt-hive adapter

  • 1.3.0 (Nov 24th, 2022)
  • 1.2.0 (Nov 4th, 2022)
    • The dbt-hive adapter now supports dbt-core 1.2.0.
  • 1.1.5 (Oct 28th, 2022)
  • 1.1.4 (Sep 23rd, 2022)
    • Added support for the Kerberos auth mechanism, along with an updated instrumentation package.
  • 1.1.3 (Sep 9th, 2022)
    • Added a macro to detect the hive version, to determine if the incremental merge is supported by the warehouse.
  • 1.1.2 (Sep 2nd, 2022)
    • The dbt seed command no longer adds extra quotes to strings, which was a known bug in the previous release. All warehouse properties (Cluster_by, Comment, external table, incremental materialization methods, etc.) are tested and should work smoothly with the adapter. Added instrumentation to the adapter.

  • 1.1.1 (August 23rd, 2022)
    • Cloudera released the first version of the dbt-hive adapter
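
The version check introduced in 1.1.3 can be illustrated with a short Python sketch. This is not the adapter's macro, only the general idea of gating the merge strategy on the detected Hive version. The minimum version used here (2.2, the release in which Hive documents MERGE as available) is an assumption, as are the function names; the adapter's actual threshold may differ.

```python
import re
from typing import Tuple

# Assumption: MERGE requires Hive 2.2 or later; the adapter's real threshold may differ.
ASSUMED_MIN_MERGE_VERSION = (2, 2)


def parse_hive_version(raw: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '3.1.3000.7.1.7.0-551'."""
    match = re.match(r"\s*(\d+)\.(\d+)", raw)
    if match is None:
        raise ValueError(f"unrecognised Hive version string: {raw!r}")
    return int(match.group(1)), int(match.group(2))


def supports_incremental_merge(raw_version: str,
                               minimum: Tuple[int, int] = ASSUMED_MIN_MERGE_VERSION) -> bool:
    """Return True when the reported Hive version is at or above the minimum."""
    return parse_hive_version(raw_version) >= minimum
```

Comparing (major, minor) tuples avoids the pitfalls of comparing version strings lexically, where "10.0" would sort before "2.2".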

dbt-impala adapter

  • 1.3.0 (Nov 18th, 2022)
  • 1.2.0 (Nov 2nd, 2022)
    • The dbt-impala adapter now supports dbt-core 1.2.0
  • 1.1.5 (Oct 28th, 2022)
  • 1.1.4 (Sep 30th, 2022)
    • Any dbt profile errors or connection issues encountered when running dbt commands now show a user-friendly message for the dbt-impala adapter. Added a user-agent string to improve instrumentation
  • 1.1.3 (Sep 17th, 2022)
    • Added support for append mode when the partition_by clause is used, along with an updated instrumentation package.

  • 1.1.2 (Aug 5th, 2022)
    • dbname is now optional in the profiles.yml file; updated a dependency in the README; the dbt-core version now updates automatically in setup.py

  • 1.1.1 (Jul 16th, 2022)
    • Bug fixes for a specific function

 

  • 1.1.0 (Jun 9th, 2022)
    • Migrated the adapter to dbt-core 1.1.0; added a timeout for the Snowplow endpoint to handle air-gapped environments

  • 1.0.6 (May 23rd, 2022)
    • Added support for insert_overwrite mode for incremental models and added instrumentation to the adapter

  • 1.0.5 (Apr 29th, 2022)
    • Added support for an EXTERNAL clause with table materialization and improved error handling for relation macros

  • 1.0.4 (Apr 1st, 2022)
    • Added support for the Kerberos authentication method and dbt-docs

  • 1.0.1 (Mar 25th, 2022)
    • Cloudera released the first version of the dbt-impala adapter

dbt-spark-cde adapter

  • 1.2.0 (Nov 14th, 2022)
  • 1.1.7 (Oct 28th, 2022)
  • 1.1.6 (Oct 18th, 2022)
    • Added a way to switch SSL certificate verification on or off for the CDE endpoint, along with an updated instrumentation package.
  • 1.1.5 (Oct 15th, 2022)

  • 1.1.4 (Sep 23rd, 2022)
    • Fixed an issue, found during internal testing, where the second run of an incremental model failed.
    • For improved debuggability, if a CDE job fails, the adapter creates a new log file inside the dbt log folder containing the stderr output. A sample file name looks like this: dbt-job-1663938116617-00000255.stderr.log

  • 1.1.3 (Sep 10th, 2022)
    • The detail for each query is now available in the logs (dbt.log)
    • Spark CDE session parameters can be provided via dbt's profiles.yml file as key: value pairs
    • Any dbt profile errors or connection issues encountered when running dbt commands now show a user-friendly message

  • 1.1.2 (Sep 2nd, 2022)
    • Added a timeout while polling job status to avoid hogging resources; cleaned up the code; if enabled, Spark events can also be viewed with a new method

  • 1.1.1 (Aug 26th, 2022)
    • Improved debugging process to track JobId, Query, and session time. Access stderr and stdout of CDE jobs in dbt logs.

  • 1.1.0 (Jul 21st, 2022)
    • Cloudera released the first version of the dbt-spark-cde adapter that supports connection to Cloudera Data Engineering backend using CDE APIs
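
The polling timeout added in 1.1.2 follows a common pattern, sketched below in Python. This is not the adapter's code: get_status stands in for whatever call fetches the job state, and the state names, default timeout, and polling interval are all assumptions chosen for illustration.

```python
import time
from typing import Callable


def wait_for_job(get_status: Callable[[], str],
                 timeout_s: float = 900.0,
                 interval_s: float = 5.0) -> str:
    """Poll get_status() until it reports a terminal state or the timeout elapses."""
    terminal_states = {"succeeded", "failed", "killed"}  # assumed state names
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            # Give up instead of polling forever and holding resources.
            raise TimeoutError(f"job still '{status}' after {timeout_s} seconds")
        time.sleep(interval_s)
```

Using time.monotonic() rather than wall-clock time keeps the deadline correct even if the system clock is adjusted while the job runs.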

dbt-spark-livy adapter

  • 1.3.0 (Nov 21st, 2022)
  • 1.2.0 (Nov 9th, 2022)
  • 1.1.8 (Oct 28th, 2022)
  • 1.1.7 (Oct 18th, 2022)
    • Added a way to switch SSL certificate verification on or off for the Livy endpoint.

  • 1.1.6 (Oct 15th, 2022)

  • 1.1.5 (Sep 30th, 2022)
    • Added Kerberos support, along with an updated instrumentation package.

 

  • 1.1.4 (Sep 17th, 2022)
    • Any dbt profile errors or connection issues encountered when running dbt commands now show a user-friendly message for the dbt-spark adapters
    • Spark session parameters can be provided via dbt's profiles.yml file as key: value pairs for the dbt-spark adapters
  • 1.1.3 (Jul 29th, 2022)
    • Added instrumentation to the adapter and updated setup.py as per upstream

  • 1.1.2 (Jul 1st, 2022)
    • Bug fix: an error is now shown when a SQL model has an issue

  • 1.1.1 (Jul 1st, 2022)
    • Added instructions for IDBroker mappings to the README file, along with minor changes to the setup and version files
       
  • 1.1.0 (Jun 17th, 2022)
    • Cloudera released the first version of the dbt-spark-livy adapter to support Livy-based connections to the Cloudera Data Platform
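
The SSL verification switch added in dbt-spark-livy 1.1.7 (and dbt-spark-cde 1.1.6) can be pictured with the Python standard library. The sketch below is an assumption about the general mechanism, not the adapter's implementation, and the parameter name verify_certificate is hypothetical; check the adapter README for the actual profile setting.

```python
import ssl


def make_ssl_context(verify_certificate: bool = True) -> ssl.SSLContext:
    """Build a client SSL context, optionally skipping certificate verification."""
    context = ssl.create_default_context()
    if not verify_certificate:
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

Turning verification off is usually reserved for test clusters with self-signed certificates, since it removes protection against man-in-the-middle attacks.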

Available resources:

Articles:

Bundled offering for CML & CDSW deployment:

GitHub repository:

Python Packages: