11-17-2022
12:00 AM
Overview

Cloudera has implemented dbt adapters for Hive, Spark, and Impala. In addition to providing the adapters, Cloudera also offers dbt capabilities to our on-prem customers (CDP PvC Base). Our solution satisfies our customers' stringent security and privacy requirements while providing an easy-to-use, turnkey solution for practitioners. In this document, we show how users can run dbt on YARN without having to worry about anything else. Everything is packaged as a single .tar.gz file. If you are new to dbt, refer to this article to learn more about dbt.

Prerequisites
- The cluster should have Python 3.7+.
- The user should be able to SSH into a gateway machine.
- The user should have access to the git repo with the dbt models and the ability to create PRs in that repo with changes. If you are creating your own repo as an analyst, refer to Getting started with dbt Core.
- The user SSH key can be configured for access to the git repo.

Compatibility
- CentOS 7 x86_x64 or later
- CDP 7.1.8 or later

Setting up dbt on YARN

We have created a dbt deployment package to simplify the setup process for dbt on YARN. The package only needs to be deployed on the cluster gateway machine; nothing has to be installed locally on any of the worker nodes, because the deployment takes care of distributing the runtime.

Step 1. Log in to your base cluster

SSH to your gateway machine by running the following command:

ssh <gateway machine>

Sample terminal output

Step 2. Download and install the dbt deployment package

Download the cloudera-dbt-deployment package:

mkdir </path/to/working/dir>
cd </path/to/working/dir>
wget https://github.com/cloudera/cloudera-dbt-deployment/releases/download/Prerelease/cloudera-dbt-deployment-1.2.0.tar.gz

Sample terminal output

Create a Python venv and install the package:

python3 -m venv <Virtual Environment Name>
source <Virtual Environment Name>/bin/activate
python3 -m pip install cloudera-dbt-deployment-1.2.0.tar.gz

Sample terminal output

Step 3. Download the Python packages for the dbt runtime (for air-gapped environments)

We have packaged dbt core and all the dbt adapters we support into a single deployable package. This package is updated on a regular basis whenever the underlying Python packages are updated. Download the package and upload it to HDFS:

wget https://github.com/cloudera/cloudera-dbt-deployment/releases/download/Prerelease/Centos7_x86_x64_dbt_dependencies.tar.gz
hdfs dfs -copyFromLocal Centos7_x86_x64_dbt_dependencies.tar.gz hdfs://path/to/dependencies

Sample terminal output

At this point, we are ready to work with specific dbt projects.

Set up YARN for a specific dbt project

Step 1. Clone the dbt project

We will use a sample dbt project from this repo. If you would like to start a new dbt project from scratch, refer to Getting started with dbt Core | dbt Developer Hub. Clone the sample project by running the following command:

git clone https://github.com/cloudera/dbt-hive-example.git

Sample terminal output

Step 2. Create the yarn.env file

Create the environment variables for the dbt project by creating a yarn.env file. This file is used by the yarn_dbt command and must be created in the folder where the dbt project lives:

cd </path/to/dbt-project>
vi yarn.env

Make sure to set the following variables in the file: DEPENDENCIES_PACKAGE_PATH_HDFS, DEPENDENCIES_PACKAGE_NAME, YARN_JAR, DBT_SERVICE_USER, DBT_PROJECT_NAME, YARN_RM_URI, DBT_HEADLESS_KEYTAB, DBT_HEADLESS_PRINCIPAL.

Sample:

DEPENDENCIES_PACKAGE_PATH_HDFS=/tmp
DEPENDENCIES_PACKAGE_NAME=Centos7_x86_x64_dbt_dependencies.tar.gz
YARN_JAR=/opt/cloudera/parcels/CDH/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar
DBT_SERVICE_USER=hive
DBT_PROJECT_NAME=dbt_hive_demo
YARN_RM_URI=http://hajmera-1.vpc.cloudera.com:8088
DBT_HEADLESS_KEYTAB=/cdep/keytabs/systest.keytab
DBT_HEADLESS_PRINCIPAL=systest@VPC.CLOUDERA.COM
CURRENT_DBT_USER=systest
DBT_DOCS_PORT=7777
YARN_CONTAINER_MEMORY=2048
YARN_TIMEOUT=1200000
APPLICATION_TAGS="cia-user-ha-testing.dbt-debug"

Refer to the list below to understand the environment variables:

- DEPENDENCIES_PACKAGE_PATH_HDFS (e.g. /tmp): Path in HDFS containing the tarball with all the Python packages needed by dbt for an offline install. Downloaded in the previous section.
- DEPENDENCIES_PACKAGE_NAME (e.g. Centos7_x86_x64_dbt_dependencies.tar.gz or Ubuntu18_x86_x64_dbt_dependencies.tar.gz): Name of the package in HDFS for the offline install.
- YARN_JAR (e.g. /opt/cloudera/parcels/CDH/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar): Distributed Shell JAR path for YARN job execution. This needs to be changed based on the CDP Base version.
- DBT_SERVICE_USER (e.g. hive): Service user with access to YARN resources. This user's keytab is distributed through Cloudera SCM and can be found under /var/run/cloudera-scm-agent/process/.
- DBT_PROJECT_NAME (e.g. dbt_hive_demo): Project name.
- DBT_HEADLESS_KEYTAB (e.g. the keytab for the systest user at /cdep/keytabs/systest.keytab): A headless keytab corresponding to a POSIX user that can start services without prompting for a password (e.g. hdfs, hbase, ...).
- DBT_HEADLESS_PRINCIPAL (e.g. systest@ROOT.HWX.SITE): Kerberos principal for the above headless keytab.
- CURRENT_DBT_USER (e.g. systest): Logged-in user in the session, with a valid keytab.
- DBT_DOCS_PORT (e.g. 7777): Port where the dbt docs are hosted.
- YARN_CONTAINER_MEMORY (e.g. 2048): Memory allocation for the YARN container, in MB.
- YARN_TIMEOUT (e.g. 1200000): Timeout for the YARN container, in milliseconds.
- APPLICATION_TAGS (e.g. cia-user-ha-testing.dbt-debug): Prefix/identifier for the YARN application; visible in the YARN UI.
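Because yarn_dbt reads yarn.env when it launches, it can be worth sanity-checking the file before submitting anything to YARN. The snippet below is a minimal, illustrative Python sketch; it is not part of the cloudera-dbt-deployment package, and the script name and the set of "required" keys are assumptions based on the sample above.

# check_yarn_env.py -- illustrative sketch, not part of cloudera-dbt-deployment
import sys

REQUIRED_KEYS = [
    "DEPENDENCIES_PACKAGE_PATH_HDFS", "DEPENDENCIES_PACKAGE_NAME", "YARN_JAR",
    "DBT_SERVICE_USER", "DBT_PROJECT_NAME", "YARN_RM_URI",
    "DBT_HEADLESS_KEYTAB", "DBT_HEADLESS_PRINCIPAL",
]

def parse_env(path="yarn.env"):
    """Parse simple KEY=VALUE lines, ignoring blanks and comments."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"')
    return values

if __name__ == "__main__":
    env = parse_env()
    missing = [k for k in REQUIRED_KEYS if not env.get(k)]
    if missing:
        sys.exit("yarn.env is missing values for: " + ", ".join(missing))
    print("yarn.env looks complete:", ", ".join(sorted(env)))

Run it from the folder that contains yarn.env before invoking yarn_dbt.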
Step 3. Create the dbt profiles.yml file

We use the Kerberos method to connect to the query engines, so the profiles.yml file should reflect that. Edit profiles.yml to match your warehouse configuration:

cd /path/to/dbt-models
vi profiles.yml

Sample:

dbt_hive_demo:
  outputs:
    dbt_hive_demo:
      auth_type: kerberos
      host: hajmera-1.vpc.cloudera.com
      port: 10000
      schema: dbt_hive_demo
      threads: 2
      type: hive
      use_http_transport: false
      use_ssl: false
      kerberos_service_name: hive
  target: dbt_hive_demo

Step 4. Run kinit to get the authentication token

Provide an authentication token for dbt by running the following command:

kinit -kt </path/to/keytab/file> <username>

Sample terminal output
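Before involving YARN or dbt at all, you can optionally confirm that Kerberos authentication to HiveServer2 works, using the same impyla-based check described in our troubleshooting guide. A minimal sketch, assuming impyla is installed (pip install impyla), a valid ticket from the kinit above, and the host and port from the sample profile (replace with your own values):

from impala.dbapi import connect

# Requires a valid Kerberos ticket (see kinit above) and `pip install impyla`.
conn = connect(
    host="hajmera-1.vpc.cloudera.com",  # replace with your HiveServer2 host
    port=10000,
    auth_mechanism="GSSAPI",            # Kerberos
    kerberos_service_name="hive",
    use_ssl=False,
    use_http_transport=False,
)
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
print(cursor.fetchall())
cursor.close()
conn.close()

If this prints your databases, the connection parameters in profiles.yml are correct.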
Work with the dbt project using yarn_dbt

The example repo contains a sample dbt project called dbt_hive_demo. Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples of the following operations and commands supported with the YARN deployment:

- Connection test: dbt debug
- Loading reference data: dbt seed
- Data validation: dbt test
- Running transformations: dbt run
- Accessing documentation: dbt docs

More commands and a detailed description of each command can be found here.

Note: yarn_dbt commands need to be run in the same folder where yarn.env is located.

Step 1. Test the connection to the warehouse

To ensure we have configured our profile correctly, test the connection by running the following command:

yarn_dbt debug

Sample terminal output

Step 2. Load reference data

Load the reference dataset into the warehouse by running the following command:

yarn_dbt seed

Sample terminal output

Step 3. Validate reference data

Our seeds are configured with a couple of tests; you can read more about them here. We also have a custom test in dbt_hive_demo/tests/generic/test_length.sql, which checks the character length of a column. Our reference data includes ISO Alpha2 and Alpha3 country codes, and we know these columns should always be 2 or 3 characters long, respectively. To ensure that our reference data is of high quality, we can use dbt to test these assumptions and report the results. We expect the Alpha2 and Alpha3 columns to be the correct lengths and no fields to be null. Test the dataset by running the following command:

yarn_dbt test

Sample terminal output

Step 4. Run the transformations

We have three sets of models in this demo project:

- raw: Our raw models make use of Sources. This is data already in our database that dbt needs to refer to; it is the fake data we loaded earlier. The raw models are defined in models/raw/covid.
- staging: Our staging models use the source() method to refer to the Sources defined in our raw models. The staging models are intermediate views created over the raw data to handle some basic type conversion. They are materialized as views, and we don't expect end users to query them. The staging models are defined in models/staging/covid.
- mart: Our mart models use the ref() method to refer to the staging models and the reference seeds we created using dbt. The staging views handle most of the typecasting logic, but we do some renaming here to make the models easier for users to understand. These models are materialized as tables, as this gives greater performance for user queries; incremental models can make building them even more performant. The mart models are defined in models/mart/covid.

Execute all the transformations by running the following command:

yarn_dbt run

Sample terminal output

Step 5. Host and access the documentation

To generate and host the documentation on the gateway machine, run the following command:

yarn_dbt docs

Sample terminal output

To access the documentation, copy the application ID from the output above and run the following command:

yarn logs -applicationId <application-id> | grep 7777

Sample terminal output

Copy the host address, append the :7777 port, and paste the resulting URL into your browser. Sample URL: hajmera-3.vpc.cloudera.com:7777

Hosted documentation looks like this: (screenshot)
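Each yarn_dbt invocation runs as a YARN application, so besides the yarn logs command you can also look an application up programmatically through the ResourceManager REST API at the YARN_RM_URI configured in yarn.env. A minimal sketch, assuming the requests package and a hypothetical application ID copied from the output above; on a Kerberized cluster the endpoint may additionally require SPNEGO authentication (e.g. via requests-gssapi):

import requests

RM_URI = "http://hajmera-1.vpc.cloudera.com:8088"   # YARN_RM_URI from yarn.env
APP_ID = "application_1234567890123_0001"           # hypothetical application ID

# ResourceManager REST API: GET /ws/v1/cluster/apps/{appid}
resp = requests.get(f"{RM_URI}/ws/v1/cluster/apps/{APP_ID}", timeout=30)
resp.raise_for_status()
app = resp.json()["app"]

print(app["name"], app["state"], app["finalStatus"])
print("Tracking UI:", app.get("trackingUrl"))

This prints the application state and a tracking URL, which is handy when scripting around yarn_dbt docs or checking whether a run finished.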
Accessing logs in the YARN UI

To debug issues further, you can view the logs in the YARN UI:

- Find the YARN application ID in the terminal output.
- Go to the YARN UI for the cluster; the URL looks similar to this:
- Click on the Applications tab and locate the application ID:
- Click on the application ID, and then click on Logs:

Schedule dbt models

dbt models can be scheduled using any workflow management platform, such as Apache Airflow or Apache Oozie; an illustrative Airflow sketch follows below.
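As an illustration of such a schedule, here is a minimal Airflow DAG sketch that shells out to yarn_dbt from the directory containing yarn.env. The DAG id, schedule, and paths are assumptions for this example and not part of the Cloudera packages; it assumes Airflow 2.x and that the worker runs on the gateway machine where yarn_dbt is installed.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical location of the folder that contains yarn.env
# (immediately outside the dbt project).
PROJECT_DIR = "/home/analyst/dbt-hive-example"

with DAG(
    dag_id="yarn_dbt_hive_demo",        # hypothetical DAG name
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_seed = BashOperator(
        task_id="dbt_seed",
        bash_command=f"cd {PROJECT_DIR} && yarn_dbt seed",
    )
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {PROJECT_DIR} && yarn_dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"cd {PROJECT_DIR} && yarn_dbt test",
    )

    # seed -> run -> test, mirroring the manual steps above
    dbt_seed >> dbt_run >> dbt_test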
Conclusion

In this document, we have shown the different requirements that need to be met to support the full software development life cycle of dbt models, and how this option satisfies each of them:

- Have a dev setup where different users can work in an isolated way:
  - Make changes to models: Yes; each user can log into the gateway machine, having checked out their own branch of the given dbt project codebase.
  - Test changes: Yes.
  - See logs of tests: Yes.
  - Update docs in the models and see docs: Yes.
- Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments: Yes; a simple git push, delegating to an external CI/CD system.
- See logs in stage/prod of the dbt runs: Yes.
- See dbt docs in stage/prod: Yes.
- Support isolation across different users using it in dev: Yes; each session is isolated and uses the user's own login credentials.
- Be able to upgrade the adapters or core seamlessly: Cloudera will publish new GitHub assets under GitHub releases.
- Ability to run dbt run regularly to update the models in the warehouse: Yes, via any scheduler such as Apache Airflow.

Facing issues while working with dbt? We have a troubleshooting guide to help you resolve them. You can also reach out to innovation-feedback@cloudera.com if you have any questions.
11-08-2022
03:12 PM
Introduction Cloudera recently announced the open-source dbt adapters for all the engines in Cloudera Data Platform (CDP)—, Apache Impala, and Apache Spark, with added support for Apache Livy and Cloudera Data Engineering. In addition to providing the adapters, Cloudera is offering a turn-key solution to be able to manage the end-to-end software development life cycle (SDLC) of dbt models. This solution is available on all clouds, as well as on-prem deployments of CDP. In this article, we show how our customer data teams can streamline their data transformation pipelines in the Cloudera Data Platform and deliver high-quality data that their business can trust. Our solution satisfies the stringent security and privacy requirements of our customers while providing an easy-to-use turnkey solution for practitioners. dbt’s end-to-end software development life cycle A key advantage of using dbt is that it provides a framework for analysts to easily follow software engineering best practices for their SQL transformation pipelines. Instead of the typical ad hoc scripting resulting in brittle pipelines, analysts can leverage engineering best practices to build robust, tested, and documented pipelines that produce high-quality data sets that can be trusted by the business. Figure 1: Software development life cycle of dbt models As shown in Figure 1, a dbt user’s workflow typically consists of the following phases: Develop - Multiple analysts can independently clone the project to modify or write new SQL models and push changes back to the main branch. Test - While making changes to models, analysts can include tests to validate data quality. These tests can also be run in production. Deploy - dbt encourages a gitops flow for deployment. So, when a change needs to be deployed, it is first expected to be committed into git. This process acts as a forcing function that simplifies setting up a CI/CD pipeline that automatically deploys changes that are committed into git. Operate - Once deployed, operating dbt in production includes orchestration and viewing documentation. Monitor/debug - One of the key aspects of operating dbt in production is monitoring for failures, data quality test failures, and debugging any issues. The logs generated by dbt have a lot of this information and are consumable even by analysts who are not super technical. What it takes to provide an end-to-end solution for dbt In order for any customer to use dbt core and the adapters to build their transformation pipelines, a lot of scaffolding needs to be available. Cloudera has identified the requirements of such a scaffolding to enable secure and simple workflows for analysts and has provided guides to bring up such a scaffolding natively within the Cloudera Data Platform. Multiple Environments dbt makes it easy to maintain separate production and development environments through the use of targets within a profile. So, any deployment of dbt needs to support multiple environments, to support the different steps in the software development lifecycle, that are isolated from each other. For example, Dev environment to be used by analysts to edit and test their models and documentation Stage environment to be used for automated testing of committed model and documentation changes Prod environment to be used to build and run models to generate production data sets. 
Consumers of the models will typically access these production data sets and corresponding documentation Isolated development workspaces Different analysts should be able to make changes and test models without affecting the work being performed by other analysts. Also, different analysts may have access to different models. So, analysts should be able to only make changes to models that they have access to. CI/CD pipeline Any deployment of dbt should allow for changes made to models and documentation to be automatically tested and promoted to production environments. This automation requires the availability of a system to manage workflows. Orchestration Typically models need to be refreshed or updated on a regular basis whenever the underlying source data is updated. So, any deployment of dbt should have a mechanism to run a dbt model refresh or update on a schedule, or based on events like Kafka. Easy access to documentation dbt provides a way to generate documentation for a dbt project and render it as a website. Documentation of dbt models helps downstream consumers discover and understand the datasets which are curated for them. Though this documentation can be accessed locally, it can also be hosted remotely and can be accessible to others on the team. Any deployment of dbt should have a way to access this documentation. Web-based UI to develop, & deploy in one place In order for analysts to build self-serving data pipelines, it’s necessary that development, testing, and deployment of SQL models via a CI/CD pipeline can be done from a single interface, without any dependencies on different teams or tools. Any deployment of dbt should offer a single application experience to the analysts. Easy access to logs dbt generates logs that are helpful in debugging issues in models as well as investigating performance problems. Any deployment of dbt should allow for these logs to be readily accessible to analysts without requiring any complicated setup. Monitoring & alerting Any dbt deployment should have monitoring and alerting capabilities for dbt jobs. This lets the IT team know if there are any issues or job failures. Managed software artifacts dbt Core is an open-source project. It is updated from time to time with new features and performance improvements. In addition, Common Vulnerabilities and Exposures (CVE) need to be fixed. Any dbt deployment should offer a seamless way to upgrade core and adapters without having to manage software artifact versions. Cloudera solution: Cloudera has provided a managed software package of dbt core and all adapters for CDP engines that is maintained and supported by Cloudera. Watch dbt working in CDP Public Cloud. Demo video: dbt on Cloudera Data Platform The dbt integration with CDP is brought to you by Cloudera’s Innovation Accelerator, a cross-functional team that identifies new industry trends and creates new products and partnerships that dramatically improve the lives of our Cloudera customers’ data practitioners. Learn more with Cloudera’s simple guides to deploy and run dbt in all form factors supported by Cloudera for a truly hybrid solution. CDP Public Cloud via Cloudera Machine Learning CDP Private Cloud via Cloudera Data Science Workbench To learn more, contact us at innovation-feedback@cloudera.com.
11-02-2022
01:35 PM
This article lists issues that may arise while using dbt adapters and their resolution steps. Issue: Error recreating a dbt model in Impala which was originally created using dbt Hive Issue: Permission issue for table access Issue: Spark2 read managed tables issue via dbt Issue: SSL certificate failure or Invalid certificate error Issue: Permission issue while creating table Issue: Latency while executing dbt commands with dbt-spark-cde adapter Issue: Missing yarn.env file in current working directory Issue: Command '['tar', '-zcf', '/tmp/dbt-workspace.tar.gz', 'xxx']' returned non-zero exit status 2 Issue: Troubleshooting Connection with Warehouse Please reach out to us at innovation-feedback@cloudera.com if you see any issues not listed here. Issue: Error recreating a dbt model in Impala which was originally created using dbt Hive Description: Recreating a dbt model in Impala which was originally created using dbt Hive produces an error similar to MetaException: Cannot change stats state for a transactional table <table-name> without providing the transactional write state for verification (new write ID 1, valid write IDs null; current state {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}; new state {"BASIC_STATS":"true"} Resolution: Solved this issue by setting the following two properties to false: set hive.stats.autogather=false set hive.txn.stats.enabled=false In the Hive services and restarting. Follow the guide here on how to set properties for the hive service: Setting Hive Configuration Overrides Screenshot (of how the properties are set for the Hive service) Additional Reference: org.apache.hadoop.hive.ql.exec.StatsTask. MetaException(message:Cannot change stats state for a transactional table Issue: Permission issue for table access Description: While accessing Managed extended table you may see the error: org.apache.spark.sql.AnalysisException: Spark has no access to table `XXX`.`XXXXX`. Clients can access this table only if they have the following capabilities: CONNECTORREAD,HIVEMANAGEDINSERTREAD,HIVEMANAGEDINSERTWRITE,HIVEMANAGESTATS,HIVECACHEINVALIDATE,CONNECTORWRITE. This table may be a Hive-managed ACID table, or require some other capability that Spark currently does not implement; at org.apache.spark.sql.catalyst.catalog.CatalogUtils$.throwIfNoAccess(ExternalCatalogUtils.scala:280) at org.apache.spark.sql.hive.HiveTranslationLayerCheck$SanonfunSapply$1.apply0rElse(HiveTranslationLayerStrategies.scala:109) at org.apache.spark.sql.hive.HiveTranslationLayerCheck$SanonfunSapply$1.apply0rElse(HiveTranslationLayerStrategies.scala:85) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$SanonfunSresolveOperatorsDown$1$Sanonfun$2.apply(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$SanonfunSresolveOperatorsDown$1$Sanonfun$2.apply(AnalysisHelper.scala:108) at org.apache.spark.sql.catalyst.trees.CurrentOriginS.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.plans. logical.AnalysisHelper$SanonfunSresolveOperatorsDown$1.apply(AnalysisHelper.scala:107) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$SanonfunSresolveOperatorsDown$1.apply(AnalysisHelper.scala:106) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelperSclass.resolveOperatorsDown(AnalysisHelper.scala:106) at org.apache. spark. sql. catalyst.plans. logical.LogicalPlan. 
resolveOperatorsDown(LogicalPlan.scala: 29) at org.apache.spark.sql.catalyst.plans. logical.AnalysisHelper$SanonfunSresolveOperatorsDown$1$SanonfunSapply$6.apply(AnalysisHelper.scala:113) Resolution: To solve this issue, we need the following configurations. Property Example Value Description spark.sql.extensions com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension Extension needed for auto-translate to work spark.kryo.registrator com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator For using kryo serialization. spark.sql.hive.hwc.execution.mode spark spark.driver.extraClassPath local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar to use HWC(Hive Warehouse Connector) jars. spark.executor.extraClassPath local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar The hive Warehouse connector jar should be supplied along with the configs to be able to access managed tables. You can do the above by either: Specify properties in the spark-defaults.conf file in the form of property value. passing in config values through a spark-session Additional Reference: Introduction to HWC | CDP Private Cloud Issue: Spark2 read managed tables issue via dbt Description: If a managed table is created in Hive and is accessed with Spark-livy, it may throw permission errors shown below: 12:48:21 Completed with 1 error and 0 warnings: 12:48:21 12:48:21 Runtime Error in model test2 (models/staging/test2.sql) 12:48:21 Database Error 12:48:21 Error while executing query: 12:48:21 Spark has no access to table `certified_data`.`test_hive_streaming2`. Clients can access this table only if 12:48:21 they have the following capabilities: CONNECTORREAD,HIVEFULLACIDREAD,HIVEFULLACIDWRITE,HIVEMANAGESTATS,HIVECACHEINVALIDATE,CONNECTORWRITE. 12:48:21 This table may be a Hive-managed ACID table, or require some other capability that Spark 12:48:21 currently does not implement; 12:48:21 12:48:21 Done. 
PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1 Resolution: This issue can be resolved by adding the properties in spark configuration: spark.kryo.registrator=com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator spark.sql.extensions=com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension spark.sql.hive.hwc.execution.mode=spark spark.driver.extraClassPath=local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar spark.executor.extraClassPath=local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar Additional Reference: Introduction to HWC | CDP Private Cloud Issue: SSL certificate failure or Invalid certificate error Description: While using the spark-cde and spark-livy adapter with a warehouse that uses a self-sign certificate, while running dbt debug and other dbt commands you may get an error as shown below : certificate verify failed: self-signed certificate Resolution: To disable SSL certificate verification, in the profile file for the adapter you can set the Additional variable as shown below: verify_ssl_certificate: false Issue: Permission issue while creating table Description: dbt user while running dbt command may see the error as shown below, this error occurs when user doesn't have CREATE TABLE permissions: dbt run 11:57:43 Running with dbt=1.1.0 11:57:43 Found 1 model, 0 tests, 0 snapshots, 0 analyses, 208 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics 11:57:43 11:59:02 Encountered an error: Runtime Error Runtime Error Database Error Error while executing query: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [airflow.bigdata] does not have [WRITE] privilege on [[hdfs://dlhdfs/warehouse/tablespace/external/hive/test_dbt.db, hdfs://dlhdfs/warehouse/tablespace/external/hive/test_dbt.db/]]); Resolution: Check the dbt logs to understand what permissions are needed for different tables, and get those permissions to the user. Additional Reference: What privileges does my database user need to use dbt? | dbt Developer Hub Issue: Latency while executing dbt commands with dbt-spark-cde adapter Description: While running dbt commands using a dbt-spark-cde adapter, we have observed unpredictable latency while completing the execution of dbt commands. Reason: We did a bunch of work on improving the performance of the cde adapter and have added instrumentation to understand where the latencies are. There are several factors that cause latencies in CDE Load on cluster impacts when the spark job is picked up/scheduled for run Use of polling mechanisms in intervals to check job status (success/failure) after a job is submitted. Use of a sleep interval(40 secs currently) to check the output of a job run after job success/failure. This is where CDE aggregates logs and results. No feature in CDE to have a long-running spark session. Each single spark SQL query from dbt creates a new spark session. Running any SQL query in CDE has the following steps: Generate a job name. Create a resource Upload the resource. 
Submit the job Run the job Fetch the job results from log Clean the resources Explanation with an example: Taking GitHub - cloudera/dbt-spark-cde-example as an example dbt project, here’s the breakdown of the CDE’s steps for the two most common dbt commands: dbt debug This command executes the following SQL query: select 1 as id Breakdown as per CDE’s steps: Step name Time taken (seconds) Log 1 Generate a job name. 0 07:00:46.778564 Job created with id: dbt-job-1667545246778-00000557 for SQL statement: select 1 as id 2 Create a resource 1 07:00:46.779102 Create resources: files 07:00:47.177818 Done create resource: files 3 Upload the resource. 0 07:00:47.178924 Upload resource: SQL resource: dbt-job-1667545246778-00000557-1667545247179.sql 07:00:47.598818 Done upload resources: SQL resource: dbt-job-1667545246778-00000557-1667545247179.sql 07:00:47.599993 Upload resource: py resource: dbt-job-1667545246778-00000557-1667545247179.py 07:00:47.928185 Done upload resource: py resource: dbt-job-1667545246778-00000557-1667545247179.py 4 Submit the job 1 07:00:47.929512 Submit job 07:00:48.346363 Done submit job 5 Run the job (and wait for the job to change state from start to succeed/fail) 62 07:00:48.347828 Run job 07:00:49.663841 Done run job 07:00:49.664948 Get job status 07:00:49.941149 Current Job status: starting 07:00:49.942419 Done get job status 07:00:49.942628 Sleep for 30 seconds 07:01:19.945953 Done sleep for 30 seconds 07:01:19.946701 Get job status 07:01:20.244582 Current Job status: starting 07:01:20.246297 Done get job status 07:01:20.246591 Sleep for 30 seconds 07:01:50.250127 Done sleep for 30 seconds 07:01:50.251736 Get job status 07:01:50.544115 Current Job status: succeeded 07:01:50.545305 Done get job status 6 Fetch the job results from log (after waiting for the job logs to be aggregated) 82 07:01:50.545694 Get job output 07:01:50.546037 Sleep for 40 seconds 07:02:30.547689 Done sleep for 40 seconds 07:02:31.098306 Done get job output 07:02:31.121608 Get spark events 07:02:31.139144 Log spark job events 07:02:31.157137 Sleep for 40 seconds 07:03:11.160204 Done sleep for 40 seconds 07:03:12.058962 Done log spark job events 07:03:12.059515 SparkListenerBlockManagerAdded 07:01:10.347000 07:03:12.059692 SparkListenerApplicationStart 07:01:04.568000 07:03:12.059858 SparkListenerExecutorAdded 07:01:14.476000 07:03:12.060023 SparkListenerBlockManagerAdded 07:01:14.587000 07:03:12.060748 Done get spark events 7 Clean the resources 0 07:03:12.060899 Delete job 07:03:12.427473 Done delete job 07:03:12.428597 Delete resource 07:03:12.932400 Done delete resource Total 146 NOTE: Typically, dbt run executes all the models inside the dbt project. For sake of simplicity, we have only executed a single model (i.e. /models/staging/covid/stg_covid__cases.sql) dbt run We will consider running dbt run on a single model by running the command dbt run --select /models/staging/covid. This command executes the following SQL statements, as dbt-core tries to build a DAG before it starts to build the model. The first 5 statements are the “fixed cost” for any dbt run. If there are multiple models, there will only be one invocation of the first 5 statements. The latency of the dbt run command is then determined by the number of models and the time taken for each model. 1. show databases 2. show table extended in spark_cde_demo_reference like '*' 3. show table extended in spark_cde_demo_mart_covid like'*' 4. show table extended in spark_cde_demo like '*' 5. 
show table extended in spark_cde_demo_staging_covid like '*' 6. create or replace view spark_cde_demo_staging_covid.stg_covid__cases as select cast(cast(unix_timestamp(date_rep,"dd/MM/yyyy") as timestamp) AS DATE) as date_rep, cast(cases as BIGINT) as cases, cast(deaths as BIGINT) as deaths, geo_id from spark_cde_demo.raw_covid_cases For each statement, the breakdown as per CDE’s steps can be obtained from the dbt logs. Let’s consider the execution of the last query in CDE, the breakdown as per CDE’s steps: Step name Time taken (seconds) Log 1 Generate a job name. 0 07:27:39.914554 Job created with id: dbt-job-1667546859913-00000083 for SQL statement: /* {"app": "dbt", "dbt_version": "1.1.2", "profile_name": "dbt_spark_cde_demo", "target_name": "cia_dbt_spark_cde_demo", "node_id": "model.spark_cde_demo.stg_covid__cases"} */ create or replace view spark_cde_demo_staging_covid.stg_covid__cases as select cast(cast(unix_timestamp(date_rep,"dd/MM/yyyy") as timestamp) AS DATE) as date_rep, cast(cases as BIGINT) as cases, cast(deaths as BIGINT) as deaths, geo_id from spark_cde_demo.raw_covid_cases 2 Create a resource 1 07:27:39.915082 Create resources: files 07:27:40.212270 Done create resource: files 3 Upload the resource. 0 07:27:40.213107 Upload resource: SQL resource: dbt-job-1667546859913-00000083-1667546860213.sql 07:27:40.559058 Done upload resources: SQL resource: dbt-job-1667546859913-00000083-1667546860213.sql 07:27:40.560169 Upload resource: py resource: dbt-job-1667546859913-00000083-1667546860213.py 07:27:40.913187 Done upload resource: py resource: dbt-job-1667546859913-00000083-1667546860213.py 4 Submit the job 1 07:27:40.913892 Submit job 07:27:41.222846 Done submit job 5 Run the job (and wait for the job to change state from start to succeed/fail) 62 07:27:41.223755 Run job 07:27:42.979183 Done run job 07:27:42.980947 Get job status 07:27:43.251816 Current Job status: starting 07:27:43.252802 Done get job status 07:27:43.253016 Sleep for 30 seconds 07:28:13.256820 Done sleep for 30 seconds 07:28:13.257510 Get job status 07:28:13.564024 Current Job status: starting 07:28:13.564966 Done get job status 07:28:13.565149 Sleep for 30 seconds 07:28:43.570355 Done sleep for 30 seconds 07:28:43.570959 Get job status 07:28:43.847729 Current Job status: succeeded 6 Fetch the job results from log (after waiting for the job logs to be aggregated) 82 07:28:43.849548 Get job output 07:28:43.849723 Sleep for 40 seconds 07:29:23.855103 Done sleep for 40 seconds 07:29:24.259266 Done get job output 07:29:24.259961 Get spark events 07:29:24.260120 Log spark job events 07:29:24.260284 Sleep for 40 seconds 07:30:04.262971 Done sleep for 40 seconds 07:30:05.093419 Done log spark job events 07:30:05.093951 SparkListenerBlockManagerAdded 07:28:03.787000 07:30:05.094159 SparkListenerApplicationStart 07:27:58.196000 07:30:05.094355 SparkListenerExecutorAdded 07:28:08.036000 07:30:05.094548 SparkListenerBlockManagerAdded 07:28:08.138000 07:30:05.095971 SparkListenerApplicationEnd 07:28:11.648000 07:30:05.096198 Done get spark events 7 Clean the resources 0 07:30:05.096386 Delete job 07:30:05.383030 Done delete job 07:30:05.384246 Delete resource 07:30:05.696494 Done delete resource Total 146 Conclusion: Since each query follows the aforementioned 7 steps, to execute each query in the above example CDE will take similar time. 
This means to execute a single dbt model on average it takes 6 * 146(average time to execute each query) = 876 seconds or 14.6 minutes If the dbt model is complex or if multiple dbt models are involved in a single dbt command this latency increases accordingly. Issue: Missing yarn.env file in current working directory Description: While running dbt commands on yarn, we have seen this error where commands are not issued from correct directory which is containing yarn.env file. Assuming your dbt project directory path is ~/my-project-dir/sample-dbt-project . Your working directory to issue yarn commands is ~/my-project-dir/ Resolution: Create your yarn.env file in the location ~/my-project-dir Issue: Command '['tar', '-zcf', '/tmp/dbt-workspace.tar.gz', 'xxx']' returned non-zero exit status 2 Description: While running dbt commands on yarn, we have seen this error where commands are not issued from correct directory which is containing yarn.env file. Assuming your dbt project directory path is ~/my-project-dir/sample-dbt-project . Your working directory to issue yarn commands is ~/my-project-dir/ Resolution: Check the current working directory where you are running the command. It should be immediately outside of the dbt project . In this case ~/my-project-dir/ Check the DBT_PROJECT_NAME variable in yarn.env. In this case it should be DBT_PROJECT_NAME=sample-dbt-project and not xxx Issue: Troubleshooting Connection with Warehouse Description: Usually dbt fails to connect to warehouse in case of wrong values in profiles.yml A sample profiles file(profiles.yml) to connect to hive warehouse is: dbt_hive_demo: outputs: dev: auth_type: kerberos host: tsenapati-3.tsenapati.root.hwx.site port: 10000 schema: dbt_hive_demo threads: 4 type: hive use_http_transport: false use_ssl: false kerberos_service_name: hive target: dev Resolution: If connection fails try troubleshooting without yarn/dbt involved. Follow the steps below: Installing the package: pip install impyla Here’s a sample script to test connection to hive warehouse for profiles file above, Create a test file connectiontest.py with following content: from impala.dbapi import connect conn = connect( host = "host.name", port = portnumber, auth_mechanism = "GSSAPI", use_ssl = False, use_http_transport = False, kerberos_service_name="hive") cursor = conn.cursor() cursor.execute("SHOW DATABASES") res = cursor.fetchall() print(res) cursor.close() conn.close() Modify values with actual values from your host and run the command: python3 connectiontest.py If the above script succeeds and shows sample databases, then your connection parameters are correct. Populate the same in profiles.yml accordingly.
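Related to the Spark managed-table issues above: the same properties can also be supplied programmatically when reproducing the problem outside dbt. A minimal PySpark sketch, assuming the HWC assembly jar path shown earlier (adjust the version to your parcel); note that classpath settings are generally only honored when set before the driver JVM starts, e.g. in spark-defaults.conf or via spark-submit --conf:

from pyspark.sql import SparkSession

# HWC assembly jar from the CDP parcel; version as shown in the tables above.
HWC_JAR = ("local:/opt/cloudera/parcels/CDH/lib/hive_warehouse_connector/"
           "hive-warehouse-connector-assembly-1.0.0.7.1.8.0-801.jar")

spark = (
    SparkSession.builder
    .appName("hwc-managed-table-check")
    .config("spark.sql.extensions",
            "com.qubole.spark.hiveacid.HiveAcidAutoConvertExtension")
    .config("spark.kryo.registrator",
            "com.qubole.spark.hiveacid.util.HiveAcidKyroRegistrator")
    .config("spark.sql.hive.hwc.execution.mode", "spark")
    # These two normally belong in spark-defaults.conf or spark-submit --conf:
    .config("spark.driver.extraClassPath", HWC_JAR)
    .config("spark.executor.extraClassPath", HWC_JAR)
    .enableHiveSupport()
    .getOrCreate()
)

# Try the query that previously failed with the "Spark has no access to table" error.
spark.sql("SELECT * FROM certified_data.test_hive_streaming2 LIMIT 10").show()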
09-15-2022
04:35 PM
Overview Cloudera has implemented dbt adapters for Hive, Spark (CDE and Livy), and Impala. In addition to providing the adapters, Cloudera is offering a turn-key solution to be able to manage the end-to-end software development life cycle (SDLC) of dbt models. This solution is available in all clouds as well as on-prem deployments of CDP. It is useful for customers who would not like to use dbt Cloud for security reasons or the lack of availability of adapters in dbt cloud. We have identified the following requirements for any solution that supports the end-to-end SDLC of data transformation pipelines using dbt. Have multiple environments Dev Stage/Test Prod Have a dev setup where different users can do the following in an isolated way: Make changes to models Test changes See logs of tests Update docs in the models and see docs Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments See logs in stage/prod of the dbt runs See dbt docs in stage/prod Orchestration: Ability to run dbt run regularly to update the models in the warehouse, or based on events(Kafka) Everything should be part of one application(tool) like CDP or CDSW Alerting and Monitoring, if there is a failure how IT team will know that Any deployment for dbt should also satisfy the following Convenient for analysts - no terminal/shells/installing software on a laptop. Should be able to use a browser. Support isolation across different users using it in dev Support isolation between different environments (dev/stage/prod) Secure login - SAML Be able to upgrade the adapters or core seamlessly Vulnerability scans and fixing CVEs Able to add and remove users for dbt - Admin privilege Cloudera Data Platform has a service, CDSW, which offers users the ability to build and manage all of their machine learning workloads. The same capabilities of CDSW can also be used to satisfy the requirements for the end-to-end SDLC of dbt models. In this document, we will show how an admin can set up the different capabilities in CDSW like workspaces, projects, sessions, and runtime catalogs so that an analyst can work with their dbt models without having to worry about anything else. First, we will show how an admin can set up CDSW runtime catalog within a workspace, with the Cloudera provided container with dbt-core and all adapters supported by Cloudera CDSW project for stage/prod (i.e., automated/non-development) environments. Analysts create their own projects for their development work. CDSW jobs to run the following commands in an automated way on a regular basis git clone dbt debug dbt run dbt doc generate CDSW apps to serve model documentation in stage/prod Next, we will show how an analyst can build, test, and merge changes to dbt models by using CDSW project to work in isolation without being affected by other users CDSW user sessions - for interactive IDE of dbt models git to get the changes reviewed and pushed to production Finally, we will show how by using CDSW all of the requirements for the end-to-end software development lifecycle of dbt models. Administrator steps Prerequisites A CDSW environment is required to deploy dbt with CDSW, refer Installing Cloudera Data Science Workbench on CDP to create an environment Administrator should have access to the CDP-Base Control Plane and admin permissions to CDSW Access to a git repository with basic dbt scaffolding (using proxies if needed). 
If such a repository does not exist, follow the steps in Getting started with dbt Core Access to custom runtime catalog (using proxies if needed) Machine user credentials - user/pass or Kerberos - for stage and production environments. See CDP machine user on creating machine users for hive/impala/spark. Note: The document details a simple setup within CDSW where we will use one workspace for dbt for all of the dev/stage/prod environments. use one project each for stage/prod and one per user to provide access isolation use one session per user/analyst for their development, testing, and to push PRs Step 1. Create and enable a custom runtime in CDSW with dbt In the workspace screen, click on “Runtime Catalog” to create a custom runtime with dbt. Step 1.1. Create a new runtime environment Select the Runtime Catalog from the side menu, and click Add Runtime button: Use the following URL for Docker Image to Upload: http://public.ecr.aws/d7w2o6p0/dbt-cdsw:1.1.11 Click on Validate. When validation succeeds, click on “Add to Catalog”. The new runtime will show up in the list of runtimes Step 1.2. Set runtime as default for all new sessions In the workspace’s side menu, select Site Administration and scroll down to the Engine Images section. Add a new Engine image by adding the following values Field Value Description Description dbt-cdsw Respository:Tag http://public.ecr.aws/d7w2o6p0/dbt-cdsw:1.1.11 Find most updated docker image here Default Enable Make this runtime the default for all new sessions in this workspace Step 2. Set up projects for stage and prod (automated) environments Admins create projects for stage and prod (and other automated) environments. Analysts can create their own projects. Creating a new project for stage/prod requires the following steps: Create a CDSW project Set up environment variables for credentials and scripts Step 2.1. Create a CDSW project From the workspace screen, click on Add Project Fill out the basic information for the project Field Value Notes Project Name prod-marketing Name of the dbt project running in stage/prod Project Description Project Visibility Private Recommend private for prod and stage Initial Setup Blank We will set up git repos separately via CDSW Jobs later in prod/stage. Click the Create Project button on the bottom right corner of the screen Step 2.2. Set environment variables to be used in automation To avoid checking profile parameters (users credentials) to git, the user SSH key can be configured for access to the git repo (How to work with GitHub repositories in CML/CDSW - Cloudera Community - 303205) Click Project Settings from the side menu on the project home page and click on Advanced tab Enter the environment variables. Click on to add more environment variables. Key Value Notes DBT_GIT_REPO https://github.com/cloudera/dbt-impala-example.git Repository that has the dbt models and profiles.yml DBT_IMPALA_HOST DBT_IMPALA_HTTP_PATH DBT_IMPALA_USER DBT_IMPALA_PASSWORD DBT_IMPALA_DBNAME DBT_IMPALA_SCHEMA DBT_SPARK_CDE_HOST DBT_SPARK_CDE_AUTH_ENDPOINT DBT_SPARK_CDE_PASSWORD DBT_SPARK_CDE_USER DBT_SPARK_CDE_SCHEMA DBT_SPARK_LIVY_HOST DBT_SPARK_LIVY_USER DBT_SPARK_LIVY_PASSWORD DBT_SPARK_LIVY_DBNAME DBT_SPARK_LIVY_SCHEMA DBT_HIVE_HOST DBT_HIVE_HTTP_PATH DBT_HIVE_USER DBT_HIVE_PASSWORD DBT_HIVE_SCHEMA Adapter specific configs passed as environment variables DBT_HOME Path to home directory Note: There could be different environment variables that need to be set depending on the specific engine and access methods like Kerberos or LDAP. 
Refer to the engine-specific adapter documentation to get the full list of parameters in the credentials. Environment variables will look as shown below: Click the Submit button on the right side of the section Note: You will have to use the credential environment variables in the profiles.yml file in the dbt project that is checked into DBT_GIT_REPO. So, the profiles.yml would look like below: dbt_impala_demo: outputs: dev_cia_cdp: type: impala host: "{{ env_var('DBT_IMPALA_HOST') }}" http_path: "{{ env_var('DBT_IMPALA_HTTP_PATH') }}" port: 443 auth_type: ldap use_http_transport: true use_ssl: true username: "{{ env_var('DBT_IMPALA_USER') }}" password: "{{ env_var('DBT_IMPALA_PASSWORD') }}" dbname: "{{ env_var('DBT_IMPALA_DBNAME') }}" schema: "{{ env_var('DBT_IMPALA_SCHEMA') }}" target: dev_cia_cdp Note: Environment variables are really flexible. You can use them for any field in the profiles.yml jaffle_shop: target: dev outputs: dev: type: "{{ env_var('DBT_ENGINE_TYPE') }}" host:"{{ env_var('DBT_ENGINE_HOST') }}" user: "{{ env_var('DBT_USER') }}" password: "{{ env_var('DBT_PASSWORD') }}" port: "{{ env_var('DBT_ENGINE_PORT') }}" dbname: "{{ env_var('DBT_DBNAME') }}" schema: "{{ env_var('DBT_SCHEMA') }}" threads: "{{ env_var('DBT_THREADS') }}" In Step 3.3. Setup dbt debug job, you will be able to test and make sure that the credentials provided to the warehouse are accurate. Step 3. Create jobs and pipeline for stage/prod CDSW jobs will be created for the following jobs to be run in order as a pipeline to be run on a regular basis whenever there is a change pushed to the dbt models repository. Get the scripts for the different jobs git clone/pull dbt debug dbt run dbt docs generate All the scripts for the jobs are available in the custom runtime that is provided. These scripts rely on the project environment variables that have been created in the previous section. Step 3.1 Authenticate Before starting a session you may need to authenticate, the steps may vary based on the authentication mechanism. Go to User Settings from the left menu: If your instance uses the Kerberos mechanism, select Hadoop Authentication and fill in the Principal and Credentials and click Authenticate. Step 3.2 Setup scripts location Scripts are present under the /scripts folder as part of the dbt custom runtime. However, the CDSW jobs file interface only lists the files under the home directory (/home/cdsw). Create a session with the custom runtime: and from the terminal command line, copy the scripts to the home folder. cp -r /scripts /home/cdsw/ Step 3.3 Setup git clone job Create a new job for git clone and select the job script from the scripts folder updated in Step 3.2 Update the arguments and environment variables, and create the job. Field Name Value Comment Name job-git-clone Script scripts/job-git-clone.py This is the script that would be executed in this job step. Arguments /home/cdsw/dbt-impala-example/dbt_impala_demo Path of the dbt project file, which is part of the repo. 
Editor Workbench Kernel Python 3.9 Edition dbt cdsw custom runtime Version 1.1 Runtime Image http://public.ecr.aws/d7w2o6p0/dbt-cdsw:1.1.11 Find most updated docker image here Schedule Recurring; Every hour This can be configured as either Manual/Recurring or Dependent Use a cron expression Check, 0**** Default value Resource profile 1vCPU/2GiB Timeout In Minutes - Optional timeout for the job Environment Variables These can be used to overwrite settings passed at project level (Section 3.2) Job Report Recipients Recipients to be notified on job status Attachments Attachments if any Step 3.3. Setup dbt debug job Field Name Value Comment Name job-dbt-debug Script scripts/job-dbt-debug.py This is the script that would be executed in this job step. Arguments /home/cdsw/dbt-impala-example/dbt_impala_demo Other command line arguments to dbt Path of the dbt project file, which is part of the repo. Ex: –profiles-dir Editor Workbench Kernel Python 3.9 Edition dbt custom runtime Version 1.1 Runtime Image http://public.ecr.aws/d7w2o6p0/dbt-cdsw:1.1.11 Find most updated docker image here Schedulable Dependent Make sure that this job runs only after cloning/updating the git repo. job-git-clone Job-dbt-debug is dependent on job-git-clone and will run only after it completes. Resource profile 1vCPU/2GiB Timeout In Minutes - Optional timeout for the job Environment Variables These can be used to overwrite settings passed at the project level (Section 3.2) Job Report Recipients Recipients to be notified on job status Attachments Attachments if any Step 3.4. Setup dbt run job Field Name Value Comment Name job-dbt-run Script scripts/job-dbt-run.py This is the script which would be executed in this job step. Arguments /home/cdsw/dbt-impala-example/dbt_impala_demo Other command line arguments to dbt Path of the dbt project file, which is part of the repo. Ex: –profiles-dir Editor Workbench Kernel Python 3.9 Edition dbt custom runtime Version 1.1 Runtime Image http://public.ecr.aws/d7w2o6p0/dbt-cdsw:1.1.11 Find most updated docker image here Schedulable Dependent Make sure that this job depends on dbt-debug job. job-dbt-debug Job-dbt-run is dependent on job-dbt-debug, and will run only after it completes. Resource profile 1vCPU/2GiB Timeout In Minutes - Optional timeout for the job Environment Variables These can be used to overwrite settings passed at project level (Section 3.2) Job Report Recipients Recipients to be notified on job status Attachments Attachments if any Step 3.5. Setup dbt docs generate job Field Name Value Comment Name dbt-docs-generate Script scripts/dbt-docs-generate.py This is the script which would be executed in this job step. Arguments /home/cdsw/dbt-impala-example/dbt_impala_demo Path of the dbt project file, which is part of the repo. Editor Workbench Kernel Python 3.9 Edition dbt custom runtime Version 1.1 Runtime Image http://public.ecr.aws/d7w2o6p0/dbt-cdsw:1.1.11 Find most updated docker image here Schedulable Dependent Generate docs only after the models have been updated. dbt-docs-generate dbt-docs-generate is dependent on job-dbt-run, and will run only after it completes. 
Resource profile 1vCPU/2GiB Timeout In Minutes - Optional timeout for the job Environment Variables These can be used to overwrite settings passed at project level (Section 3.2) Job Report Recipients Recipients to be notified on job status Attachments Attachments if any After following the 4 steps above, there will be a pipeline with the 4 jobs that run one after the other, only when the previous job succeeds Step 4. Create an app to serve documentation The dbt docs generate job generates static HTML documentation for all the dbt models. In this step, you will create an app to serve the documentation. The script for the app will be available in the custom runtime that is provided. Within the Project page, click on Applications Create a new Application Click Set Environment Variable Add the environment variable TARGET_PATH. This should be the same path where dbt docs generated the target folder inside the dbt project. Field Value Comment Name dbt-prod-docs-serve Domain dbtproddocs Script scripts/dbt-docs-serve.py Python script to serve the static HTML doc page generated by dbt docs generate. This is part of the CDSW runtime image. Runtime dbt custom runtime dbt custom runtime which was added to the runtime catalog. Environment Variables TARGET_PATH Target folder path for dbt docs. E.g. /home/cdsw/jaffle_shop/target/ Make sure of the exact path, especially the ‘/’ characters. Note: To update any of the above parameters go back to application -> Application details. Settings -> update application. Click Restart to restart the application. A similar application can be set up to extract the dbt logs Field Value Comment Name dbt-logs-serve Domain dbtlogsserve Script scripts/dbt-logs-serve.py Python script to serve the static HTML doc page generated by dbt docs generate. This is part of the CDSW runtime image. Runtime dbt custom runtime dbt custom runtime which was added to the runtime catalog. Environment Variables TARGET_PATH Target folder path for dbt docs. E.g. /home/cdsw/jaffle_shop/target/ Make sure of the exact path, especially the ‘/’ characters. Description of Production/Stage Deployments Details and logs for jobs Logs are available in the workspace in the project folder The job run details and job logs can be found as follows: Individual job history can be seen at Job run details can be seen by selecting one of the runs Details and logs for doc serve app Logs for running application can be found in applications->logs To fetch the logs, launching the application will enable the log file to be downloaded via browser. Analyst steps Prerequisites Each analyst should have their own credentials to the underlying warehouse. They would need to set a workload password by following Setting the workload password Each analyst has their own schema/database to do their development and testing Each analyst has access to the git repo with the dbt models and has the ability to create PRs in that git repo with their changes. Admins may have to set up proxies to enable this. If you are creating your own repo as an analyst, refer to this Getting started with dbt Core The user SSH key can be configured for access to the git repo (How to work with Github repositories in CML/CDSW - Cloudera Community - 303205) Each analyst has access to the custom runtime that is provided by Cloudera. Admins may have to set up proxies to enable this. Each analyst has permission to create their own project. We suggest that each analyst create their own dev project to work in isolation from other analysts. 
If not, Admins will have to create the projects using the steps below and provide access to analysts. Step 1. Setup a dev project Step 1.1. Create a CDSW project From the workspace screen, click on Add Project Fill out the basic information for the project Field Value Notes Project Name username-marketing If not shared project, we suggest prefixing the name of the project with the user name so that it is easily identified Project Description Project Visibility Private Recommend private for prod Initial Setup Blank Click the Create Project button on the bottom right corner of the screen Step 1.2. Set environment variables To avoid checking profile parameters (user’s credentials) to git, we leverage environment variables that are set at a Project-level. Click Project Settings from the side menu on the project home page and click on the Advanced tab Enter the environment variables. Click on to add more environment variables. Key Value Notes DBT_USER analyst-user-name Username used by the analyst. See prerequisites. DBT_PASSWORD workload-password Set the workload password by following Setting the workload password DBT_HOST Instance host name DBT_DBNAME Db name to be worked on DBT_SCHEMA Schema used Note: There could be different environment variables that need to be set depending on the specific engine and access methods like Kerberos or LDAP. Refer to the engine-specific adapter documentation to get the full list of parameters in the credentials. Environment variables will look like as shown below: Click the Submit button on the right side of the section Note: You will have to use the credential environment variables in the profiles.yml file in the dbt project that is checked into DBT_GIT_REPO. So, the profiles.yml would look like below: jaffle_shop: target: dev outputs: dev: type: impala host:coordinator-dbt-impala.dw-ciadev.cna2-sx9y.cloudera.site user: "{{ env_var('DBT_USER') }}" password: "{{ env_var('DBT_PASSWORD') }}" port: 5432 dbname: jaffle_shop schema: dbt_alice threads: 4 Note: Environment variables are really flexible. You can use them for any field in the profiles.yml jaffle_shop: target: dev outputs: dev: type: "{{ env_var('DBT_ENGINE_TYPE') }}" host:"{{ env_var('DBT_ENGINE_HOST') }}" user: "{{ env_var('DBT_USER') }}" password: "{{ env_var('DBT_PASSWORD') }}" port: "{{ env_var('DBT_ENGINE_PORT') }}" dbname: "{{ env_var('DBT_DBNAME') }}" schema: "{{ env_var('DBT_SCHEMA') }}" threads: "{{ env_var('DBT_THREADS') }}" Step 2. Setup session for development flow Step 2.1. Create a new session On the project, page click on New Session Fill in the form for the session Field Value Notes Session name dev-user-session This private session will be used by the analyst for their work Runtime Editor Workbench Kernel Python 3.9 Edition dbt cdsw custom runtime Version 1.1 Automatically picked up from the runtime Enable Spark Disabled Runtime image Automatically picked up Resource Profile 1 vCPU/2GB Memory Click on “Start Session”. Click on Terminal Access to open a shell Step 2.2. Clone dbt repository to start working on it Clone the repository from within the terminal. Note that the ssh key for git access is a prerequisite. git clone git@github.com:cloudera/dbt-hive-example.git Once you clone the repo, you can browse the files in the repo and edit them in the built-in editor. If the repository does not already have a profiles.yml, create your own yml file within the terminal and run dbt debug to verify that the connection works. 
$ mkdir $HOME/.dbt $ cat > $HOME/.dbt/profiles.yml dbt_impala_demo: outputs: dev: type: impala host: demodh-manager0.cdpsaasd.eu55-rsdu.cloudera.site port: 443 dbname: dbt_test schema: dbt_test user: "{{ env_var('DBT_USER') }}" password: "{{ env_var('DBT_PASSWORD') }}" auth_type: ldap use_http_transport: true use_ssl: true http_path: demodh/cdp-proxy-api/impala target: dev $ cd dbt-impala-example/dbt_impala_demo/ $ dbt debug Now you are all set! You can start making changes to your models in the code editor and testing them. Conclusion In this document we have shown the different requirements that need to be met to support the full software development life cycle of dbt models. The table below shows how those requirements have been met. Requirement Will this option satisfy the requirement? If yes, how? Have multiple environments Dev Stage Prod Yes, as explained above. Have a dev setup where different users can do the following (in an isolated way): 1. Make changes to models Yes, per user in their Session in the workspace, having checked out their own branch of the given dbt project codebase. 2. Test changes Yes 3. See logs of tests Yes 4. Update docs in the models and see docs Yes, by running the dbt docs server as a CDSW Application. Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments Yes, either: Simple git-push and delegating to external CI/CD system Configuring a CDSW Job. See logs in stage/prod of the dbt runs Yes See dbt docs in stage/prod Yes Convenient for analysts - no terminal/shells/installing software on a laptop. Should be able to use a browser. Yes, he gets a shell via CDSW Support isolation across different users using it in dev Yes, each Session workspace is isolated. Support isolation between different environments (dev/stage/prod) Yes Secure login - SAML etc Yes, controlled by the customer via CDSW Be able to upgrade the adapters or core seamlessly Cloudera will publish new Runtimes. versions of the python packages to PyPI Vulnerability scans and fixing CVEs Cloudera will do scans of the adapters and publish new versions with fixes. Ability to run dbt run regularly to update the models in the warehouse Yes, via CDSW Jobs. You can reach out to innovation-feedback@cloudera.com if you have any questions.
09-09-2022
09:30 PM
This week’s release includes: Adapters & docker images: dbt-spark-livy - 1.3.1 (Dec 5th, 2022) dbt-spark-cde - 1.3.0 (Dec 5th, 2022) dbt-hive - 1.3.1 (Dec 5th, 2022) dbt-impala - 1.3.1 (Dec 5th, 2022) For CML/CDSW deployment public.ecr.aws/d7w2o6p0/dbt-cml:1.2.0 (with Jupyter Interface): Name Version dbt-core 1.2.0 dbt-impala 1.2.0 dbt-hive 1.2.0 dbt-spark-cde 1.2.0 dbt-spark-livy 1.2.0 CML Base image ml-runtime-jupyterlab-python3.9-standard:2022.04.1-b6 .py scripts (Utility) n/a public.ecr.aws/d7w2o6p0/dbt-cdsw:1.2.0 (with workbench editor): Name Version dbt-core 1.2.0 dbt-impala 1.2.0 dbt-hive 1.2.0 dbt-spark-cde 1.2.0 dbt-spark-livy 1.2.0 CML Base image ml-runtime-workbench-python3.9-standard:2022.04.1-b6 .py scripts (Utility) n/a Note: Both CML and CDSW docker image works with CDSW, though later one only support workbench editor while setting up jobs. Supported infrastructure: All adapters we have released support CDP Public Cloud LDAP with Knox Our Impala and Hive adapters support CDP Private Cloud with Kerberos, we are testing our Spark adapters for the same Both Impala and hive adapters support Local Server without authentication Deployment Options: Form Factor On-Prem Cloud Cloud Cloud Cloud PvC Data Services CDPOne CDP PaaS Data Services CDP PaaS Data Services CDP PaaS Datahub dbt SDLC requirements CDSW CML CML CDE CML Tested artifacts dbt core and adapters Custom runtime Custom Runtime Custom Runtime Pypi packages Custom Runtime Authoring/testing dbt develop CDSW workbench session CML Jupyter Session CML Jupyter Session VSCode/Other IDE CML Jupyter Session Orchestration dbt run CDSW job CML jobs CML jobs Compile dbt models to airflow dag OR Run dbt run as custom bash operator CML jobs Collaboration dbt doc serve CDSW App CML App CML App Flask server OR S3 CML App Past adapter releases: dbt-hive adapter 1.3.0 (Nov 24th, 2022) 1.2.0 (Nov 4th, 2022) Now dbt-hive adapter supports dbt core 1.2.0. 1.1.5 (Oct 28th, 2022) 1.1.4 (Sep 23rd, 2022) Adding support for Kerberos auth mechanism. Along with an updated instrumentation package. 1.1.3 (Sep 9th, 2022) Added a macro to detect the hive version, to determine if the incremental merge is supported by the warehouse. 1.1.2 (Sep 2nd, 2022) dbt seeds command won't add additional quotes to string, which was a known bug in the previous release. All warehouse properties(Cluster_by, Comment, external table, incremental materialization methods, etc) are tested and should be working smoothly with the adapter. Added instrumentation to the adapter 1.1.1 (August 23rd, 2022) Cloudera released the first version of the dbt-hive adapter dbt-impala adapter 1.3.0 (Nov 18th, 2022) 1.2.0 (Nov 2nd, 2022) Now dbt-impala adapter supports dbt core 1.2.0 1.1.5 (Oct 28th, 2022) 1.1.4 (Sep 30th, 2022) Now any dbt profiles errors or connection issues using dbt commands will show a user-friendly message for dbt-impala adapter. Added user-agent string to improve instrumentation 1.1.3 (Sep 17th, 2022) Adding support for append mode when partition_by clause is used. Along with an updated instrumentation package. 
1.1.2 (Aug 5th, 2022) Now dbname in profile.yml file is optional; Updated a dependency in README; dbt-core version updates automatically in setup.py 1.1.1 (Jul 16th, 2022) Bug fixes for a specific function 1.1.0 (Jun 9th, 2022) Adapter migration to dbt-core-1.1.0; added time-out for snowplow endpoint to handle air-gapped env 1.0.6 (May 23rd, 2022) Added support to insert_overwrite mode for incremental models and added instrumentation to the adapter 1.0.5 (Apr 29th, 2022) Added support to an EXTERNAL clause with table materialization & improved error handling for relation macros 1.0.4 (Apr 1st, 2022) Added support to Kerberos authentication method and dbt-docs 1.0.1 (Mar 25th, 2022) Cloudera released the first version of the dbt-impala adapter dbt-spark-cde adapter 1.2.0 (Nov 14th, 2022) 1.1.7 (Oct 28th, 2022) 1.1.6 (Oct 18th, 2022) Added way to switch on/off the SSL certificate verification for CDE endpoint. Along with updated instrumentation package. 1.1.5 ( Oct 15th, 2022) 1.1.4 (Sep 23rd, 2022) During internal testing, we came across an issue where the second run of the incremental model was failing. We have fixed that issue. For improved debugability, if a CDE job fails adapter will create a new log file inside the dbt log folder which contains the stderr output. A sample file looks like this: dbt-job-1663938116617-00000255.stderr.log 1.1.3 (Sep 10th, 2022) The detail for each query is now available in the logs: dbt.log Spark CDE session parameters can be provided via dbt's profile.yml file as key: value pair Any dbt profiles errors or connection issues using dbt commands will show a user-friendly message: 1.1.2 (Sep 2nd, 2022) Time out is added while polling job status to save resource hogging and code is clean and if enabled, spark events can also be seen with a new method 1.1.1 (Aug 26th, 2022) Improved debugging process to track JobId, Query, and session time. Access stderr and stdout of CDE jobs in dbt logs. 1.1.0 (Jul 21st, 2022) Cloudera released the first version of the dbt-spark-cde adapter that supports connection to Cloudera Data Engineering backend using CDE APIs dbt-spark-livy adapter 1.3.0 (Nov 21st, 2022) 1.2.0 (Nov 9th, 2022) 1.1.8 (Oct 28th, 2022) 1.1.7 (Oct 18th, 2022) Added way to switch on/off the SSL certificate verification for Livy endpoint. 1.1.6 (Oct 15th, 2022) 1.1.5 (Sep 30th, 2022) Added Kerberos support: Along with an updated instrumentation package. 
1.1.4 (Sep 17th, 2022) Now any dbt profiles errors or connection issues using dbt commands will show a user-friendly message for dbt-spark adapters Spark session parameters can be provided via dbt's profile.yml file as key: value pair for dbt-spark adapters 1.1.3 (Jul 29th, 2022) Added instrumentation to the adapter and updated Setup.py as per upstream 1.1.2 (Jul 1st, 2022) Bug fixes to show an error when SQL model has some issue 1.1.1 (Jul 1st, 2022) Instructions for IDBroker Mappings in the ReadMe file and some minor changes to the setup and version files 1.1.0 (Jun 17th, 2022) Cloudera released the first version of the dbt-Spark-livy adapter to support Livy-based connection to the Cloudera Data Platform Available resources: Articles: Running dbt core with adapters for Hive, Spark, and Impala within CDP Public Cloud Running dbt core with adapters for Hive, Spark, and Impala within CDP Private Cloud Quick start guides for specific adapters dbt-impala dbt-hive dbt-spark-livy dbt-spark-cde Bundled offering for CML & CDSW deployment: public.ecr.aws/d7w2o6p0/dbt-cml:1.1.16 public.ecr.aws/d7w2o6p0/dbt-cdsw:1.1.16 GitHub repository: Source Code: A dbt adapter for Apache Impala & Cloudera Data Platform A dbt adapter for Apache Hive & Cloudera Data Platform A dbt adapter for Apache Spark with CDE API support A dbt adapter for Apache Spark-livy Sample project: A sample project for the dbt-impala adapter with Cloudera Data Platform A sample project for the dbt-hive adapter with Cloudera Data Platform A sample project for the dbt-spark adapter with CDE API support on the Cloudera Data Platform A sample project for the dbt-spark-livy adapter on the Cloudera Data Platform Python Packages: dbt-impala · PyPI dbt-hive · PyPI dbt-spark-cde · PyPI dbt-spark-livy · PyPI
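For a quick start with the artifacts listed above, the commands below are a minimal sketch: the image tag is the dbt-cml 1.2.0 tag from this week's release (substitute 1.1.16 or another tag from the resources list as needed), and the pip line installs the adapters unpinned, so pin versions to match the release notes if you need a specific release.
# Pull the bundled image for CML/CDSW deployments (tag from this week's release above)
docker pull public.ecr.aws/d7w2o6p0/dbt-cml:1.2.0
# Or install the adapters from PyPI into a virtual environment; each adapter
# brings in a compatible dbt-core as a dependency
python3 -m venv dbt-venv && source dbt-venv/bin/activate
pip install dbt-impala dbt-hive dbt-spark-cde dbt-spark-livy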
08-28-2022
08:26 PM
Hi Asim, if you can share more details about what you are trying to achieve here, we can help you with that. Last week we released the dbt-hive adapter; if that fits your use case, here is an article to get started with it.
08-26-2022
06:12 PM
Overview Cloudera has implemented dbt adapters for Hive, Spark (CDE and Livy), and Impala. In addition to providing the adapters, Cloudera is offering a turn-key solution to be able to manage the end-to-end software development life cycle (SDLC) of dbt models. This solution is available in all clouds as well as on-prem deployments of CDP. It is useful for customers who would not like to use dbt Cloud for security reasons or the lack of availability of adapters in dbt cloud. We have identified the following requirements for any solution that supports the end-to-end SDLC of data transformation pipelines using dbt. Have multiple environments Dev Stage/Test Prod Have a dev setup where different users can do the following in an isolated way: Make changes to models Test changes See logs of tests Update docs in the models and see docs Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments See logs in stage/prod of the dbt runs See dbt docs in stage/prod Orchestration: Ability to run dbt run regularly to update the models in the warehouse, or based on events(Kafka) Everything should be part of one application(tool) like CDP or CML Alerting and Monitoring, if there is a failure how IT team will know that Any deployment for dbt should also satisfy the following Convenient for analysts - no terminal/shells/installing software on a laptop. Should be able to use a browser. Support isolation across different users using it in dev Support isolation between different environments (dev/stage/prod) Secure login - SAML Be able to upgrade the adapters or core seamlessly Vulnerability scans and fixing CVEs Able to add and remove users for dbt - Admin privilege Cloudera Data Platform has a service, CML, which offers users the ability to build and manage all of their machine learning workloads. The same capabilities of CML can also be used to satisfy the requirements for the end-to-end SDLC of dbt models. In this document, we will show how an admin can set up the different capabilities in CML like workspaces, projects, sessions, and runtime catalogs so that an analyst can work with their dbt models without having to worry about anything else. First, we will show how an admin can set up CML workspaces for different environments - development/stage/production CML runtime catalog within a workspace, with the Cloudera provided container with dbt-core and all adapters supported by Cloudera CML project for stage/prod (i.e., automated/non-development) environments. Analysts create their own projects for their development work. CML jobs to run the following commands in an automated way on a regular basis git clone dbt debug dbt run dbt doc generate CML apps to serve model documentation in stage/prod Next, we will show how an analyst can build, test, and merge changes to dbt models by using CML project to work in isolation without being affected by other users CML user jupyter sessions - for interactive IDE of dbt models git to get the changes reviewed and pushed to production Finally, we will show how by using CML all of the requirements listed above can be satisfied. Administrator steps Prerequisites Administrator should have access to the CDP Control Plane and admin permissions to CML There should be a CDP Environment available for use. If there isn’t one, create one as per documentation Register an AWS environment from CDP UI Access to a git repository with basic dbt scaffolding (using proxies if needed). 
If such a repository does not exist, follow the steps in Getting started with dbt Core Access to custom runtime catalog (using proxies if needed) Machine user credentials - user/pass or kerberos - for stage and production environments. See CDP machine user on creating machine users for hive/impala/spark. Note: The document details a simple setup within CML where we will use one workspace for dbt for all of the dev/stage/prod environments. use one project each for stage/prod and one per user to provide access isolation use one jupyter session per user/analyst for their development, testing, and to push PRs A more robust setup where dev, stage, and prod are isolated from each other at the hardware level can also be done by creating separate workspaces for each of dev, stage, and prod environments. We have tested this offering only on AWS at this point but we are working to test it on other public cloud providers. Step 1. Create a workspace for dbt Create a new workspace for dbt by clicking on Provision Workspace in the CML Workspaces tab. This will open up a form with several fields. Enable the toggle “Advanced Options”. See the table below for the recommended values for each of the fields Field Name Recommended Value Description Workspace Name dbt-workspace This workspace will be used for all of dev/stage/prod. Or you can create one each for dev/stage/prod Environment wh-env Important: Pick the same CDP Environment that has the datahub cluster with hive/impala/spark or the CDW (hive/impala) OR CDE (spark) services. Otherwise, you will have to set up the right network settings to allow access to the hive/impala/spark clusters. CPU Settings Instance Type m5.2xlarge,8 CPU,32GiB Autoscale Range 0-2 Root Volume Size 512 GPU Instances Disable For dbt, the user won’t need GPU Instances so that can be turned off. Network Settings Subnets for Worker Nodes subnet1, subnet2 Pick the subnets that have the datahub cluster or the CDW/CDE services. See documentation below on how to get the subnet values. Subnets for Load Balancer subnet1,subnet2 Same as worker nodes Load Balance Source Ranges Skip Enable Fully Private Cluster Disable Enable Public IP Address for Load Balancer Enable Restrict access to Kubernetes API server to authorized IP ranges Disable Production Machine Learning Enable Governance Disable Enable Model Metrics Disable Other Settings Enable TLS Enable Enable Monitoring Enable Skip Validation Enable Tags Skip CML Static Subdomain Skip Click on Provision Workspace after filling out the form. Workspace creation takes several minutes. You can see the workspace being created along with logs in the workspace page. Once created, the workspace shows up in the workspace list. Click on the workspace to then work within the workspace Find the subnets needed for CML Workspace Admin will need to choose two public subnets from the drop-down. Two subnets are a CML requirement. We recommend using the same subnets and running the warehouse instances. If the admin is not sure which two subnets to select, they can obtain the subnets by following these steps: From CDP Console, click on the Management Console tile and then click on the environment name from the list which is used to create the CML workspace, as shown below: On the Environments page, click on the Summary tab: Scroll down to the Network section and all the subnets are listed in the section, as shown below Step 2. 
Create and enable a custom runtime in CML with dbt In the workspace screen, click on “Runtime Catalog” to create a custom runtime with dbt. Step 2.1. Create a new runtime environment Select the Runtime Catalog from the side menu, and click Add Runtime button: Find most recent docker image here, paste most recent image and click on Validate. When validation succeeds, click on “Add to Catalog”. The new runtime will show up in the list of runtimes Step 2.2. Set runtime as default for all new sessions In the workspace’s side menu, select Site Administration and scroll down to the Engine Images section. Add a new Engine image by adding the following values Field Value Description Description dbt-cml Respository:Tag public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 Find most updated docker image here Default Enable Make this runtime the default for all new sessions in this workspace Step 3. Setup projects for stage and prod (automated) environments Admins create projects for stage and prod (and other automated) environments. Analysts can create their own projects. Creating a new project for stage/prod requires the following steps: Create a CML project Set up environment variables for credentials and scripts Step 3.1. Create a CML project From the workspace screen, click on Add Project Fill out the basic information for the project Field Value Notes Project Name prod-marketing Name of the dbt project running in stage/prod Project Description Project Visibility Private Recommend private for prod and stage Initial Setup Blank We will set up git repos separately via CML Jobs later in prod/stage. Runtime setup - Click on the “Advanced” tab. Add a new runtime by providing the following values and click on add runtime Field Value Notes Editor JupyterLab Kernel Python 3.9 Edition dbt custom runtime This is the custom runtime that was added earlier Version 1.2.0 This version is automatically picked up from the custom runtime. Click the Create Project button on the bottom right corner of the screen Note: Since JupyterLab creates checkpoints, it may mess up with the dbt project. Follow the steps below to avoid this: Redirecting Jupyter Lab’s checkpoints Since JupyterLab creates checkpoints in each directory this interrupts with dbt project file structure and may cause an error though we can redirect these checkpoints to a specified folder. Follow the steps to redirect Jupyter lab checkpoints: Open the terminal by clicking the Terminal tile. The terminal looks like this: Create a new directory in the /home/cdsw folder by running the following command: mkdir checkpoints Copy the script to .jupyter/ directory: cp /build/jupyter_notebook_config.py .jupyter/ Restart the session and checkpoints will be redirected to the specified directory Step 3.2. Set environment variables to be used in automation To avoid checking profile parameters (users credentials) to git, the user SSH key can be configured for access to the git repo (How to work with Github repositories in CML/CDSW - Cloudera Community - 303205) Click Project Settings from the side menu on the project home page and click on Advanced tab Enter the environment variables. Click on to add more environment variables. 
Key Value Notes DBT_GIT_REPO https://github.com/cloudera/dbt-impala-example.git Repository that has the dbt models and profiles.yml DBT_IMPALA_HOST DBT_IMPALA_HTTP_PATH DBT_IMPALA_USER DBT_IMPALA_PASSWORD DBT_IMPALA_DBNAME DBT_IMPALA_SCHEMA DBT_SPARK_CDE_HOST DBT_SPARK_CDE_AUTH_ENDPOINT DBT_SPARK_CDE_PASSWORD DBT_SPARK_CDE_USER DBT_SPARK_CDE_SCHEMA DBT_SPARK_LIVY_HOST DBT_SPARK_LIVY_USER DBT_SPARK_LIVY_PASSWORD DBT_SPARK_LIVY_DBNAME DBT_SPARK_LIVY_SCHEMA DBT_HIVE_HOST DBT_HIVE_HTTP_PATH DBT_HIVE_USER DBT_HIVE_PASSWORD DBT_HIVE_SCHEMA DBT_HOME DBT_GIT_REPO Adapter specific configs passed as environment variables Note: There could be different environment variables that need to be set depending on the specific engine and access methods like Kerberos or LDAP. Refer to the engine-specific adapter documentation to get the full list of parameters in the credentials. Environment variables will look like as shown below: Click the Submit button on the right side of the section Note: You will have to use the credential environment variables in the profiles.yml file in the dbt project that is checked into DBT_GIT_REPO. So, the profiles.yml would look like below: dbt_impala_demo: outputs: dev_cia_cdp: type: impala host: "{{ env_var('DBT_IMPALA_HOST') }}" http_path: "{{ env_var('DBT_IMPALA_HTTP_PATH') }}" port: 443 auth_type: ldap use_http_transport: true use_ssl: true username: "{{ env_var('DBT_IMPALA_USER') }}" password: "{{ env_var('DBT_IMPALA_PASSWORD') }}" dbname: "{{ env_var('DBT_IMPALA_DBNAME') }}" schema: "{{ env_var('DBT_IMPALA_SCHEMA') }}" target: dev_cia_cdp Note: Environment variables are really flexible. You can use them for any field in the profiles.yml jaffle_shop: target: dev outputs: dev: type: "{{ env_var('DBT_ENGINE_TYPE') }}" host:"{{ env_var('DBT_ENGINE_HOST') }}" user: "{{ env_var('DBT_USER') }}" password: "{{ env_var('DBT_PASSWORD') }}" port: "{{ env_var('DBT_ENGINE_PORT') }}" dbname: "{{ env_var('DBT_DBNAME') }}" schema: "{{ env_var('DBT_SCHEMA') }}" threads: "{{ env_var('DBT_THREADS') }}" In Step 4.3. Setup dbt debug job, you will be able to test and make sure that the credentials provided to the warehouse are accurate. Step 4. Create jobs and pipeline for stage/prod CML jobs will be created for the following jobs to be run in order as a pipeline to be run on a regular basis whenever there is a change pushed to the dbt models repository. Get the scripts for the different jobs git clone/pull dbt debug dbt run dbt docs generate All the scripts for the jobs are available in the custom runtime that is provided. These scripts rely on the project environment variables that have been created in the previous section. Step 4.1 Setup scripts location Scripts are present under the /scripts folder as part of the dbt custom runtime. However, the CML jobs file interface only lists the files under the home directory (/home/cdsw). Create a session with the custom runtime: and from the terminal command line, copy the scripts to the home folder. cp -r /scripts /home/cdsw/ Step 4.2. Setup git clone job Create a new job for git clone and select the job script from the scripts folder updated in Step 4.1 Update the arguments and environment variables, and create the job. Field Name Value Comment Name job-git-clone Script scripts/job-git-clone.py This is the script which would be executed in this job step. Arguments /home/cdsw/dbt-impala-example/dbt_impala_demo Path of the dbt project file, which is part of the repo. 
Editor JupyterLab Kernel Python 3.9 Edition dbt custom runtime Version 1.1 Runtime Image public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 Find most updated docker image here Schedule Recurring; Every hour This can be configured as either Manual/Recurring or Dependent Use a cron expression Check, 0**** Default value Resource profile 1vCPU/2GiB Timeout In Minutes - Optional timeout for the job Environment Variables These can be used to overwrite settings passed at project level (Section 3.2) Job Report Recipients Recipients to be notified on job status Attachments Attachments if any Step 4.3. Setup dbt debug job Field Name Value Comment Name job-dbt-debug Script scripts/job-dbt-debug.py This is the script that would be executed in this job step. Arguments /home/cdsw/dbt-impala-example/dbt_impala_demo Path of the dbt project file, which is part of the repo. Editor JupyterLab Kernel Python 3.9 Edition dbt custom runtime Version 1.1 Runtime Image public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 Find most updated docker image here Schedulable Dependent Make sure that this job runs only after cloning/updating the git repo. job-git-clone Job-dbt-debug is dependent on job-git-clone, and will run only after it completes. Resource profile 1vCPU/2GiB Timeout In Minutes - Optional timeout for the job Environment Variables These can be used to overwrite settings passed at project level (Section 3.2) Job Report Recipients Recipients to be notified on job status Attachments Attachments if any Step 4.4. Setup dbt run job Field Name Value Comment Name job-dbt-run Script scripts/job-dbt-run.py This is the script which would be executed in this job step. Arguments /home/cdsw/dbt-impala-example/dbt_impala_demo Path of the dbt project file, which is part of the repo. Editor JupyterLab Kernel Python 3.9 Edition dbt custom runtime Version 1.1 Runtime Image public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 Find most updated docker image here Schedulable Dependent Make sure that this job depends on dbt-debug job. job-dbt-debug Job-dbt-run is dependent on job-dbt-debug, and will run only after it completes. Resource profile 1vCPU/2GiB Timeout In Minutes - Optional timeout for the job Environment Variables These can be used to overwrite settings passed at project level (Section 3.2) Job Report Recipients Recipients to be notified on job status Attachments Attachments if any Step 4.5. Setup dbt docs generate job Field Name Value Comment Name job-doc-generate Script scripts/dbt-docs-generate.py This is the script which would be executed in this job step. Arguments /home/cdsw/dbt-impala-example/dbt_impala_demo Path of the dbt project file, which is part of the repo. Editor JupyterLab Kernel Python 3.9 Edition dbt custom runtime Version 1.1 Runtime Image public.ecr.aws/d7w2o6p0/dbt-cml:1.1.15 Find most updated docker image here Schedulable Dependent Generate docs only after the models have been updated. dbt-docs-run dbt-docs-generate is dependent on job-dbt-run, and will run only after it completes. Resource profile 1vCPU/2GiB Timeout In Minutes - Optional timeout for the job Environment Variables These can be used to overwrite settings passed at project level (Section 3.2) Job Report Recipients Recipients to be notified on job status Attachments Attachments if any After following the 4 steps above, there will be a pipeline with the 4 jobs that run one after the other, only when the previous job succeeds Step 5. Create an app to serve documentation The dbt docs generate job generates static HTML documentation for all the dbt models. 
In this step, you will create an app to serve the documentation. The script for the app will be available in the custom runtime that is provided. Within the Project page, click on Applications Create a new Application Click Set Environment Variable Add the environment variable TARGET_PATH. This should be the same path where dbt docs generated the target folder inside the dbt project. Field Value Comment Name dbt-prod-docs-serve Domain dbtproddocs Script scripts/dbt-docs-serve.py Python script to serve the static HTML doc page generated by dbt docs generate. This is part of the CML runtime image. Runtime dbt custom runtime dbt custom runtime which was added to the runtime catalog. Environment Variables TARGET_PATH Target folder path for dbt docs. E.g. /home/cdsw/jaffle_shop/target/ Make sure of the exact path, especially the ‘/’ characters. Note: To update any of the above parameters go back to application -> Application details. Settings -> update application. Click Restart to restart the application. Description of Production/Stage Deployments Details and logs for jobs Logs are available in the workspace in the project folder The job run details and job logs can be found as follows: Individual job history can be seen in the Jobs section from the side menu Job run details can be seen by clicking on any Run Details and logs for doc serve app Logs for running application can be found in applications->logs Analyst steps Prerequisites Each analyst should have their own credentials to the underlying warehouse. They would need to set a workload password by following Setting the workload password Each analyst has their own schema/database to do their development and testing Each analyst has access to the git repo with the dbt models and has the ability to create PRs in that git repo with their changes. Admins may have to set up proxies to enable this. If you are creating your own repo as an analyst, refer to this Getting started with dbt Core The user SSH key can be configured for access to the git repo (How to work with Github repositories in CML/CDSW - Cloudera Community - 303205) Each analyst has access to the custom runtime that is provided by Cloudera. Admins may have to set up proxies to enable this. Each analyst has permission to create their own project. We suggest that each analyst create their own dev project to work in isolation from other analysts. If not, Admins will have to create the projects using the steps below and provide access to analysts. Step 1. Setup a dev project Step 1.1. Create a CML project From the workspace screen, click on Add Project Fill out the basic information for the project Field Value Notes Project Name username-marketing If not using a shared project, we suggest prefixing the name of the project with the user name so that it is easily identified Project Description Project Visibility Private Recommend private for prod Initial Setup Blank Runtime setup - Click on the “Advanced” tab. Add a new runtime by providing the following values and click on add runtime Field Value Notes Editor JupyterLab Kernel Python 3.9 Edition dbt custom runtime This is the custom runtime that was added by the admin in earlier steps. Version 1.1 This version is automatically picked up from the custom runtime. Click the Create Project button on the bottom right corner of the screen Step 1.2. Set environment variables To avoid checking profile parameters (users credentials) to git, we leverage environment variables that are set at a Project-level. 
Click Project Settings from the side menu on the project home page and click on the Advanced tab Enter the environment variables. Click on to add more environment variables. Key Value Notes DBT_USER analyst-user-name Username used by the analyst. See prerequisites. DBT_PASSWORD workload-password Set the workload password by following Setting the workload password DBT_HOST Instance host name DBT_DBNAME Db name to be worked on DBT_SCHEMA Schema used Note: There could be different environment variables that need to be set depending on the specific engine and access methods like Kerberos or LDAP. Refer to the engine-specific adapter documentation to get the full list of parameters in the credentials. Environment variables will look like as shown below: Click the Submit button on the right side of the section Note: You will have to use the credential environment variables in the profiles.yml file in the dbt project that is checked into DBT_GIT_REPO. So, the profiles.yml would look like below: jaffle_shop: target: dev outputs: dev: type: impala host:coordinator-dbt-impala.dw-ciadev.cna2-sx9y.cloudera.site user: "{{ env_var('DBT_USER') }}" password: "{{ env_var('DBT_PASSWORD') }}" port: 5432 dbname: jaffle_shop schema: dbt_alice threads: 4 Note: Environment variables are really flexible. You can use them for any field in the profiles.yml jaffle_shop: target: dev outputs: dev: type: "{{ env_var('DBT_ENGINE_TYPE') }}" host:"{{ env_var('DBT_ENGINE_HOST') }}" user: "{{ env_var('DBT_USER') }}" password: "{{ env_var('DBT_PASSWORD') }}" port: "{{ env_var('DBT_ENGINE_PORT') }}" dbname: "{{ env_var('DBT_DBNAME') }}" schema: "{{ env_var('DBT_SCHEMA') }}" threads: "{{ env_var('DBT_THREADS') }}" Step 2. Setup jupyter session for development flow Step 2.1. Create a new jupyter session On the project, page click on New Session Fill in the form for the session Field Value Notes Session name dev-user-session This private session will be used by the analyst for their work Runtime Editor JupyterLab Kernel Python 3.9 Edition dbt custom runtime Version 1.1 Automatically picked up from the runtime Enable Spark Disabled Runtime image Automatically picked up Resource Profile 1 vCPU/2GB Memory Click on “Start Session”. Ignore these screens if they show up. Click on Terminal to open a shell Redirecting Jupyter Lab’s checkpoints Since JupyterLab creates checkpoints in each directory this interrupts with dbt project file structure and may cause an error though we can redirect these checkpoints to a specified folder. Follow the steps to redirect Jupyter lab checkpoints: Open the terminal by clicking the Terminal tile. The terminal looks like this: Create a new directory in the /home/cdsw folder by running the following command: mkdir checkpoints Copy the script to .jupyter/ directory: cp /build/jupyter_notebook_config.py .jupyter/ Restart the session and checkpoints will be redirected to the specified directory Step 2.2. Clone dbt repository to start working on it Clone the repository from within the terminal. Note that the ssh key for git access is a prerequisite. Sample command: git clone git@github.com:cloudera/dbt-impala-example.git Once you clone the repo, you can browse the files in the repo and edit them in the built-in editor. If the repository does not already have a profiles.yml, create your own yml file within the terminal and run dbt debug to verify that the connection works. $ mkdir $HOME/.dbt
$ cat > $HOME/.dbt/profiles.yml
dbt_impala_demo:
  outputs:
    dev:
      type: impala
      host: demodh-manager0.cdpsaasd.eu55-rsdu.cloudera.site
      port: 443
      dbname: dbt_test
      schema: dbt_test
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      auth_type: ldap
      use_http_transport: true
      use_ssl: true
      http_path: demodh/cdp-proxy-api/impala
  target: dev
$ cd dbt-impala-example/dbt_impala_demo/
$ dbt debug The environment variables shown above can be used to avoid having the user credentials as part of the git repo if the profiles.yml is checked into git. Alternatively, the environment variables can also be exported on the command line before executing the dbt commands. export DBT_USER=srv_cia_test_user export DBT_PASSWORD=srv_cia_test_user_password Now you are all set! You can start making changes to your models in the code editor and testing them. Conclusion In this document we have shown the different requirements that need to be met to support the full software development life cycle of dbt models. The table below shows how those requirements have been met. Requirement Will this option satisfy the requirement? If yes, how? Have multiple environments Dev Stage Prod Yes, as explained above. Have a dev setup where different users can do the following (in an isolated way): Make changes to models Yes, per user in their Session in the workspace, having checked out their own branch of the given dbt project codebase. Test changes Yes See logs of tests Yes Update docs in the models and see docs Yes, by running the dbt docs server as a CML Application. Have a CI/CD pipeline to push committed changes in the git repo to stage/prod environments Yes, either by a simple git-push delegating to an external CI/CD system, or by configuring a CML Job. See logs in stage/prod of the dbt runs Yes See dbt docs in stage/prod Yes Convenient for analysts - no terminal/shells/installing software on a laptop. Should be able to use a browser. Yes, the user gets a shell via CML Support isolation across different users using it in dev Yes, each Session workspace is isolated. Support isolation between different environments (dev/stage/prod) Yes Secure login - SAML etc Yes, controlled by the customer via CML Be able to upgrade the adapters or core seamlessly Cloudera will publish new Runtimes and new versions of the python packages to PyPI Vulnerability scans and fixing CVEs Cloudera will do scans of the adapters and publish new versions with fixes. Ability to run dbt run regularly to update the models in the warehouse Yes, via CML Jobs. You can reach out to innovation-feedback@cloudera.com if you have any questions.
08-25-2022
11:23 PM
Overview This handbook focuses on Cloudera’s new adapter dbt-hive that enables users to use dbt on top of the Hive data warehouse system. Links to a dbt example project that can be used to bootstrap a user’s dbt journey can be found below. Version Notes: The adapter has been tested on the following version: python: 3.9.13 Cloudera Manager CM VERSION 7.6.2 RUNTIME VERSION 7.2.15 Cloudera Data Warehouse on CDP v 1.4.2-b118 dbt-core: 1.2.0 Installing dbt and requirements Prerequisites: Make sure python3, git, and pip are available on your system. Cloning the repo: Run the following command in your terminal to clone the example project repository: git clone https://github.com/cloudera/dbt-hive-example/ Installing the packages: Run the following commands in your terminal to install the required packages/dependencies using pip/pip3: python3 -m pip install dbt-core python3 -m pip install sqlparams==3.0.0 python3 -m pip install impyla==0.18 python3 -m pip install dbt-hive Cloudera Data Warehouse Cloudera Data Warehouse (CDW) is a CDP Public Cloud service for the self-service creation of independent data warehouses and data marts that autoscale up and down to meet varying workload demands. The Data Warehouse service provides isolated compute instances for each data warehouse/mart, automatic optimization and enables you to save costs while meeting SLAs. Both Apache Impala and Apache Hive are available through Cloudera Data Warehouse. Setup CDP To work with Hive in CDP, we need two things - a Hive Virtual Warehouse (part of Cloudera Data Warehouse) and a user to run queries. User For this demo, we will use a Machine User created inside CDP as the user running queries. Inside CDP > User Management, add a new Machine User and set the workload password. The steps for this are documented here: Creating a Machine User Setting the workload password With the user created & the workload password set, take a note of the Workload username & password. Notice in the below screenshot, that for a Machine User called ‘cia_test_user’ the workload username is ‘srv_cia_test_user’. Keep the workload user & password details handy for later. Setting up Cloudera Data Warehouse (CDW) Cluster Cloudera Data Warehouse (CDW) Public Cloud service, you can create independent data warehouses and data marts for teams of business analysts without the overhead of bare metal deployments. CDW includes Database Catalogs and Virtual Warehouses that you use to access your data. We will be using Hive through Cloudera Data Warehouse Start by activating your CDW Environment as documented here This will create a default database catalog, which we will use in this demo. You are able to create non-default database catalogs, as documented here Next, create a Hive Virtual Warehouse connected to the default database catalog, as documented here Choose the virtual warehouse type as a hive. The following settings will be used for Hive Virtual Warehouse in this demo: Once created, you should see that the Virtual Warehouse enters the running state From here, select the 3 dots, and then Copy JDBC URL. In my case, this looks like this: jdbc:hive2://hs2-dbt-hive.dw-ciadev.cna2-sx9y.cloudera.site/default;transportMode=http;httpPath=cliservice;socketTimeout=60;ssl=true;retries=3; Keep this URL handy for later. Connecting dbt to Hive warehouse dbt needs a profile to connect to your data warehouse. The profile lives in a .dbt directory in your home directory and is called profiles.yml. On Unix-based systems, this would look like ~/.dbt/profiles.yml. 
If you haven't used dbt before, create the directory with the command: mkdir ~/.dbt and create the profiles.yml file with your favourite text editor. You can learn more about the dbt profile from the dbt docs here. Use the following template for the contents of the file: dbt_hive_demo: outputs: dev: type: hive schema: <schema_name> dbname: <db_name> auth_type: ldap host: hs2-dbt-hive.dw-ciadev.cna2-sx9y.cloudera.site threads: <no_of_threads> user: srv_cia_test_user password: Password123! target: dev First, add your Workload User Name and Workload Password to the “user” and “password” fields. Next, add the “host” field in the template we noted earlier from CDP UI The sample completed profile looks like this: dbt_hive_demo: outputs: dev: type: hive schema: dbt_hive_demo dbname: dbt_hive_demo auth_type: ldap host: hs2-dbt-hive.dw-ciadev.cna2-sx9y.cloudera.site threads: 2 user: srv_cia_test_user password: Password123! target: dev Note: Remember to use proper syntax for dbname and schema name. Valid names should just include letters, numbers, and underscores {a-zA-z_0-9). No dashes or hyphens are allowed. To ensure we’ve configured our profile correctly, run a connection test. In the project directory, for us dbt_hive_demo run the following command: dbt debug In the output of this command, you should see the following: All checks passed similar to this. Running the demo project In the example repo, we have a sample dbt project called ‘dbt_hive_demo’. Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples of the following operations and commands: Generating fake data dbt seed dbt test dbt run Sources View models and incremental table models More commands and a detailed description of each command can be found here. Generate fake data Move to the util/data_gen folder inside dbt-hive-example and run the following command: python generate_data.py --days 2 --start-date 2022-01-01 This generates 2 days of fake data for the dates 01/01/2022 and 02/01/2022. You can modify the parameters like days and start-date to generate more fake data. Generated two files raw_covid__cases.csv and raw_covid__vaccines.csv can be found in the util/data_gen/data/ folder. Copy these two files to the seeds folder inside the dbt_hive_demo dbt project directory. Loading fake data To run the seeds, move to our dbt project dbt_hive_demo inside our cloned project dbt-hive-demo and run the following command: dbt seed The terminal should look like this: This will load the reference data into the warehouse. User can visualize this through hue UI Running tests: Our Seeds are configured with a couple of tests. Users can read more about it here. We also have a custom test created in dbt_hive_demo/tests/generic/test_length.sql Description: This test is used to check the character length of a column. Our reference data has columns that include ISO Alpha2 and Alpha3 country codes - we know that these columns should always be 2 or 3 columns respectively. To ensure that our reference data is high quality, we can use dbt to test these assumptions and report the results. We expect that Alpha2 and Alpha3 columns are the correct lengths and that no fields should be null. Run the following command to tests: dbt test The terminal should look something like this. Don’t be concerned about the errors. Running models: We have 3 sets of models in this demo project. Firstly, we have raw. Our raw models make use of Sources. 
This is data that already exists in our database that dbt needs to refer to. This is the fake data we loaded earlier. Our raw models are defined in models/raw/covid Next, we have staging. These are Models. Our staging models use the source() method to refer to the Sources we defined in our raw models. The staging models are intermediate views created over our raw data to handle some basic type conversion. These are materialized as views, and we don't expect our end users to query the staging models. Our staging models are defined in models/staging/covid Lastly, we have mart. These are Models. Our mart models use the ref() method to refer to the staging models and reference seeds we created using dbt. We use the staging views to handle most of the logic for type casting, but we do some renaming here to make our models easier for users to understand. These models are materialized as tables, as this gives greater performance for user queries. We can use incremental models to make the building of the model more performant. Our mart models are defined in models/mart/covid Run models with the following commands: dbt run The terminal should look like this: Reviewing the data and models Reviewing the database and table in Hue UI User should now have the following databases & tables: reference (database) ref__population (table) ref__country_codes (table) raw_covid (database) raw_covid__vaccines (table) raw_covid__cases (table) staging_covid (database) stg_covid__cases (view) mart_covid (database) covid_cases (table) In the raw, staging, and mart tables, you should see 2 days' worth of data for the dates 01/01/2022 and 02/01/2022. You can generate more fake data and play around. Generating and viewing dbt documentation: Run the following command to generate the documentation: dbt docs generate The terminal should look like this: Users can read more about dbt docs here Run the following command to generate the documentation: dbt docs serve --port 8001 The dbt docs serve command is only intended for local/development hosting of the documentation site. You can view it locally through http://localhost:8001/ or the port number you used to create the dbt docs serve. The terminal should look like this: Notes: dbt's documentation website was built in a way that makes it easy to host on the web. The site itself is "static", meaning that you don't need any type of "dynamic" server to serve the docs. Some common methods for hosting the docs are: Host on S3 Publish on Netlify Spin up a web server like Apache/Nginx Sample dbt docs site can be viewed here
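For the "Host on S3" option above, a minimal sketch is shown below; it assumes the AWS CLI is configured, and the bucket name is a placeholder for a bucket you have already set up for static website hosting.
# Regenerate the static docs site, then sync the target/ folder to S3
# (bucket name below is a placeholder)
dbt docs generate
aws s3 sync target/ s3://my-dbt-docs-bucket/dbt_hive_demo/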
07-21-2022
01:29 PM
Overview: This handbook focuses on Cloudera’s new adapter dbt-spark-cde that enables users to use dbt on top of Apache Spark by leveraging the use of CDE. Cloudera Data Engineering (CDE) is a serverless service for Cloudera Data Platform that allows you to submit Spark jobs to an auto-scaling cluster. Links to a dbt example project that can be used to bootstrap a user’s dbt journey can be found below. Version Notes: The adapter has been tested on the following version: python: 3.9 Cloudera Data Engineering release (1.15-h1) dbt-core: 1.3.0 Installing dbt and requirements Prerequisites: Make sure python3, git, and pip are available on your system. Cloning the repo: Run the following command in your terminal to clone the example project repository: git clone https://github.com/cloudera/dbt-spark-cde-example/ Installing the packages: Run the following commands in your terminal to install the required packages/dependencies using pip/pip3: pip install dbt-core pip install dbt-spark-cde CDE and Spark For our development and testing, we used Spark included in the Cloudera Data Engineering service. If you have not enabled the service, you can enable the service by referring to this link. https://docs.cloudera.com/data-engineering/cloud/enable-data-engineering/topics/cde-enable-data-engineering.html Create Cloudera Data Engineering Cluster Once you have an environment with CDE enabled the next step is to create a virtual cluster that can be used to run the spark jobs. You may create as many virtual clusters as you want. To create a virtual cluster you can refer to the link https://docs.cloudera.com/data-engineering/cloud/manage-clusters/topics/cde-create-cluster.html Once you have the CDE environment enabled and virtual cluster setup you should see something similar to this in your CDE overview page Connecting to Cloudera Data Engineering (CDE) from the adapter. To work with CDE we need to follow two steps. Create a machine user to run the queries. Cloudera Data Engineering uses JSON Web Tokens (JWT) for API authentication. To interact with a virtual cluster using the API, you must obtain an access token for that cluster. Step 1. Creating a machine user For this demo, we will use a Machine User, created inside CDP as the user running queries. Inside CDP > User Management, add a new Machine User and set the workload password. The steps for this are documented here: Creating a Machine User Setting the workload password With the user created and the workload password set, take a note of the Workload User Name & Workload Password. Notice in the below screenshot, that for a Machine User called ‘cia_test_user’ the workload username is ‘srv_cia_test_user’. Note: Keep the Workload User Name & Workload Password details handy for later. Step 2. Noting down endpoints for generating auth token. Navigate to the Cloudera Data Engineering Overview page by clicking the Data Engineering tile in the Cloudera Data Platform (CDP) management console. In the Services column, select the environment containing the virtual cluster you want to interact with. In the Virtual Clusters column on the right, click the Cluster Details icon(highlighted in red) on the virtual cluster you want to interact with. Click the link next to GRAFANA CHARTS. The hostname of the URL before /grafana/d… in your browser is the base URL. E.g. https://service.cde-cb6tw6r7.ciadev.cna2-sx9y.cloudera.site/ is the base URL or authend_endpoint. Note it down. Next, click the link next to the jobs API URL. This is our host endpoint. Note it down. E.g. 
https://x9b4ppfb.cde-cb6tw6r7.ciadev.cna2-sx9y.cloudera.site/dex/api/v1 Connecting dbt to Spark-CDE dbt needs a profile to connect to your data warehouse. The profile lives in a .dbt directory in your home directory and is called profiles.yml. On Unix-based systems, this would look like ~/.dbt/profiles.yml. If you haven't used dbt before, create the directory with the command: mkdir ~/.dbt and create the profiles.yml file with your favorite text editor. You can learn more about the dbt profile from the dbt docs here. Use the following template for the contents of the file: dbt_spark_cde_demo: outputs: dev: type: spark_cde method: cde schema: <schemaname> dbname: <dbname> host: <hostname> threads: <no of threads> user: <username> password: <password> auth_endpoint: https://service.cde-cb6tw6r7.ciadev.cna2-sx9y.cloudera.site/ target: dev NOTE: After you paste API URL in front of host, add a forward slash '/', else it will cause error. First, add your Workload User Name and Workload Password to the “user” and “password” fields. (refer to creating machine user) Next, add the auth_endpoint and host. (refer to endpoints for generating auth token) The sample completed profile looks like this: dbt_spark_cde_demo: outputs: dev: type: spark_cde method: cde schema: spark_cde_demo host: https://x9b4ppfb.cde-cb6tw6r7.ciadev.cna2-sx9y.cloudera.site/dex/api/v1/ user: srv_cia_test_user password: Password123! auth_endpoint: https://service.cde-cb6tw6r7.ciadev.cna2-sx9y.cloudera.site/ target: dev Passing the CDE session parameters with profiles.yml With our dbt-spark-cde 1.1.3 adapter, now user can pass the additional parameters as KEY:VALUE pair. A sample profiles.yml with additional parameters will look like this: dbt_spark_cde_demo: outputs: dev: type: spark_cde method: cde schema: spark_cde_demo host: https://x9b4ppfb.cde-cb6tw6r7.ciadev.cna2-sx9y.cloudera.site/dex/api/v1/ user: srv_cia_test_user password: Password123! auth_endpoint: https://service.cde-cb6tw6r7.ciadev.cna2-sx9y.cloudera.site/ cde_session_parameters: fs.s3.canned.acl: BucketOwnerFullControl fs.s3a.acl.default: BucketOwnerFullControl spark.hadoop.fs.s3a.acl.default: BucketOwnerFullControl fs.s3a.server-side-encryption-algorithm: AES256 fs.s3.server-side-encryption-algorithm: AES256 spark.kerberos.access.hadoopFileSystems: s3a://dev-dl-services-e1-us-east-1-509747 [some-other-key]: [value] target: dev Passed parameters are visible in CDE jobs UI. dbt debug In the output of this command, you should see the following: All checks passed similar to this. Running the demo project In the example repo, we have a sample dbt project called ‘dbt_spark_demo’. Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples of the following operations and commands: Generating fake data dbt seed dbt test dbt run More commands and a detailed description of each command can be found here. Generate fake data Move to the util/data_gen folder inside dbt-spark-example and run the following command: python generate_data.py --days 2 --start-date 2022-01-01 This generates 2 days of fake data for the dates 01/01/2022 and 02/01/2022. You can modify the parameters like days and start-date to generate more fake data. Generated two files raw_covid__cases.csv and raw_covid__vaccines.csv can be found in the util/data_gen/data/ folder. Copy these two files to the seeds folder inside the dbt_spark_cde_demo dbt project directory. 
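For example, assuming you are at the root of the cloned dbt-spark-cde-example repository, the copy step might look like this (adjust the paths if your checkout layout differs):
# Copy the generated CSVs into the demo project's seeds folder
cp util/data_gen/data/raw_covid__cases.csv \
   util/data_gen/data/raw_covid__vaccines.csv \
   dbt_spark_cde_demo/seeds/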
Loading fake data To run the seeds, move to our dbt project dbt_spark_cde_demo inside our cloned project dbt-spark-cde-example and run the following command: dbt seed The terminal should look like this: This will load the reference data into the warehouse. Running tests: Our Seeds are configured with a couple of tests. Users can read more about it here. We also have a custom test created in dbt_spark_cde_demo/tests/generic/test_length.sql Description: This test is used to check the character length of a column. Our reference data has columns that include ISO Alpha2 and Alpha3 country codes - we know that these columns should always be 2 or 3 columns respectively. To ensure that our reference data is high quality, we can use dbt to test these assumptions and report the results. We expect that Alpha2 and Alpha3 columns are the correct lengths and that no fields should be null. Run the following command to tests: dbt test The terminal should look something like this. Don’t be concerned about the errors: Running models: We have 3 sets of models in this demo project. Firstly, we have raw. Our raw models make use of Sources. This is data that already exists in our database that dbt needs to refer to. This is the fake data we loaded earlier. Our raw models are defined in models/raw/covid Next, we have staging model. These are Models. Our staging models use the source() method to refer to the Sources we defined in our raw models. The staging models are intermediate views created over our raw data to handle some basic type conversion. These are materialized as views, and we don't expect our end users to query the staging models. Our staging models are defined in models/staging/covid Lastly, we have mart. These are Models. Our mart models use the ref() method to refer to the staging models and reference seeds we created using dbt. We use the staging views to handle most of the logic for type casting, but we do some renaming here to make our models easier for users to understand. These models are materialized as tables, as this gives greater performance for user queries. We can use incremental models to make the building of the model more performant. Our mart models are defined in models/mart/covid Run models with the following commands dbt run The terminal should look like this: Reviewing the data and models Reviewing the database and table in Hue UI User should now have the following databases & tables: reference (database) ref__population (table) ref__country_codes (table) dbt_spark_cde_demo_raw_covid (database) raw_covid__vaccines (table) raw_covid__cases (table) staging_covid (database) stg_covid__cases (view) mart_covid (database) covid_cases (table) In the raw, staging, and mart tables, you should see 2 days' worth of data for the dates 01/01/2022 and 02/01/2022. You can generate more fake data and play around. Generating and viewing dbt documentation: Run the following command to generate the documentation: dbt docs generate The terminal should look like this: Users can read more about dbt docs here Run the following command to open the documentation on the browser: dbt docs serve –port 8001 The dbt docs serve command is only intended for local/development hosting of the documentation site. You can view it locally through http://localhost:8001/ or the port number you used to create the dbt docs serve. The terminal should look like this: Notes: dbt's documentation website was built in a way that makes it easy to host on the web. 
The site itself is "static", meaning that you don't need any type of "dynamic" server to serve the docs. Sample dbt docs site can be viewed here. Some common methods for hosting the docs are: 1. Host on S3 2. Publish on Netlify 3. Spin up a web server like Apache/Nginx Debugging dbt job: We have made code improvement to offer user friendly errors and logs. Any dbt profiles errors or connection issues using dbt commands will show a user friendly messages: Details for each query is now available in the logs like this sample file dbt.log: Such detailed log will help user to analyze the time taken by each job to execute the query with CDE.
07-11-2022
04:13 PM
Overview This article focuses on Cloudera’s new adapter dbt-spark-livy that enables users to use dbt on top of Apache Spark by connecting to it using Apache Livy. Livy is a REST web service for managing long-running Spark Contexts and submitting Spark jobs. Instead of running the Spark Contexts in the Server itself, Livy manages Contexts running on the cluster managed by a Resource Manager like YARN. Cloudera Data Platform provides Spark with Livy in the Datahub form factor. Links to a dbt example project that can be used to bootstrap a user’s dbt journey can be found below. Version Notes: The adapter has been tested on the following version: python: 3.9.13 Cloudera Manager CM VERSION 7.6.2 RUNTIME VERSION 7.2.15 CDP Datahub with the template - 7.2.14 - Data Engineering: Apache Spark2/Spark3 dbt-core: 1.3.0 Installing dbt and requirements Prerequisites: Make sure python3, git, and pip are available on your system. Cloning the repo: Run the following command in your terminal to clone the example project repository: git clone https://github.com/cloudera/dbt-spark-livy-example/ Installing the packages: Run the following commands in your terminal to install the required packages/dependencies using pip/pip3: pip install dbt-core pip install dbt-spark-livy CDP and Spark For our development and testing, we used Spark included in Cloudera Runtime version 7. x.x with a data hub cluster. The cluster includes Spark, Livy, Hive, Zeppelin, and Oozie, along with supporting services (HDFS, YARN, and Zookeeper). More info on data hub clusters and templates at Cloudera data hub. You need a data cluster similar to this to be able to test the adapter. Also, note that because of using the datahub cluster there is a Hive Metastore Service (HMS) running in the cluster. Spark interacts with the Hive metastore which is already configured during the creation of the datahub. All datahub clusters are attached to a data lake that runs within an environment (in our case AWS) and provides security and governance for the environment's clusters. A Data Lake storage is created when a Data Lake is first instantiated for an environment. We have AWS S3 as our storage that hive metastore interacts with and uses. Setting up Cloudera Data Platform (CDP) Spark Datahub Cluster To work with Datahub Spark in CDP we need a machine user to run the queries. Creating a machine user For this demo, we will use a Machine User, created inside CDP as the user running queries. Inside CDP > User Management, add a new Machine User and set the workload password. The steps for this are documented here: Creating a Machine User Setting the workload password With the machine user created and the workload password set, take a note of the Workload User Name & Workload Password. Notice in the below screenshot, that for a Machine User called ‘cia_test_user’ the workload username is ‘srv_cia_test_user’. Note: Keep the Workload User Name & Workload Password details handy for later. Create Cloudera Datahub Spark cluster We start by creating a datahub as documented here Note: If you already have a datahub that has spark, livy, and hive services already created you can skip the creation procedure and note down the endpoints mentioned in step 4 below. For our use case, we choose a cluster that provides Spark. We went with 7.2.14 - Data Engineering: Apache Spark3. You can choose one of the default available clusters that includes spark, livy, and hive services to create this data hub or customize your own cluster. 
Once up and running, you should have a Data Hub cluster similar to the one below.
Note: The steps are similar if you create a Data Hub providing Apache Spark (Spark v2) as the service instead of Spark3 (Spark v3).
With the Data Hub cluster in the Running state, you can view its details by clicking on it. The Livy server UI and other relevant addresses are accessible from the same page. Go to Endpoints and note down the Livy Server 3 / Livy Server endpoint, depending on the version of Spark you used to create the Data Hub cluster. This is the endpoint dbt will connect to.
Connecting dbt to Spark-Livy
dbt needs a profile to connect to your data warehouse. The profile lives in a .dbt directory in your home directory and is called profiles.yml. On Linux, this would look like ~/.dbt/profiles.yml. If you haven't used dbt before, create the directory with the command:
mkdir ~/.dbt
and create the profiles.yml file with your favorite text editor.
You can learn more about the dbt profile from the dbt docs here.
Use the following template for the contents of the file:
dbt_spark_livy_demo:
  outputs:
    dev:
      type: spark_livy
      method: livy
      schema: <schemaname>
      dbname: <dbname>
      host: <hostname>
      threads: 2
      user: <username>
      password: <password>
  target: dev
First, add your Workload User Name and Workload Password to the "user" and "password" fields. Next, set the "host" field to the Livy endpoint we noted earlier from the CDP UI (see the Create a Cloudera Datahub Spark cluster section above). The completed sample profile looks like this:
dbt_spark_livy_demo:
  outputs:
    dev:
      type: spark_livy
      method: livy
      schema: dbt_spark_livy_demo
      dbname: dbt_spark_livy_demo
      host: https://dbt-spark-2-gateway.ciadev.cna2-sx9y.cloudera.site/dbt-spark-2/cdp-proxy-api/livy/
      threads: 2
      user: srv_cia_test_user
      password: Password123!
  target: dev
Note: Because Spark uses the Hive metastore in the background, remember to use proper syntax for the dbname and schema name. Valid names should only include letters, numbers, and underscores [a-zA-Z0-9_]; dashes and hyphens are not allowed.
To ensure we've configured our profile correctly, run a connection test. In the project directory (for us, dbt_spark_livy_demo), run the following command:
dbt debug
In the output of this command, you should see "All checks passed", similar to this.
Running the demo project
In the example repo, we have a sample dbt project called 'dbt_spark_livy_demo'. Inside this demo project, we can issue dbt commands to run parts of the project. The demo project contains examples of the following operations and commands:
Generating fake data
dbt seed
dbt test
dbt run
Sources
View models & incremental table models
More commands and detailed descriptions of each command can be found here.
Generate fake data
Move to the util/data_gen folder inside dbt-spark-livy-example and run the following command:
python generate_data.py --days 2 --start-date 2022-01-01
This generates 2 days of fake data for the dates 01/01/2022 and 02/01/2022. You can modify parameters like days and start-date to generate more fake data. The two generated files, raw_covid__cases.csv and raw_covid__vaccines.csv, can be found in the util/data_gen/data/ folder. Copy these two files to the seeds folder inside the dbt_spark_livy_demo dbt project directory (the full sequence is consolidated in the sketch at the end of this section).
Loading fake data
To run the seeds, move to our dbt project dbt_spark_livy_demo inside the cloned dbt-spark-livy-example repository and run the following command:
dbt seed
The terminal should look like this:
This will load the reference data into the warehouse.
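Putting the fake-data and seed steps together, the full sequence looks roughly like this. It is a sketch assuming the repository layout described above, with util/data_gen and dbt_spark_livy_demo at the root of the clone; adjust the paths if your checkout differs:
cd dbt-spark-livy-example/util/data_gen
python generate_data.py --days 2 --start-date 2022-01-01
# copy the generated CSVs into the demo project's seeds folder
cp data/raw_covid__cases.csv data/raw_covid__vaccines.csv ../../dbt_spark_livy_demo/seeds/
cd ../../dbt_spark_livy_demo
dbt seed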
Users can visualize this through the Hue UI.
Running tests
Our seeds are configured with a couple of tests. Users can read more about them here.
We also have a custom test created in dbt_spark_livy_demo/tests/generic/test_length.sql
Description: This test is used to check the character length of a column. Our reference data has columns that include ISO Alpha2 and Alpha3 country codes; we know that these columns should always be 2 or 3 characters long, respectively. To ensure that our reference data is high quality, we can use dbt to test these assumptions and report the results. We expect that the Alpha2 and Alpha3 columns are the correct lengths and that no fields are null.
Run the following command to run the tests:
dbt test
The terminal should look something like this. Don't be concerned about the errors:
Running models
We have 3 sets of models in this demo project.
Firstly, we have raw. Our raw models make use of Sources. This is data that already exists in our database that dbt needs to refer to; it is the fake data we loaded earlier. Our raw models are defined in models/raw/covid.
Next, we have staging. These are Models. Our staging models use the source() method to refer to the Sources we defined in our raw models. The staging models are intermediate views created over our raw data to handle some basic type conversion. These are materialized as views, and we don't expect our end users to query the staging models. Our staging models are defined in models/staging/covid.
Lastly, we have mart. These are Models. Our mart models use the ref() method to refer to the staging models and the reference seeds we created with dbt. We use the staging views to handle most of the logic for type casting, but we do some renaming here to make our models easier for users to understand. These models are materialized as tables, giving greater performance for user queries. We can use incremental models to make building the models more performant. Our mart models are defined in models/mart/covid.
Run the models with the following command:
dbt run
The terminal should look like this:
Reviewing the data and models
Reviewing the databases and tables in the Hue UI, you should now have the following:
reference (database)
  ref__population (table)
  ref__country_codes (table)
raw_covid (database)
  raw_covid__vaccines (table)
  raw_covid__cases (table)
staging_covid (database)
  stg_covid__cases (view)
mart_covid (database)
  covid_cases (table)
In the raw, staging, and mart tables, you should see 2 days' worth of data for the dates 01/01/2022 and 02/01/2022. You can generate more fake data and play around.
Generating and viewing dbt documentation
Run the following command to generate the documentation:
dbt docs generate
The terminal should look like this:
Users can read more about dbt docs here.
Run the following command to serve the documentation in a browser:
dbt docs serve --port 8001
The dbt docs serve command is only intended for local/development hosting of the documentation site. You can view it locally at http://localhost:8001/ (or whichever port you passed to dbt docs serve). The terminal should look like this:
Notes: dbt's documentation website was built in a way that makes it easy to host on the web. The site itself is "static", meaning that you don't need any type of "dynamic" server to serve the docs. A sample dbt docs site can be viewed here. Some common methods for hosting the docs are:
Host on S3
Publish on Netlify
Spin up a web server like Apache/Nginx
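As one concrete illustration of the hosting options above, the generated site is just the contents of the project's target/ directory after dbt docs generate, so it can be copied to any static file host. For example, with an S3 bucket already configured for static website hosting (the bucket name below is purely illustrative):
# run from the dbt project directory after `dbt docs generate`
aws s3 sync target/ s3://<your-docs-bucket>/
A web server such as Apache or Nginx can likewise be pointed at the same target/ directory to serve the docs internally.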