Member since: 07-15-2020 | Posts: 15 | Kudos Received: 2 | Solutions: 0
06-23-2023
05:36 PM
Summary

Cloudera Data Engineering (CDE) 1.19 introduces interactive Spark sessions for development workflows, taking advantage of autoscaling compute and orchestration capabilities that are hybrid and multi-cloud ready. Since there is no one-size-fits-all approach to development, CDE interactive sessions give data engineers flexible endpoints to start developing Spark applications from anywhere: in a web-based terminal, the local CLI, a favorite IDE, and even via JDBC from third-party tools. CDE exposes sessions as first-class entities via the APIs, as well as the UI and CLI, allowing users to move seamlessly across interfaces. For example, initiate a session through the UI, start interacting with it in the web-based shell, then drop into your local terminal for a spark-shell experience. A minimal sketch of session-style exploration follows the feature list below.

Interactive Sessions Video

Complete Feature List:

Interactive Sessions (Tech Preview) - Both CLI and web-based interactive shell sessions are now supported. Users can run Python, Scala, and Java in interactive mode for exploration, development, and testing.

Airflow performance - In our latest benchmarks, Airflow workloads run 2x faster on AWS, resulting from a combination of Airflow upgrades and continued optimizations.

New Workload Regions - Hong Kong and Jakarta are now supported.

Addition of Spark 3.3 - Moving forward, CDE will support multiple versions of Spark 3. Certain versions will be designated LTS to mirror PVC Base clusters and simplify migration, starting with Spark 3.2 LTS. Note that Spark 3.3 is only supported on Data Lake version 7.2.16. Note that Spark 2.4 is now deprecated, and customers are encouraged to move to Spark 3 for better performance and longer support. Spark 2.4 will continue to receive security fixes but no new features.

Airflow support for file-based resources (Technical Preview) - Airflow now supports mounting resources. In CDE 1.19, users can mount file-based resources; future releases will extend this to include Python libraries and virtual environments. This is in Technical Preview and available through the CLI.

Spark-submit migration tool - The CLI translation tool is now available in the public cloud. Customers can download and install it on Data Hub edge nodes to start migrating jobs from Spark on Data Hub to Spark on CDE.

Profiles for CDE CLI - Configure the CLI to easily toggle between different virtual clusters and CDE services.

Additional Links

1.19 Release notes can be found here. Pricing updates (note: while in Tech Preview we will not charge the higher price).
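As a minimal sketch of the kind of exploratory work interactive sessions are meant for (assuming a PySpark session that already provides a SparkSession named spark, as PySpark shells typically do; the file path and column names are purely illustrative), the same code works whether it is typed into the web-based shell or a session attached from the local CLI:

```python
# Minimal exploratory PySpark snippet for an interactive session.
# Assumes the session provides a SparkSession named `spark` (standard in
# PySpark shells); the sample path and columns below are illustrative only.
df = spark.read.option("header", "true").csv("/tmp/sample_sales.csv")

# Quick look at the schema and a few rows before building a full pipeline.
df.printSchema()
df.show(5)

# Iteratively refine a transformation, re-running it as you adjust it.
from pyspark.sql import functions as F

daily = (
    df.groupBy("sale_date")
      .agg(F.sum("amount").alias("total_amount"))
      .orderBy("sale_date")
)
daily.show(10)
```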
07-27-2022
07:34 PM
Key features for this release

Airflow stability and performance - With this release we now use the latest stable Airflow release, 2.2.5. In conjunction with service enhancements, our testing indicates improved stability under higher DAG and task loads. This allows DAG counts to reach up to 1000 per Virtual Cluster, combined with a concurrency of 200-300 parallel Airflow tasks within the same service.

Pre-loaded sample jobs and data for new users - To help new users self-learn CDE, all new Virtual Clusters will have the option to load example Airflow and Spark jobs accompanied by sample data.

Spark 3 support for raw Scala code - Previously this feature was limited to Spark 2; it is now extended to Spark 3 based Virtual Clusters. This allows users to directly run raw Scala via API and CLI in batch mode without having to compile, similar to what spark-shell supports.

Pipeline UI editor for Airflow is now GA, with support for all major browsers (Firefox, Chrome, and Safari). New Virtual Clusters will have this feature enabled by default.

Azure Private Storage is now supported. New private storage options will appear in the Service creation wizard on Azure.

Editing Virtual Cluster configurations post-creation allows adjusting CPU and memory quotas without having to recreate the VC.

[Technical Preview] In-place upgrades are now supported for CDE services 1.14 and higher, on both AWS and Azure. Note: This feature is behind an entitlement.

For more details, please refer to the 1.16 Release notes here.
02-17-2022
04:58 PM
Apache Iceberg is a new open table format targeted at petabyte-scale analytic datasets. It has been designed and developed as an open community standard to ensure compatibility across languages and implementations. The integration of Iceberg with CDP's multi-function analytics and multi-cloud platform provides a unique solution that future-proofs the data architecture for new and existing Cloudera customers. Users can define and manipulate datasets with SQL commands, build complex data pipelines using features like Time Travel operations, and deploy machine learning models built from Iceberg tables. By the end of the month, besides using Iceberg with ETL/ELT workloads, we will extend multi-analytic workloads to Iceberg tables in Cloudera Data Warehouse (CDW) with Hive and Impala for interactive BI and SQL analytics. This feature is in Preview and available on new CDE services only. Learn more here
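As a hedged sketch of what the SQL-and-Time-Travel workflow looks like from Spark (the database and table names are illustrative placeholders, and the exact Iceberg catalog configuration depends on the CDE/CDP environment):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

# Define an Iceberg table with plain SQL (table name is illustrative).
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.orders (
        order_id BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP
    ) USING iceberg
""")

# Each write produces a new snapshot that Time Travel can query later.
spark.sql("INSERT INTO db.orders VALUES (1, 19.99, current_timestamp())")

# List the snapshots Iceberg has recorded for this table.
snapshots = spark.sql("SELECT snapshot_id, committed_at FROM db.orders.snapshots")
snapshots.show(truncate=False)

# Time Travel: read the table as of a specific snapshot id.
first_snapshot = snapshots.first()["snapshot_id"]
old_view = (
    spark.read.option("snapshot-id", first_snapshot)
         .format("iceberg")
         .load("db.orders")
)
old_view.show()
```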
10-18-2021
05:07 PM
With the Cloudera Data Engineering (CDE) Pipeline authoring UI, any CDE user, irrespective of their level of Airflow expertise, can create multi-step pipelines with a combination of out-of-the-box operators (CDEOperator, CDWOperator, BashOperator, PythonOperator). More advanced users can still continue to deploy their own custom Airflow DAGs (Directed Acyclic Graphs) as before, or use the Pipeline authoring UI to bootstrap their projects for further customization. And once pipelines have been developed through the UI, they are deployed and operationally managed through the same best-in-class APIs and job life-cycle management that users have come to expect from CDE. Airflow has been adopted by many Cloudera Data Platform (CDP) customers in the public cloud as the next generation orchestration service to set up and operationalize complex pipelines. Until now, the setup of such pipelines still required knowledge of Airflow and the associated Python configuration. As a result, users tended to limit their pipeline deployments to basic time-based scheduling of Spark jobs, and steered away from the more complex multi-step pipelines that are typical of data engineering workflows. The CDE Pipeline authoring UI abstracts away those complexities, making multi-step pipeline development self-service and point-and-click driven, and providing an easier path than before to developing, deploying, and operationalizing true end-to-end data pipelines. This feature is in Preview and available on new CDE services only. When creating a Virtual Cluster, a new option allows you to enable the Airflow authoring UI.
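For comparison, a hand-written multi-step DAG of the kind the authoring UI now assembles point-and-click might look like the following. This is a minimal sketch using only stock Airflow operators (BashOperator, PythonOperator); the CDE- and CDW-specific operators mentioned above are provided by the CDE environment and plug into a DAG the same way, but their exact import paths vary by version and are not shown here.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def validate_output(**_):
    # Placeholder validation step; a real pipeline would check row counts,
    # schemas, or data-quality rules here.
    print("validating pipeline output")


with DAG(
    dag_id="multi_step_pipeline_sketch",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = BashOperator(
        task_id="stage_raw_files",
        bash_command="echo 'staging raw files'",
    )
    validate = PythonOperator(
        task_id="validate_output",
        python_callable=validate_output,
    )
    notify = BashOperator(
        task_id="notify_downstream",
        bash_command="echo 'pipeline complete'",
    )

    # Dependencies define the multi-step pipeline order.
    stage >> validate >> notify
```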
10-18-2021
04:57 PM
Cloudera Data Engineering (CDE) in CDP Public Cloud introduces a new bin-packing auto-scaling policy for more efficient resource allocation and cost reduction. Customers with bursty data pipelines and overlapping schedules can reap the cost benefits of faster scale-down through the use of the new bin-packing policy. The bin-packing policy is appropriate in situations where there is a mix of regular and bursty workloads deployed with some scheduling overlap. Usually, as nodes scale up to meet the demand of a workload burst, scale-down is slow because new jobs are distributed across the new nodes, preventing them from being freed up. Although this helps to reduce hot spots, it leads to underutilization across a large number of nodes. With bin-packing, as new jobs come online they are more efficiently allocated (i.e., "bin-packed") onto a smaller subset of nodes, freeing the majority of the new nodes that were added. This feature is available by default. Learn more here.
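Purely as a conceptual illustration of why packing helps scale-down (this is a toy first-fit placement versus round-robin spreading, not CDE's actual scheduler; capacities and job sizes are arbitrary):

```python
# Toy comparison of "spread" vs. "bin-packed" job placement.
NODE_CAPACITY = 8          # CPU cores per node (illustrative)
jobs = [2, 2, 3, 1, 2, 2]  # cores requested by incoming jobs


def spread(jobs, num_nodes):
    """Round-robin jobs across all nodes, keeping every node busy."""
    nodes = [0] * num_nodes
    for i, j in enumerate(jobs):
        nodes[i % num_nodes] += j
    return nodes


def bin_pack(jobs, num_nodes):
    """First-fit: place each job on the first node with enough free capacity."""
    nodes = [0] * num_nodes
    for j in jobs:
        for i in range(num_nodes):
            if nodes[i] + j <= NODE_CAPACITY:
                nodes[i] += j
                break
    return nodes


print("spread    :", spread(jobs, 4))    # [4, 4, 3, 1] -> all four nodes stay busy
print("bin-packed:", bin_pack(jobs, 4))  # [8, 4, 0, 0] -> two nodes idle, free to scale down
```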
10-18-2021
04:55 PM
As a key component of Cloudera Data Engineering (CDE), Apache Airflow has served as a flexible orchestration service for data engineers and practitioners to develop and operationalize end-to-end data pipelines. Today, many customers use the managed Airflow service to avoid the administrative overhead of maintaining and tuning their own full-fledged scheduler. Instead, they rely on the out-of-the-box security and autoscaling compute enabled in CDE to deploy tens to hundreds of DAGs (Directed Acyclic Graphs) using CDE's job management APIs. And with integration with CDP data services, pipelines can flexibly tap into the efficient containerized compute of Spark in CDE and Hive in Cloudera Data Warehouse (CDW). With Airflow 2.1 as the new default managed scheduler, customers can continue to rely on the low administrative overhead they have come to expect, while users reap the benefits of the latest developments in the upstream community. As with any major release, many aspects of Airflow have been enhanced, including: scheduler speedups of up to 17x, a more organized way of grouping tasks through task groups, a full UI refresh, and a new way of writing DAGs using the TaskFlow API. Airflow 2.1 as part of CDE comes with governance, security, and compute autoscaling enabled out of the box, along with integration with CDE's job management APIs, giving users the flexibility to deploy custom DAGs that tap into Cloudera Data Platform (CDP) data services like Spark in CDE and Hive in CDW.
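For readers new to the TaskFlow API, here is a minimal sketch of the new DAG-authoring style. This is standard upstream Airflow 2.x syntax; the DAG name, schedule, and task bodies are illustrative only:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="@daily", start_date=datetime(2021, 10, 1), catchup=False)
def taskflow_sketch():
    @task
    def extract():
        # A real pipeline would pull records from a source system here.
        return [1, 2, 3]

    @task
    def transform(values):
        return sum(values)

    @task
    def load(total):
        print(f"loaded total: {total}")

    # Dependencies and XCom passing are inferred from the function calls.
    load(transform(extract()))


dag_instance = taskflow_sketch()
```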
09-21-2021
03:10 PM
Cloudera Data Engineering (CDE) now supports multi-version Spark pipelines. Users can easily test and promote Spark 2 workloads to Spark 3 to take advantage of the performance and stability improvements in the latest version of Spark (a performance improvement of over 30% based on internal TPC-DS benchmarks). Data engineers can run workloads in both Spark 2 and Spark 3 within the same CDP Public Cloud environment, maintaining backwards compatibility with legacy workloads while developing new applications on the latest version of Spark. Administrators have a new option within the Virtual Cluster creation wizard to choose a Spark version. Once up and running, users can seamlessly transition to deploying their Spark 3 jobs through the same UI and CLI/API as before, with comprehensive monitoring of their pipelines, including real-time logs and the Spark UI. To learn more, visit the documentation.
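When promoting a job from Spark 2 to Spark 3, one lightweight pattern (a hedged sketch, not a CDE-specific API) is to guard version-sensitive behavior on the runtime version, so the same application file can be tested on both Virtual Cluster types during migration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-aware-job").getOrCreate()

# spark.version is a string such as "2.4.8" or "3.2.1".
major = int(spark.version.split(".")[0])

if major >= 3:
    # Spark 3 changed default date/time parsing; opt into legacy parsing
    # only if older date-partitioned data requires it.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
else:
    print("Running on Spark 2; skipping Spark 3 specific settings.")

df = spark.range(10)
df.show()
```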
07-19-2021
09:46 AM
1 Kudo
Although not available in CDP Base, Airflow is provided through our Data Engineering Experience (CDE) within Public Cloud and Private Cloud. https://docs.cloudera.com/data-engineering/cloud/manage-jobs/topics/cde-airflow-dag-pipeline.html https://www.cloudera.com/products/data-engineering.html?tab=0
07-12-2021
11:52 AM
Administrators of Cloudera Data Engineering (CDE) on CDP Public Cloud now have access to real-time as well as historical diagnostic log bundles without the need to access S3 or the Kubernetes APIs. Bundles are streamed directly to the user's local machine, pre-packaged as zip files that can easily be uploaded to support tickets. Previously, administrators of CDE had to use kubectl to access service pods during troubleshooting, requiring privileged access and manual steps. Now, within each CDE service, two types of bundles are available, providing different metrics about the health of the underlying Kubernetes cluster and service applications. The first bundle is a summary snapshot of the state of active pods. The second allows the administrator to choose a time period along with specific service applications (such as the cluster autoscaler) to extract historical logs from. To learn more, visit the documentation.
06-11-2021
02:20 PM
Cloudera Data Engineering (CDE) now supports multi-version Spark pipelines. Users can easily test and promote Spark 2 workloads to Spark 3 to take advantage of the performance and stability improvements in the latest version of Spark (a performance improvement of over 30% based on internal TPC-DS benchmarks). Data engineers can run workloads in both Spark 2 and Spark 3 within the same CDP Public Cloud environment, maintaining backwards compatibility with legacy workloads while developing new applications on the latest version of Spark. Administrators have a new option within the Virtual Cluster creation wizard to choose a Spark version. Once up and running, users can seamlessly transition to deploying their Spark 3 jobs through the same UI and CLI/API as before, with comprehensive monitoring of their pipelines, including real-time logs and the Spark UI. A future release will include the visual performance profiler in the Spark 3 job run details. To learn more, visit the documentation.