07-27-2022
07:34 PM
Key features for this release:
Airflow stability and performance. With this release we now use the latest stable Airflow release, 2.2.5. In conjunction with service enhancements, our testing indicates improved stability under higher DAG and task loads. DAG counts can now reach up to 1,000 per Virtual Cluster, combined with a concurrency of 200-300 parallel Airflow tasks within the same service.
Pre-loaded sample jobs and data for new users. To help new users self-learn CDE, all new Virtual Clusters will have the option to load example Airflow and Spark jobs accompanied by sample data.
Spark 3 support for raw Scala code. Previously this feature was limited to Spark 2; it is now extended to Spark 3 based Virtual Clusters. Users can run raw Scala directly via the API and CLI in batch mode without having to compile, similar to what spark-shell supports.
Pipeline UI editor for Airflow is now GA, with support for all major browsers (Firefox, Chrome, and Safari). New Virtual Clusters will have this feature enabled by default.
Azure private storage is now supported. New private storage options will appear in the Service creation wizard on Azure.
Editing Virtual Cluster configurations post-creation allows adjusting CPU and memory quotas without having to recreate the VC.
[Technical Preview] In-place upgrades are now supported for CDE services 1.14 and higher, on both AWS and Azure. Note: This feature is behind an entitlement. For more details, please refer to the 1.16 Release notes here.
04-19-2022
01:07 AM
Hello @SVK, if your queries concerning Apache Airflow have been addressed, feel free to mark the post as Solved. If you have any further questions, kindly share them and we shall get back to you accordingly. Regards, Smarak
02-17-2022
04:58 PM
Apache Iceberg is a new open table format targeted at petabyte-scale analytic datasets. It has been designed and developed as an open community standard to ensure compatibility across languages and implementations. The integration of Iceberg with CDP's multi-function analytics and multi-cloud platform provides a unique solution that future-proofs the data architecture for new and existing Cloudera customers. Users can define and manipulate datasets with SQL commands, build complex data pipelines using features like Time Travel operations, and deploy machine learning models built from Iceberg tables. By the end of the month, besides using Iceberg with ETL/ELT workloads, we will extend multi-analytic workloads to Iceberg tables in Cloudera Data Warehouse (CDW) with Hive and Impala for interactive BI and SQL analytics. This feature is in Preview and available on new CDE services only. Learn more here.
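For readers who want to see what Time Travel looks like in practice, here is a minimal PySpark sketch of reading an Iceberg table as of an earlier snapshot. The table name and snapshot values are hypothetical, and the read options follow the upstream Apache Iceberg Spark documentation rather than anything CDE-specific:

from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Iceberg runtime
# and catalog, as the CDE preview would provide.
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Current state of a hypothetical Iceberg table.
current_df = spark.read.format("iceberg").load("db.flights")

# Time Travel: read the same table as of an earlier snapshot ID, or as
# of a point in time (milliseconds since epoch).
by_snapshot = (
    spark.read.format("iceberg")
    .option("snapshot-id", 10963874102873)  # hypothetical snapshot ID
    .load("db.flights")
)
by_timestamp = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", 1643673600000)  # 2022-02-01 00:00:00 UTC
    .load("db.flights")
)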
10-18-2021
05:07 PM
With the Cloudera Data Engineering (CDE) Pipeline authoring UI, any CDE user, irrespective of their level of Airflow expertise, can create multi-step pipelines with a combination of out-of-the-box operators (CDEOperator, CDWOperator, BashOperator, PythonOperator). More advanced users can continue to deploy their own custom Airflow DAGs (Directed Acyclic Graphs) as before, or use the Pipeline authoring UI to bootstrap their projects for further customization. Once a pipeline has been developed through the UI, it is deployed and operationally managed through the same best-in-class APIs and job life-cycle management that users have come to expect from CDE. Airflow has been adopted by many Cloudera Data Platform (CDP) customers in the public cloud as the next-generation orchestration service to set up and operationalize complex pipelines. Until now, the setup of such pipelines still required knowledge of Airflow and the associated Python configuration. As a result, users tended to limit their pipeline deployments to basic time-based scheduling of Spark jobs and steered away from the more complex multi-step pipelines that are typical of data engineering workflows. The CDE Pipeline authoring UI abstracts away those complexities, making multi-step pipeline development self-service and point-and-click driven, and providing an easier path than before to developing, deploying, and operationalizing true end-to-end data pipelines. This feature is in Preview and available on new CDE services only. When creating a Virtual Cluster, a new option allows you to enable the Airflow authoring UI.
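To illustrate what a custom DAG targeting CDE looks like, here is a minimal sketch. The import path and operator name follow Cloudera's documented CDEJobRunOperator, but the job names are hypothetical and the details should be verified against the docs for your CDE version:

from datetime import datetime

from airflow import DAG
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

with DAG(
    dag_id="example_multi_step_pipeline",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each operator triggers an existing CDE Spark job by name
    # (job names here are hypothetical).
    ingest = CDEJobRunOperator(task_id="ingest", job_name="ingest-spark-job")
    transform = CDEJobRunOperator(task_id="transform", job_name="transform-spark-job")

    ingest >> transform  # run transform only after ingest succeeds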
10-18-2021
04:57 PM
Cloudera Data Engineering (CDE) in CDP Public Cloud introduces a new bin-packing auto-scaling policy for more efficient resource allocation and cost reduction. Customers with bursty data pipelines with overlapping schedules can reap the cost benefits of faster scale-down through the new bin-packing policy. The bin-packing policy is appropriate in situations where there is a mix of regular and bursty workloads deployed with some scheduling overlap. Usually, as nodes scale up to meet the demand of a workload burst, scale-down is slow because new jobs are distributed across the new nodes, preventing them from being freed up. Although this helps to reduce hot spots, it leads to underutilization across a large number of nodes. With bin-packing, as new jobs come online they are more efficiently allocated (i.e., "bin-packed") onto a smaller subset of nodes, freeing the majority of the new nodes that were added. This feature is available by default. Learn more here.
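To make the scheduling idea concrete, here is a toy first-fit sketch of bin-packing in Python. This is purely illustrative and not CDE's actual scheduler logic: the point is that packing jobs onto the first node with spare capacity, rather than spreading them evenly, leaves whole nodes idle so they can be scaled down sooner:

def first_fit_assign(jobs, nodes, capacity):
    # Toy first-fit bin-packing: place each job on the first node with
    # enough spare capacity instead of spreading jobs across all nodes.
    usage = {node: 0 for node in nodes}
    placement = {}
    for job, demand in jobs:
        for node in nodes:
            if usage[node] + demand <= capacity:
                usage[node] += demand
                placement[job] = node
                break
    # Nodes with zero usage can be released immediately on scale-down.
    idle = [n for n, u in usage.items() if u == 0]
    return placement, idle

placement, idle = first_fit_assign(
    jobs=[("job-a", 4), ("job-b", 2), ("job-c", 2)],
    nodes=["node-1", "node-2", "node-3"],
    capacity=8,
)
print(placement)  # all three jobs packed onto node-1
print(idle)       # node-2 and node-3 stay idle and can be freed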
10-18-2021
04:55 PM
As a key component of Cloudera Data Engineering (CDE), Apache Airflow has served as a flexible orchestration service for data engineers and practitioners to develop and operationalize end-to-end data pipelines. Today, many customers use the managed Airflow service to avoid the administrative overhead of maintaining and tuning their own full-fledged scheduler. Instead, they rely on the out-of-the-box security and autoscaling compute enabled in CDE to deploy tens to hundreds of DAGs (Directed Acyclic Graphs) using CDE's job management APIs. And with integration with CDP data services, pipelines can flexibly tap into the efficient containerized compute of Spark in CDE and Hive in Cloudera Data Warehouse (CDW). With Airflow 2.1 as the new default managed scheduler, customers can continue to rely on the low administrative overhead they have come to expect, while users reap the benefits of the latest developments in the upstream community. As with any major release, many aspects of Airflow have been enhanced, including: a scheduler speedup of up to 17x, a more optimized method for organizing tasks through task groups, a full UI refresh, and a new way of writing DAGs using the TaskFlow API. Airflow 2.1 as part of CDE comes with governance, security, and compute autoscaling enabled out of the box, along with integration with CDE's job management APIs, giving users the flexibility to deploy custom DAGs that tap into Cloudera Data Platform (CDP) data services like Spark in CDE and Hive in CDW.
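As a taste of the new DAG-writing style, here is a minimal TaskFlow API example. This is generic upstream Airflow 2 syntax, not CDE-specific code:

from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2021, 10, 1), catchup=False)
def taskflow_example():
    @task
    def extract():
        return {"rows": 42}

    @task
    def load(payload: dict):
        # XCom passing between tasks is handled implicitly by TaskFlow.
        print(f"loaded {payload['rows']} rows")

    load(extract())

example_dag = taskflow_example()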
09-21-2021
03:10 PM
Cloudera Data Engineering (CDE) now supports multi-version Spark pipelines. Users can easily test and promote Spark 2 workloads to Spark 3 to take advantage of the performance and stability improvements in the latest version of Spark (a performance improvement of over 30% based on internal TPC-DS benchmarks). Data engineers can run workloads in both Spark 2 and Spark 3 within the same CDP Public Cloud environment, maintaining backwards compatibility with legacy workloads while developing new applications on the latest version of Spark. Administrators have a new option within the Virtual Cluster creation wizard to choose a Spark version. Once up and running, users can seamlessly transition to deploying their Spark 3 jobs through the same UI and CLI/API as before, with comprehensive monitoring of their pipelines, including real-time logs and the Spark UI. To learn more, visit the documentation.
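One small, practical check when promoting a job between Virtual Clusters is to log the Spark version the job actually runs on. This is plain PySpark, usable in both Spark 2 and Spark 3 jobs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("version-check").getOrCreate()

# Confirms which Spark runtime the Virtual Cluster assigned to this job,
# e.g. "2.4.x" on a Spark 2 VC or "3.x" on a Spark 3 VC.
print(spark.version)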
07-12-2021
11:52 AM
Administrators of Cloudera Data Engineering (CDE) on CDP Public Cloud now have access to real-time as well as historical diagnostic log bundles without the need to access S3 or Kubernetes APIs. Bundles are streamed directly to the user's local machine, pre-packaged as zip files that can easily be uploaded to Support tickets. Previously, administrators of CDE had to use kubectl to access service pods during troubleshooting scenarios, requiring privileged access and manual steps. Now, within each CDE service, two types of bundles are available, providing different metrics about the health of the underlying Kubernetes cluster and service applications. The first bundle is a summary snapshot of the state of active pods. The second allows the administrator to choose a time period along with specific service applications (such as the cluster autoscaler) from which to extract historical logs. To learn more, visit the documentation.
06-11-2021
02:20 PM
Cloudera Data Engineering (CDE) now supports multi-version Spark pipelines. Users can easily test and promote Spark 2 workloads to Spark 3 to take advantage of the performance and stability improvements in the latest version of Spark (a performance improvement of over 30% based on internal TPC-DS benchmarks). Data engineers can run workloads in both Spark 2 and Spark 3 within the same CDP Public Cloud environment, maintaining backwards compatibility with legacy workloads while developing new applications on the latest version of Spark. Administrators have a new option within the Virtual Cluster creation wizard to choose a Spark version. Once up and running, users can seamlessly transition to deploying their Spark 3 jobs through the same UI and CLI/API as before, with comprehensive monitoring of their pipelines, including real-time logs and the Spark UI. A future release will add the visual performance profiler to Spark 3 job run details. To learn more, visit the documentation.
06-11-2021
02:18 PM
Users of Cloudera Data Engineering (CDE) on CDP Public Cloud can now deploy, monitor, and schedule data pipelines on Microsoft Azure. Customers can take advantage of fully managed Spark-on-Kubernetes with autoscaling compute and guardrails to control cost, without the usual platform management overhead. As opposed to traditional job submission mechanisms that require direct access to the cluster via edge nodes, data engineers can deploy data pipelines to autoscaling Virtual Clusters through simple, browser-based UI wizards or a full-fledged CLI and API. Once deployed, CDE optimizes execution through performance metric profiling and real-time monitoring, and provides users with a comprehensive view of their pipelines. For further operationalization, a managed Apache Airflow service can be used to orchestrate complex pipelines on a schedule or based on event triggers. Future releases will add support for Spot and SSD instances, as well as Private Link. To get started, visit the documentation. Supported Azure regions can be found here.
03-25-2021
03:51 PM
CDS 3.1, powered by Apache Spark 3.1.1, is now generally available for CDP Private Cloud Base 7.1.6. This is a minor release of CDS 3. The main improvements include:
The parcel contains a Spark 3 compliant spark-hbase connector
All performance enhancements of Apache Spark 3.1.1, such as new optimizer rules and improved subexpression elimination
Unified CREATE TABLE SQL syntax
Shuffled hash join improvements
CDS 3.1 can be installed as an add-on parcel on top of CDP Private Cloud Base 7.1.6. The bits can be found here and the latest documentation can be found here. For Public Cloud, the same bits will be available in a pre-warmed image for Cloudera Runtime 7.2.9, which can be installed through the Data Engineering Spark3 Data Hub cluster templates.
--------------
Want to become a pro Spark user? Sign up for Apache Spark Training.
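On the shuffled hash join improvement above, here is a minimal PySpark sketch of opting into that join strategy via the SHUFFLE_HASH join hint, which is upstream Spark 3 syntax rather than anything CDS-specific:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-hash-join-demo").getOrCreate()

spark.range(1_000_000).createOrReplaceTempView("orders")
spark.range(10_000).createOrReplaceTempView("customers")

# The SHUFFLE_HASH hint asks the optimizer to build a hash table on the
# hinted side instead of falling back to a sort-merge join.
df = spark.sql("""
    SELECT /*+ SHUFFLE_HASH(customers) */ *
    FROM orders
    JOIN customers ON orders.id = customers.id
""")
df.explain()  # the physical plan should show a shuffled hash join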
09-24-2020
05:51 PM
We are happy to announce the General Availability of CDS 3.0, powered by Apache Spark 3.0.1. You can download the parcel and apply it directly to provisioned clusters without disrupting your currently running Spark workloads, while taking advantage of all the new features and benefits that come with Spark 3.0. This component is generally available and is supported on CDP Private Cloud Base clusters running version 7.1.3 and above. What's new in CDS 3.0:
Support for JDK 11, Scala 2.12, and Python 3.4+ (Python 2.7+ deprecated)
Adaptive execution of Spark SQL
Dynamic partition pruning
Binary files data source
DataSource V2 improvements (pluggable catalog integration)
Structured Streaming UI
Auto-discovery and scheduling of tasks on nodes with GPUs on a YARN cluster
Kafka connector delegation token (0.10+)
See the documentation for details and installation.
--------------
Want to become a pro Spark user? Sign up for Apache Spark Training.
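Of these, adaptive execution is the one most users will want to try first. A minimal sketch of enabling it in a PySpark session, using standard upstream Spark 3 configuration keys:

from pyspark.sql import SparkSession

# Adaptive Query Execution (AQE) re-optimizes query plans at runtime
# based on shuffle statistics; these are upstream Spark 3 settings.
spark = (
    SparkSession.builder.appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

print(spark.conf.get("spark.sql.adaptive.enabled"))  # "true"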
08-28-2020
01:49 PM
Cloudera Data Engineering is an integrated, purpose-built experience for data engineers. It delivers a streamlined service for scheduling, monitoring, debugging, and promoting data pipelines quickly and securely across the enterprise at scale.
The key to the experience is a centralized interface that simplifies the job management life cycle, from scheduling and deploying through monitoring, debugging, and promotion, alleviating many of the challenges of running Spark jobs in production at scale. Like CML and CDW, CDE is cloud-native, leveraging Kubernetes: platform admins can quickly provision virtual compute clusters with strong isolation, capacity auto-scaling, and quotas for cost management.
For Platform Admins:
Managed Spark service running on Kubernetes, with mixed-version Spark deployments accelerating DE workflows with zero setup. One-click provisioning of new workloads, with guardrails for CPU and memory.
Data governance and management through integration with SDX, providing security and visibility with automatic lineage capture without any code changes.
Monitoring of system services and utilization metrics through Grafana.
CDP security integration that includes SSO with FreeIPA, Kerberos, Ranger, Knox, and Istio.
For Data Engineers:
Easy job deployment with configuration management, dependency artifacts, and Spark tuning parameters.
Apache Airflow-based scheduling service for orchestration of complex data pipelines with job dependencies.
Self-service visual troubleshooting and performance tuning of Spark jobs.
Rich API support for CI/CD and other automation use cases, accessible through the CLI and REST API (see the sketch below).
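As a hedged illustration of that REST access, the sketch below lists the jobs in a Virtual Cluster with Python's requests library. The base URL placeholder, token handling, and /jobs path are assumptions for the sake of the example and should be verified against the CDE API documentation linked below:

import requests

# Placeholder values: the real jobs API URL is shown on the Virtual
# Cluster details page, and access tokens come from the CDE auth flow.
API_BASE = "https://<jobs-api-url>/api/v1"
TOKEN = "<access-token>"

resp = requests.get(
    f"{API_BASE}/jobs",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()

# Print the name of each job defined in the Virtual Cluster; the
# response shape is an assumption to check against the API reference.
for job in resp.json().get("jobs", []):
    print(job.get("name"))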
For more information:
Getting Started with Cloudera Data Engineering on CDP
Using CLI-API to Automate Access to Cloudera Data Engineering
Online docs