CDS 3.1 powered by Apache Spark 3.1.1 is now generally available for CDP Private Cloud Base 7.1.6. This is a minor release of CDS 3.
The main improvements include:
A Spark 3 compliant spark-hbase connector, included in the parcel
All performance enhancements of Apache Spark 3.1.1, such as new optimizer rules and improved subexpression elimination
Unified CREATE TABLE SQL syntax
Shuffled hash join improvements
CDS 3.1 can be installed as an add-on parcel on top of CDP Private Cloud Base 7.1.6. The bits can be found here and the latest documentation can be found here.
For Public Cloud, the same bits will be available in a pre-warmed image for Cloudera Runtime 7.2.9, which can be installed through the Data Engineering Spark3 Data Hub cluster templates.
--------------
Want to become a pro Spark user? Sign up for Apache Spark Training.
We are happy to announce the General Availability of CDS 3.0 Powered By Apache Spark 3.0.1. You can download the parcel and apply it directly to provisioned clusters without disrupting your currently running Spark workloads, while taking advantage of all the new features and benefits that come with Spark 3.0. This component is generally available and is supported on CDP Private Cloud Base clusters running version 7.1.3 and above.
What's New in CDS 3.0?
Support for JDK 11, Scala 2.12, and Python 3.4+ (Python 2.7+ deprecated)
Adaptive execution of Spark SQL
Dynamic Partition Pruning
Binary files data source
DataSource V2 improvements (pluggable catalog integration)
Structured Streaming UI
Automatic discovery and scheduling of tasks on nodes with GPUs on a YARN cluster
Kafka connector delegation token (0.10+)
See the documentation for details and installation.
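Two of the Spark 3.0 features listed above, adaptive execution of Spark SQL and Dynamic Partition Pruning, are controlled by standard Apache Spark configuration properties. Adaptive execution is off by default in Apache Spark 3.0, so it has to be switched on explicitly. The following is a minimal sketch of passing both settings at submit time. It assumes the job is launched with the `spark3-submit` command, and `my_job.py` is a hypothetical application file, not something from the announcement.

```python
# Sketch: render Spark 3.0 SQL settings as spark-submit style arguments.
# The two property names are standard Apache Spark configuration keys.
# "spark3-submit" and "my_job.py" are assumptions used for illustration.

SPARK3_SQL_CONF = {
    # Adaptive query execution (disabled by default in Apache Spark 3.0).
    "spark.sql.adaptive.enabled": "true",
    # Dynamic partition pruning (enabled by default, set here explicitly).
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
}


def to_submit_args(conf):
    """Turn a {property: value} mapping into a flat list of --conf arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


def build_command(app_file, conf, submit_binary="spark3-submit"):
    """Assemble the full command line as a list, ready for subprocess.run."""
    return [submit_binary] + to_submit_args(conf) + [app_file]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_command("my_job.py", SPARK3_SQL_CONF)))
```

The same two properties can also be set inside an application through the session builder or with `spark.conf.set` rather than on the command line.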
Cloudera Data Engineering (CDE) is an integrated, purpose-built experience for data engineers. It delivers a streamlined service for scheduling, monitoring, debugging, and promoting data pipelines quickly and securely across the enterprise at scale.
The key to the experience is a centralized interface that simplifies the job management life cycle, from scheduling and deploying to monitoring, debugging, and promotion, which alleviates many of the challenges of running Spark jobs in production at scale. Similar to CML and CDW, CDE is cloud native and leverages Kubernetes, so Platform Admins can quickly provision virtual compute clusters with strong isolation, capacity auto-scaling, and quotas for cost management.
For Platform Admins:
Managed Spark service running on Kubernetes with mixed-version Spark deployments, accelerating DE workflows with zero setup. One-click provisioning of new workloads with guardrails for CPU and memory.
Data governance and management through integration with SDX for security and visibility, with automatic lineage capture without any code changes.
Monitoring of system services and utilization metrics through Grafana.
CDP security integration that includes SSO with FreeIPA, Kerberos, Ranger, Knox, and Istio.
For Data Engineers:
Easy job deployment with configuration management, dependency artifacts, and Spark tuning parameters.
Apache Airflow-based scheduling service for orchestration of complex data pipelines with job dependencies.
Self-service visual troubleshooting and performance tuning of Spark jobs.
Rich API support for CI/CD and other automation use cases, accessible through the CLI and REST API.
For more information:
Getting Started with Cloudera Data Engineering on CDP
Using CLI-API to Automate Access to Cloudera Data Engineering