Member since: 08-13-2019
Posts: 47
Kudos Received: 39
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2340 | 11-06-2016 06:27 PM
 | 8551 | 10-03-2016 06:01 PM
 | 2714 | 03-17-2016 02:21 AM
11-20-2020
07:07 AM
3 Kudos
This article contains Questions & Answers on Cloudera Data Engineering (CDE).
Is Cloudera Data Engineering integrated with Cloudera Data Warehouse (CDW) and Cloudera Machine Learning (CML)?
Yes. Through SDX, datasets generated by your Data Engineering pipelines are automatically accessible to downstream analytics such as Data Warehousing and Machine Learning. In addition, you get all the benefits of SDX, including lineage, to secure your data pipelines across your enterprise.
I already have Cloudera Machine Learning (CML) that has Spark capabilities. How is this better?
Cloudera Data Engineering is purpose-built for data engineers to operationalize their data pipelines, while CML is tailored to data scientists who want to develop and operationalize their ML models. The two services are fully integrated and seamlessly interoperable however you want to run your data engineering and data science workflows. Because both are included with CDP, there’s no extra purchase necessary to use one or the other; both are billed by the hour based on consumption, so you only pay for what you use.
Is there an orchestration/scheduling tool within CDE?
Yes. We have a managed Apache Airflow scheduling and orchestration service natively in the platform. This offers superior capabilities compared to existing tools in the market, and we’ve extended it even further for automation and delivery through our rich set of APIs.
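As a rough illustration of the orchestration idea, here is a minimal Airflow DAG sketch that chains two pipeline steps. It uses plain Airflow operators (Airflow 2-style imports) rather than any CDE-specific operator, and the cde shell commands inside the tasks are placeholders for however you actually trigger your jobs.

# Minimal Airflow DAG sketch for a nightly two-step pipeline.
# Assumptions: Airflow 2 imports; the `cde job run ...` shell commands are
# placeholders, not verified CDE CLI invocations.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_ingest_pipeline",
    start_date=datetime(2020, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_data",
        bash_command="cde job run --name ingest-raw",        # placeholder command
    )
    transform = BashOperator(
        task_id="transform_to_curated",
        bash_command="cde job run --name transform-curated",  # placeholder command
    )
    ingest >> transform  # run the transform only after the ingest succeeds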
Where does a Data Engineer write code? Does CDE provide notebooks to develop pipelines as well?
CDE supports Scala, Java, and Python code. It is flexible: any jobs you have developed locally in your favorite IDE or through third-party tools can be deployed through a rich set of job management APIs. CDE offers a CLI to submit jobs securely from your local machine, as well as REST APIs to integrate with CI/CD workflows. And with Cloudera Machine Learning (CML) you can develop in notebooks without leaving the CDP ecosystem and operationalize the results in CDE.
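To make the API-driven deployment idea concrete, here is a hedged Python sketch that creates a Spark job through a CDE-style REST endpoint using the requests library. The base URL, endpoint path, payload fields, and token handling are illustrative placeholders rather than the documented CDE API contract.

# Hypothetical sketch of registering a Spark job via a CDE-style jobs REST API.
# The base URL, endpoint path, payload fields, and token are placeholders;
# consult the CDE jobs API documentation for the real contract.
import requests

CDE_API = "https://<your-virtual-cluster-endpoint>/api/v1"   # placeholder
TOKEN = "<access-token>"                                     # placeholder

job_spec = {
    "name": "nightly-etl",
    "type": "spark",
    "spark": {"file": "etl_job.py", "executorCores": 2, "executorMemory": "4g"},
}

resp = requests.post(
    f"{CDE_API}/jobs",
    json=job_spec,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print("Job created:", resp.json())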
07-21-2020
12:58 PM
3 Kudos
This article contains Questions & Answers on Cloudera Operational Database (COD).
What is the relationship between Data Hub & Data Lake?
The Data Lake houses SDX (governance & authorization). Data Hub is the actual service that hosts the workload, in this case the operational database.
When should I use Cloudera OpDB vs a template in Data Hub?
For new apps, use Cloudera Operational Database (COD), which is self-tuning and auto-improves performance over time. For replicating on-prem environments to the cloud via lift and shift, or for disaster recovery, use Data Hub templates.
How does Apache Phoenix relate to Apache HBase?
Phoenix is an OLTP SQL engine for OpDB. It adds relational capabilities on top of HBase, provides a much more familiar programming paradigm, and allows our customers to reach production faster. Think of Phoenix as a SQL persona and HBase as a NoSQL persona.
Should I use HBase/Phoenix or Apache Kudu for an operational data store (ODS) / operational database?
The Cloudera Operational Database is powered by HBase and Phoenix. Kudu is part of our data warehouse offering that allows you to do real-time analytics on streaming data. Just as you have to choose between an operational database and a data warehouse based on what you want to do, you similarly need to decide between OpDB and Kudu. Both systems support real-time ingest of streaming and time-series data. OpDB is the platform you want if you are building applications. If you are building dashboards or doing ad hoc analytics, Kudu is the better choice.
What’s the frequency for replication? What’s the granularity?
Replication with Replication Manager for OpDB is near-real-time and eventually consistent. There’s no waiting period and no scheduling required. As for granularity, you can choose a table or a namespace (akin to a database in a more traditional system).
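To illustrate the "SQL persona on HBase" point above, here is a minimal sketch using the phoenixdb Python driver against a Phoenix Query Server. The connection URL is a placeholder for your COD endpoint, and the readings table is a made-up example.

# Minimal sketch: relational-style access to HBase through Phoenix.
# Assumes the phoenixdb package and a reachable Phoenix Query Server;
# the URL is a placeholder and the table is purely illustrative.
import phoenixdb

conn = phoenixdb.connect("http://<phoenix-query-server>:8765/", autocommit=True)
cursor = conn.cursor()

# DDL and upserts use familiar SQL while HBase handles storage underneath.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS readings "
    "(sensor_id BIGINT NOT NULL, ts TIMESTAMP NOT NULL, reading DOUBLE "
    "CONSTRAINT pk PRIMARY KEY (sensor_id, ts))"
)
cursor.execute("UPSERT INTO readings VALUES (42, CURRENT_TIME(), 21.5)")
cursor.execute("SELECT sensor_id, reading FROM readings WHERE sensor_id = 42")
print(cursor.fetchall())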
07-21-2020
12:53 PM
3 Kudos
This article contains Questions & Answers on Cloudera Data Warehouse (CDW) demos.
Which data warehousing engines are available in CDW?
Hive for EDW, complex report building, and dashboarding; Impala for interactive SQL and ad hoc exploration; Kudu for time-series; and Druid for log analytics.
How do you create data warehouses?
Step 1, create your CDP environment. Step 2, activate the CDW service. Step 3, create your virtual warehouse. Step 4, define tables, load data, run queries, integrate your BI tool, etc.
What’s the relationship between a Database Catalog and a Virtual Warehouse?
Each Database Catalog can serve one or more Virtual Warehouses. Each Virtual Warehouse is isolated from the others while they all share the same data and metadata.
What’s the CDW query performance in the cloud when using remote storage?
CDW uses several levels of caching to offset object storage access latency: a data cache on each query execution node, a query result cache (for Hive LLAP), and materialized views (for Hive LLAP).
Does the Virtual Warehouse cluster have a local storage cache?
Yes, local SSDs are used as a caching layer between object storage and compute to improve query performance, with 600 GB per executor node.
How does CDW handle workloads with high concurrency?
One, the query engines are designed to be as efficient as possible to maximize the number of concurrent queries, e.g. with runtime code generation. Two, if you know you will have many concurrent queries, you can choose a larger virtual warehouse size, e.g. medium or large. Three, auto-scaling spins up more nodes (by creating new executor groups) as query concurrency increases.
What tools are available for tuning and troubleshooting CDW?
CDP Workload Manager collects telemetry from Hive and Impala queries and Spark jobs, profiles them, automatically identifies errors and inefficiencies, and suggests resolutions. This helps with troubleshooting and data lifecycle optimization.
What’s the best way to run diagnostics in CDW?
All the log files are stored in object storage, e.g. S3. Grafana is also available to access all the metrics, e.g. what’s running on different executors.
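As a hedged sketch of step 4 above (connecting a SQL client and running queries), here is a minimal Python example using the impyla client against a virtual warehouse. The hostname, credentials, connection options, and the sales table are placeholders; the exact connection settings come from your virtual warehouse's connection details.

# Hedged sketch: querying a CDW virtual warehouse from Python with impyla.
# Host, credentials, connection options, and the table are placeholders.
from impala.dbapi import connect

conn = connect(
    host="<virtual-warehouse-host>",   # placeholder
    port=443,
    auth_mechanism="LDAP",             # adjust to your warehouse's auth setup
    user="<workload-user>",
    password="<workload-password>",
    use_ssl=True,
)
cursor = conn.cursor()
cursor.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")  # illustrative table
for row in cursor.fetchall():
    print(row)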
07-21-2020
12:46 PM
3 Kudos
This article contains Questions & Answers on Cloudera Machine Learning (CML).
Is it possible to run CML on premises?
Yes, Cloudera Machine Learning is available on both CDP Public Cloud and CDP Private Cloud.
How is the deployment of models managed by CML Private Cloud vs CML Public Cloud?
Model deployment works the same way in both form factors of CML: the models are built into containerized images and then deployed on Kubernetes pods for production-level serving.
Is there a level of programming required for a data scientist to use this platform? What languages can developers use?
CML enables data scientists to write code in Python, R, or Scala in their editor of choice. Beginner data scientists can easily run sample code in the workbench, and more experienced data scientists can leverage open source libraries for more complex workloads.
Can you run SQL-like queries? E.g. with Spark SQL?
Yes, Spark SQL can be run from CML.
Do pre-built models come out of the box?
While CML does not have built-in libraries of pre-built models, CML will soon come with reusable and customizable Applied Machine Learning Prototypes. These prototypes are fully built ML projects with all the code, models, and applications for leveraging best practices and novel algorithms. Additionally, CML is a platform on which you can leverage the ML libraries and approaches of your choice to build your own models.
Can I do AutoML with CML?
CML is designed to be the platform on top of which data scientists bring the Python, R, and Scala libraries and packages they need to run their workloads. This includes leveraging open source AutoML technologies within CML Projects. In addition, Cloudera is working with partners such as H2O to further enable data scientists with specific AutoML distributions, as well as citizen data scientists who are looking for a more interactive ML experience.
What is your MLOps support?
CML’s MLOps capabilities and integration with SDX for models bring prediction and accuracy monitoring, production environment ground truthing, model cataloging, and full lifecycle lineage tracking.
Can the result/output of the ML model be made available as a CSV or Excel file for a business user to use on a different platform?
Yes, you can certainly ensure that the output of models is available in the external format of your choice.
What about multi-file model projects?
CML lets you deploy multiple models in a project and allows for complex dependency management through the analytical job scheduling functionality.
What about model access monitoring? Does CML log all REST access directly?
Yes, all access to the models is logged. CML’s MLOps also enables fine-grained tracking of model telemetry for monitoring drift and quality. We have also implemented comprehensive security mechanisms on top of models so that each request can be fully audited.
Does CML support automated model tuning?
Yes. CML supports AutoML, TPOT, and other automation frameworks. CML also has a comprehensive API for managing experiments, models, jobs, and applications. MLOps brings tracking and monitoring of metrics during model build and deployment so that model performance and drift can be managed. CML Jobs can then be used to retrain models if their performance falls outside the desired range.
What technologies does CDP CML use? Mahout, TensorFlow, others?
CML takes a bring-your-own-framework approach. We support Scala, Python, and R frameworks by default, so libraries such as TensorFlow, Dask, and sparklyr simply need to be installed to be usable.
Do Jupyter notebooks come with CML?
Yes, data scientists can use Jupyter notebooks on top of CML. CML also has the flexibility to enable other editors, e.g. RStudio or PyCharm.
Do R programs for CML also run in parallel on CDP?
Yes. CML supports R, which can be run in parallel on CDP using the sparklyr library.
How are Python packages handled in CML?
You can install your own Python libraries and packages into your session, either via the Jupyter terminal or via the built-in editor, using pip or conda install.
How easy is it to spin up and down different environments/workspaces?
From the CDP Management Console, it only takes a few clicks and a few minutes to spin up and down different workspaces.
When a session ends, do packages have to be re-installed?
No. The packages are saved with the project and shared between sessions. However, different projects will not share the same packages, which keeps the environments separate.
Is Spark being used as part of the platform or as part of CML?
Cloudera Machine Learning (CML) leverages Spark-on-Kubernetes, enabling data scientists to directly manage the resources and dependencies necessary for the Spark cluster. Once the workload is completed, the Spark executors are spun down to free up the resources for other uses.
Can data scientists bring their own versions of Spark, Python, or R?
The core engine has the latest versions of Spark, Python, and R, but you can further customize these engines and make them available to your data scientists.
Should the engine profile be set by admins and disabled for data scientists?
Admins manage the engines available across the ML Workspace, and data scientists choose which engine to use for each Project.
Can data scientists create their own workspaces?
Generally, it’s the data science admins who create and manage workspaces. However, it is possible to enable data scientists to do so as well through permissions.
How is data access handled in CML?
Data access is centralized and managed through CML’s integration with Cloudera’s SDX. For example, if I access data from a Data Warehouse or a Kafka topic in Spark, the SDX services determine my permissions, apply the masking and filtering policies, and then fully audit and record the lineage of the access.
Do ID Broker rules apply to the Machine Learning experience as well?
Yes, they do.
How is CML different in CDP vs CDH (i.e. CDSW)?
CML expands the end-to-end workflow of Cloudera Data Science Workbench (CDSW) with cloud-native benefits like rapid provisioning, elastic autoscaling, distributed dependency isolation, and distributed GPU training. In addition, CML operates on top of CDP Public Cloud and CDP Private Cloud, while CDSW operates on top of CDH and HDP. More details can be found in our documentation.
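To tie the MLOps and model access points above to something concrete, here is a hedged sketch of calling a model that has been deployed in CML from an external Python client. The endpoint URL, access key, and payload shape are placeholders; copy the real ones from the model's deployment page in your workspace.

# Hedged sketch: scoring a model deployed in CML over REST.
# The endpoint URL, access key, input fields, and response shape are placeholders.
import requests

MODEL_ENDPOINT = "https://modelservice.<cml-domain>/model"   # placeholder
ACCESS_KEY = "<model-access-key>"                            # placeholder

payload = {
    "accessKey": ACCESS_KEY,
    "request": {"petal_length": 1.4, "petal_width": 0.2},    # illustrative features
}
resp = requests.post(MODEL_ENDPOINT, json=payload)
resp.raise_for_status()
print(resp.json())   # the prediction comes back as JSON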
07-21-2020
12:07 PM
2 Kudos
This article contains Questions & Answers on Cloudera DataFlow (CDF).
What is the difference between Cloudera’s NiFi and Apache NiFi?
Cloudera Flow Management (CFM) is based on Apache NiFi but comes with all the additional platform integration that you’ve just seen in the demo. We make sure it works with CDP’s identity management and integrates with Apache Ranger and Apache Atlas. The original creators of Apache NiFi work for Cloudera.
Does NiFi come with CDP Public Cloud or is it an add-on?
Any CDP Public Cloud customer can start using NiFi by creating Flow Management clusters in CDP Data Hub. You don’t need additional licenses, and you are charged based on how many instance hours your clusters consume.
What types of read/write data does NiFi support?
NiFi supports 400+ processors covering many sources and destinations.
Do you support cloud-native data sources and sinks?
Yes, we support many cloud-native sources and sinks with dedicated processors for AWS, Azure, and GCP. This allows you to interact with the managed services and object storage solutions of the cloud providers (S3, GCS, ADLS, Blob Storage, Event Hub, Kinesis, Pub/Sub, BigQuery, etc.).
When you load data into object storage, what details do you need to know?
There are two options. One, putting data into an object store under CDP control is simple: you only need to know where you want to write to and your CDP username and password. Two, if you use an object store outside of CDP control, then you need to use the cloud connector and the specifics of the authentication method that you want to use.
How does using NiFi in the upcoming DataFlow service differ from using NiFi as a Flow Management cluster on CDP Data Hub?
CDP Data Hub makes it very easy to create a fully secure NiFi cluster using the preconfigured Flow Management cluster definitions. These clusters run on virtual machines and offer the traditional NiFi development and production experience. The CDP DataFlow Service, on the other hand, focuses on deploying and monitoring NiFi data flows. It takes care of deploying the required NiFi infrastructure on Kubernetes, providing auto-scaling and better workload isolation. Running NiFi in the CDP DataFlow Service is ideal for NiFi flows where you expect bursty data. Since the service manages the underlying cluster lifecycle, you can focus on developing and monitoring your data flows.
How does DataFlow compare with the tools I can use from cloud vendors?
Flow Management is based on Apache NiFi, which is not available from any other cloud vendor. In addition to CDP being the only cloud service provider for Apache NiFi, our Streams Messaging and Streaming Analytics components are tightly integrated with each other, allowing centralized security policy management and data governance. This is powered by Cloudera SDX and helps you understand the end-to-end data flow across the entire Cloudera Flow portfolio plus other CDP components like Hive or Spark jobs.
How are policies set on S3 buckets in AWS by CDF?
An ID Broker lets you map your internal CDP users to your internal IAM roles. The mapping is from a specific user to a specific role that allows them to access a specific S3 bucket.
Can you load data into Hive from NiFi?
Yes. Just use the PutHiveStreaming NiFi processor and set a few parameters.
Can I access my own internal API with NiFi?
Yes. You can write a custom NiFi processor in Java or use an HTTP processor.
Is MiNiFi support part of NiFi support?
Support for MiNiFi comes with Cloudera Edge Management (CEM). CEM includes not only MiNiFi but also Cloudera Edge Flow Manager, which allows you to design edge flows centrally and push them out to thousands of MiNiFi agents. This is currently offered independently of CDP, and we’re working on bringing it into the CDP experience as well. See this link for more info.
What platforms can I run the MiNiFi agent on?
Virtually any hardware or device where you can run a small C++ or Java application.
How do I get MiNiFi on my devices?
You’ll need to install MiNiFi on the devices that you need to monitor. MiNiFi is part of Cloudera Edge Management and comes with the Edge Flow Manager tool, allowing you to design flows in a central place and push them out to all your agents at the same time.
Is there a way of versioning the data flows?
Yes. NiFi comes with the NiFi Registry, which lets you version flows. In Data Hub this is set up automatically for you.
Where is NiFi in-transit data stored?
NiFi stores data that is flowing through it in so-called ‘repositories’ on local disk. In the example that was running on AWS, the NiFi instances have EBS volumes mounted where all that data is stored. NiFi also stores historic provenance data on disk so you can look up details and lineage of data long after it has been processed in the flow.
If the data ingested has a record updated, does it come back as a new entry, or does it get updated via the primary key?
That depends on your data ingest pipeline. NiFi is able to pick up updated records and move them through its data flow. If you are sending records to Kafka, Kafka doesn’t care whether the record is an update or not, but the downstream application would have to handle this. If you’re using Hive, you can use the Hive3Streaming processor in NiFi, which is able to handle upserts. If Kudu is your target, upserts are also supported.
Does NiFi have a resource manager for different components of its pipeline?
By default, all NiFi nodes process data, and NiFi is optimized to process data as quickly as possible, so it makes use of all resources that are given to it. It currently does not have an internal resource manager to assign resources to a specific flow. Going forward, we’ll be running flows in their own clusters on Kubernetes to improve this experience.
Is it possible to use metadata from Atlas in NiFi?
Currently, Atlas is used to capture NiFi data provenance metadata and to keep it up to date.
When using Atlas, is there manual setup required to use NiFi and Kafka in CDF?
No. That’s the benefit of using CDF on top of the Cloudera Data Platform (CDP) Public Cloud.
How do I connect my own Kafka producers/consumers to a Streams Messaging cluster in CDP Public Cloud Data Hub?
You can connect Kafka clients to Streams Messaging clusters no matter where your clients are running. We have published detailed instructions here. (A small client sketch also follows at the end of this article.)
What's the best way to extend an existing Kafka deployment on-prem to the public cloud with CDP?
If you have an existing Kafka cluster on-premises, the best way to extend it is: 1. Create a Streams Messaging cluster in CDP Public Cloud. 2. Use Streams Replication Manager to set up replication between the two environments. The replication can be from on-prem to cloud, vice versa, or even bidirectional. Check out the Streams Replication Manager docs for more info.
What is the value of having Atlas for provenance when NiFi already has data provenance built in?
NiFi data provenance captures what is happening in NiFi to a very detailed level; it shows you the actual data flowing through. Atlas covers lineage at a data set level, so it doesn’t contain the detailed records but rather shows you the end-to-end lineage. You’ll see data lineage through your entire pipeline across NiFi, Hive, Kafka, and Spark.
Once a stream is processed, how can I consume this data with analytics or reporting tools from on-premises?
That depends on your pipeline. You could use NiFi to write to a data store on-prem where you already have your analytic tools connected. If you are writing data to the cloud, you can configure your analytic/reporting tools to access that data there.
What is the best option to serve trained ML models for streaming data? From within NiFi or Flink?
Both options are possible. From within a NiFi flow, you can call out to a trained model in Cloudera Machine Learning (CML). And Flink lets you embed ML models directly in the stream.
Is NiFi good for complex transformations?
It depends on how complex 🙂 Generally, though, as complexity increases, Flink and Spark Streaming are a better fit.
Can you use NiFi for real-time as well as batch processing?
Yes. Both event-driven and batch processing modes are possible in NiFi.
What is the minimum number of nodes needed for a DataFlow cluster?
The number of nodes is configurable, but we have defaults for heavy- and light-duty clusters for both Flow Management and Streams Messaging. See the node sizing and layout details in the documentation: Flow Management cluster layout and Streams Messaging cluster layout.
In the case of a NiFi node failure, does the data being processed on that node automatically recover on another NiFi node?
In CDP Data Hub, yes. Since the data is stored on EBS volumes, we replace the instance if it fails and reattach the EBS volume to the new instance, so processing picks up immediately after recovery from a failure.
Is there a GUI to track/monitor all the NiFi error flows?
Today you can do it with a ReportingTask sending information to your reporting tool of choice, a secondary database, or Kafka. We are looking to release alerting and monitoring features in the next 6-12 months for public/private cloud that will work natively out of the box.
Does NiFi provide a notification or email facility when a possible bottleneck or threshold is reached in the workflow?
Yes, you can send email alerts based on failures in your NiFi flow. You can also send metrics etc. to external systems like Datadog or Prometheus.
Can I create alerts on Apache Kafka topics?
Yes, with Streams Messaging Manager (SMM) it’s easy to define alerts to manage the SLAs of your applications. SMM provides rich alert management features for the critical components of a Kafka cluster, including brokers, topics, consumers, and producers. You can set alerts, for example, on consumer lag or on data in/out whenever thresholds are exceeded.
Can you expose alerts in Streams Messaging Manager? Can you expose these via email, for example?
Yes, you can send alerts via a notifier to an email address or via an HTTP endpoint to any monitoring system you may have that accepts an HTTP request.
Is there a central way to track authentication failures?
Yes, as an admin you can use the audit view in Ranger for all authorization requests. You can then track all allowed or denied requests of your Kafka clients across the enterprise.
How do you reroute failed messages?
If you want to handle errors, you would connect error relationships to one process group that then handles your errors. You could then apply corrections to these failed events and try to re-process them. And if you’re working with Kafka, your events are always safe in Kafka, so you can always reprocess those events again.
Where are parameters set in NiFi?
Parameters are configured in a parameter context, which holds a list of parameters and their values. You can then assign a parameter context to a specific process group.
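Referring back to the question above about connecting your own Kafka clients to a Streams Messaging cluster, here is a hedged Python sketch using the confluent_kafka client. The broker address, security settings, credentials, and topic name are placeholders for your own cluster's values.

# Hedged sketch: producing events to a Kafka topic in a Streams Messaging cluster.
# Broker address, security protocol, credentials, and topic are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "<broker-host>:9093",   # placeholder
    "security.protocol": "SASL_SSL",             # adjust to your cluster's setup
    "sasl.mechanism": "PLAIN",
    "sasl.username": "<workload-user>",
    "sasl.password": "<workload-password>",
})

def on_delivery(err, msg):
    # Called once per message with the broker's acknowledgement or an error.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}]")

for i in range(10):
    producer.produce("sensor-events", value=f'{{"reading": {i}}}', callback=on_delivery)

producer.flush()   # wait for all outstanding messages to be acknowledged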
07-21-2020
11:39 AM
2 Kudos
This article contains Questions & Answers on CDP Security and Governance (SDX).
Does SDX provide governance for all the data in the cluster that’s in the cloud?
Yes. Any cluster that has been built with CDP will have governance applied to it, regardless of whether it’s deployed to a public or private cloud.
How do we enable end-to-end governance on CDP? Is it different for public cloud and private cloud?
End-to-end governance is enabled by default in CDP Public Cloud and Private Cloud. Under the hood, Apache Ranger and Apache Atlas are wired out of the box into all the processing engines in CDP (e.g. Apache Hive, Impala, NiFi, Spark, et al.).
How do you set up SDX?
SDX installation is completed when you provision an environment, with wire and at-rest encryption preconfigured. Technical metadata management functionality is also set up automatically. Business metadata and data policies must be implemented according to the customer’s context and requirements.
How do you get an SDX license? How much does it cost?
At present, SDX is part of CDP and not licensed separately.
Is the Data Catalog only available in CDP Public Cloud?
Today, yes. The Data Catalog capabilities are available in CDP Public Cloud and can manage data from both Azure and AWS. Data Catalog is on the roadmap for the private cloud. The core lineage and governance capabilities are available in CDP Private Cloud Base using Apache Atlas and Ranger directly.
It appears SDX is great for security and data classification. What about data quality?
Customers can leverage both Cloudera’s extensive partner ecosystem and the native CDP capabilities to tailor data quality capabilities to their specific requirements. A recent addition to the Data Catalog on this front is the open data profiling capability.
Are there any pre-built rules/policies that come with SDX?
We provide common pre-built data profilers that look for common GDPR and PII values, along with pre-defined tags in Data Catalog. Data governance teams can either leverage these standard capabilities or customize their own to implement their specific data access and governance policies.
Is tagging done automatically?
The profiler has multiple tagging profiles that can be applied automatically; however, the customer can modify or create their own profilers and associate tags based on the profiling criteria. Data stewards can also trigger the execution of specific profilers against specific files.
Will tagging be migrated from CDH to CDP?
Users upgrading clusters from CDH to CDP Private Cloud Base will have Navigator’s properties and managed metadata tags automatically migrated to Atlas’s equivalent properties and business metadata tags.
The policies in SDX are great. What are some other SDX capabilities that I should be aware of?
Audit reports (API driven), asset search, end-to-end lineage, and profilers; and soon an ML model registry, ML feature registry, schema registry, and NiFi Registry. All meta stores are consolidated in one place with SDX.
Where are data privacy policies stored? Are they in a central location? Can you audit them?
All data access and data privacy policies are stored in Ranger, which is part of each CDP deployment. Data Catalog sits on top of Ranger and Atlas and pulls data from both of these repositories. In the public cloud, all the audit logs are stored on either the customer’s AWS S3 object store or Azure ADLS.
Can you migrate Sentry policies to Ranger, i.e. from CDH to CDP?
Yes. You can migrate them using CDP Replication Manager for migrations. The Sentry policies are also converted automatically into Ranger equivalents during in-place upgrades.
How is SDX integrated with Kubernetes? Does it use Knox?
All the services within CDP run on virtual machines directly or within Kubernetes. Currently, Data Hub clusters and the SDX Data Lake cluster use traditional virtual machines without Kubernetes. However, the cloud-optimized experiences, Cloudera Data Warehouse and Cloudera Machine Learning, are Kubernetes applications that integrate with and use the traditional CDH/HDP distribution components. All of these services are deployed and hosted inside your VPC and are protected and proxied into the VPC using Apache Knox out of the box.
How are Ranger and Atlas incorporated into CDP?
Ranger and Atlas are open source projects and are incorporated into all of CDP’s services automatically, i.e. they are integrated with all of the CDP components out of the box: Apache Solr collections, HBase tables, Hive/Impala tables, Kafka topics, and NiFi flows.
Can you explain the difference between Atlas and the Data Catalog?
The Cloudera Data Catalog provides a single pane of glass for administration and use of data assets spanning all analytics and deployments. The Data Catalog surfaces and federates the information from the various Atlas instances running in each of the various CDP environments. For example, if you have an environment in AWS in Virginia and an Azure environment in the EU, each would have its own Atlas instance, and the Cloudera Data Catalog would talk to both to present the information in a single interface. Atlas is effectively the backend, collecting lineage information and helping enforce policy, while the Data Catalog is the front end that data users use to navigate, search for, and steward the data.
How do the managed classifications help with securing data?
The combination of tags/classifications and tag-based data protection policies enables you to restrict access to the data in the tagged assets. When you tag or add managed classifications to assets such as tables or columns in CDP, the associated tag-based data protection policies enforce access and can even mask data. What's more, tags propagate to derived assets automatically, along with the enforcement of tag-based policies.
Is it possible to customize the sensitive data profilers?
Yes. See this doc.
How does CDP work with user authentication?
Good security starts with identity. If you can’t assert who someone is, it doesn’t matter what the policies say they can or cannot do. The CDP Public Cloud platform brings the users’ on-prem identity into the cloud via SAML, a protocol that does not require direct network connectivity to a customer identity provider to assert identity. Users and groups are propagated throughout the entire platform, so access policies can be consistently applied to datasets and access points across the complete CDP platform. This helps enterprise infosec and governance teams centrally manage their employee directory and organizational groupings in their existing identity management system, and ensures security policies are automatically applied in the CDP platform as employees shift around organizations or new members are onboarded.
Are the security data masking and encryption applied to Kudu as well?
Yes. Data masking can be applied to Kudu through Impala. Access control and column masking are controlled via centralized Ranger policies. Kudu supports wire encryption for protecting data in motion, and volume encryption can be used for data at rest.
I understand the Data Catalog/tagging feature works in Hive/Impala queries. What if I access my data/table from within Spark via Python or Scala? Will the permission/tagging rules still apply?
Yes. SDX provides a consistent view of the data. Assuming you access the data through the correct Spark API, you will view the data in the same masked/tagged way.
Is there any row-level security approach?
Yes. Ranger provides row-level security in CDP and allows filters to be set on data according to users and groups, as well as based on the content of fields within the rows.
Would I be able to use the Ranger access control rules/schema in custom applications, i.e. use the same access controls in my operational systems?
Yes, Ranger has open APIs. Also, the identities and groups used in CDP can be federated with your organization’s identity system (Okta, Microsoft AD, etc.). If your operational systems are also using that same identity, it is possible to leverage the same identity-based data security.
On the masking features: is the masking applied to the data on the fly when the query is executed? And besides masking, can we apply encryption or hashing to ensure the uniqueness of the data?
Yes, masking is applied to the data on the fly. You can also use all the encryption functions in Hive 3. See this link for more info.
Can I use AD or LDAP policies?
CDP supports integration with identity management services like AD through SAML.
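To make the tagging and classification discussion above a little more tangible, here is a hedged Python sketch that searches Atlas for Hive tables carrying a PII classification via its REST API. The host, port, credentials, and classification name are placeholders, and the basic-search endpoint used here should be checked against your Atlas version.

# Hedged sketch: finding Hive tables tagged as PII through the Atlas REST API.
# Host, credentials, and the classification name are placeholders; verify the
# endpoint against your Atlas version.
import requests

ATLAS_URL = "https://<atlas-host>:<port>/api/atlas/v2"   # placeholder
AUTH = ("<workload-user>", "<workload-password>")        # placeholder

resp = requests.get(
    f"{ATLAS_URL}/search/basic",
    params={"typeName": "hive_table", "classification": "PII"},
    auth=AUTH,
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity.get("typeName"), entity.get("attributes", {}).get("qualifiedName"))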
07-21-2020
11:12 AM
1 Kudo
This article contains Questions & Answers on Cloudera Data Platform (CDP) - private or public cloud.
CLOUDERA DATA PLATFORM
Which clouds is CDP Public Cloud supported on?
AWS, Azure and, soon, GCP.
What’s the difference between Cloudera’s product and the cloud providers’ offerings?
Cloudera Data Platform (CDP) is a Platform-as-a-Service (PaaS) that is cloud infrastructure agnostic and easily portable between multiple cloud providers, including private solutions such as OpenShift. CDP is both hybrid and multi-cloud from the ground up, which means one platform can serve all data lifecycle use cases, independent of location or cloud, with a unified security and governance model.
How do CDP experiences compare to solutions from other cloud service providers?
CDP has an SDX layer that stores all policies and metadata for security and governance. This preservation of state is the big differentiating factor, especially when running transient workloads and a variety of experiences. The SDX layer is present across the entire data lifecycle.
Is CDP a completely separate platform or is it merged with AWS?
CDP is a platform that can run in the public cloud, such as AWS, Azure and, soon, GCP, as well as in a private cloud running on Red Hat OpenShift.
Can CDP run in a Kubernetes environment? Can it be deployed both on-prem and in the cloud?
Yes, the CDP experiences run on the cloud providers’ Kubernetes offerings (e.g. EKS and AKS) as well as on Red Hat OpenShift in an on-prem world.
Does CDP support auto-scaling?
Yes. All our experiences support auto-scaling specific to workloads. For example, Cloudera Data Warehouse will auto-scale for concurrency as many users run queries on the cluster, so it scales up to support that load, which is common with data warehouses.
Are the autoscaling headroom and the other configurations of CDP restricted per team?
CDP has multiple levels of privileges that correspond to the different abstractions inside CDP. Today the ability to allocate isolated resources and experiences in an environment requires the “Environment Admin” role; the ability to access these environments requires an “Environment User” grant for the particular environment. The ability to scale and tune resource usage (headroom, autoscale parameters) for individual experiences can be managed by people granted the admin role for each particular service (e.g. “data warehouse admins” for CDW and “ML admins” for CML). We’ll also be adding finer-grained role-based access controls to CDP services and CRUD operations, such as Data Hub Admin and Data Lake Admin.
Is CDP serverless?
The CDP Management Console and control plane run as a service in a Cloudera account. It talks to your VPC and cloud account to provision machines for its SDX Data Lake cluster and for the workloads that run on it. Data Hubs use virtual machines and effectively provide a cluster-as-a-service. The cloud-optimized experiences such as CDW and CML give CDP the ability to control the resources provided for workloads and give data users a serverless computing experience.
How do I migrate data from CDH or HDP to CDP?
Use CDP Replication Manager (RM) to replicate all your data, metadata, and Sentry permissions or Apache Ranger policies into the cloud. That is, RM will automatically move all your workloads into the cloud.
You showed Kafka replication from a data center to the public cloud. Can I also set up replication going both ways? From the cloud to the data center?
Yes, you can!
Do you support moving data between the on-prem and cloud versions of CDP?
Yes, NiFi is one of the options for moving data back and forth between on-prem and cloud versions of the platform in a secure and resilient way.
Is CDP programmatic? Could I create a Spark cluster with an API?
Yes. CDP has a command-line interface (CLI) that can be used to create and destroy Data Hubs and workloads, as well as to interact with the control plane for user management and automation. This was designed to enable modern CI/CD with “configuration as code”. (A small scripting sketch follows at the end of this article.)
Can we have different versions of, say, Kafka in CDP?
Yes. Customers can create clusters (specifically using CDP Data Hub) with any version of the Cloudera Runtime. While there are constraints on versions depending on the version used for the Data Lake providing the authorization, audit, lineage, and data catalog for the workload, we are working on supporting mixed workloads even within one environment.
What would be the performance impact of having data in S3 compared to HDFS?
Our benchmarks indicate very similar performance characteristics between HDFS and cloud storage. We have implemented several caching improvements across the components to optimize for cloud storage usage.
Where is HDFS in the public cloud experiences?
In the public cloud, CDP uses native cloud storage (S3, ADLS) to store the data. While HDFS is used internally in the CDP Data Lake (SDX) cluster by Hadoop services, HDFS is not used to store end-user data in CDP.
Are there data encryption and decryption options in CDP?
Yes. CDP offers functionality for encryption and decryption of data at rest as well as data in motion.
Do groups of ID Brokers require direct access to an ID provider?
We’re moving to a federation model. With anything that is SAML 2.0 compliant, you can federate your users to CDP using that model.
When will new experiences, such as the recently announced Cloudera Data Engineering on Public Cloud, be available on Private Cloud?
It is part of our roadmap to have complete parity in the experiences available across both CDP Public Cloud and CDP Private Cloud. More details on exact timelines will be coming soon to our customers.
What is the plan to enable Private Cloud on top of other Kubernetes platforms?
CDP is meant to be a cloud-agnostic platform, across both Public Cloud and Private Cloud. Our roadmap contains many improvements to the dependencies on the underlying Kubernetes platform, including supporting additional distributions.
How does CDP differ on Private Cloud compared to Public Cloud?
From an end-user perspective, the two are very similar, and that’s intentional. Part of our hybrid value proposition relies on offering a consistent experience in both public and private cloud environments. The differences come in more at the platform and infrastructure admin level. More details can be found in our public docs.
WORKLOAD MANAGER
Can I connect a single instance of Workload Manager to multiple environments or clusters?
Yes, you can connect multiple environments or clusters to Workload Manager.
Does Workload Manager support all CDP deployments?
Yes, Workload Manager supports all CDP deployments.
Which engines does Workload Manager support?
Workload Manager supports all the key Cloudera engines, including Apache Hive, MapReduce, Impala, and Spark.
DATA VISUALIZATION
How many visual types does CDP Data Visualization provide?
CDP Data Visualization offers 34 visual types out of the box, plus the ability to add custom extensions as needed.
What is a visual? Is it a dashboard or an app?
Both. CDP Data Visualization enables intuitive drag-and-drop dashboarding and no-code custom application creation that can be published and shared everywhere. See this blog on CDP Data Visualization for more info.
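Tying back to the CLI and automation question above, here is a hedged sketch of scripting the CDP CLI from Python for CI/CD-style automation. It assumes the cdp CLI is installed and configured with credentials; the subcommand names and the shape of the JSON output are assumptions to confirm against your CLI version.

# Hedged sketch: driving the CDP CLI from Python.
# Assumes the `cdp` CLI is installed and configured; subcommands and output
# field names should be verified against your CLI version.
import json
import subprocess

def cdp(*args):
    # Run a CDP CLI command and parse its JSON output.
    out = subprocess.run(["cdp", *args], capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

environments = cdp("environments", "list-environments")
for env in environments.get("environments", []):
    print(env.get("environmentName"), env.get("status"))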
11-05-2018
07:10 PM
Our Q4 developer newsletter is out! Read about the Open Hybrid Architecture initiative, the Hadoop 3 blog series, the Deep Learning 101 podcast, and our best HCC articles. Check it out HERE. Previous newsletters: Q3 Developer Newsletter, Q2 Developer Newsletter, Q1 Developer Newsletter. Please post your feedback and suggestions in the comments section. Thanks! Robert
10-30-2018
11:43 PM
I've just started the third edition of the fastai course with Jeremy Howard. There’s been a lot of buzz around fast.ai and how the non-profit is making deep learning accessible to hundreds of thousands of developers. The latest mention was in the Economist: https://www.economist.com/business/2018/10/27/new-schemes-teach-the-masses-to-build-ai. For me the most exciting part of the course is learning how to get cutting-edge results (in the 90%+ accuracy range) with just a few lines of code using fastai library methods that have best practices baked in.
Below, I'll present a few lines of code that allow you to quickly classify different breeds of cats and dogs. You may recall that distinguishing between cats and dogs was a big deal just a few years ago, but now it's too easy. Thus, we’re using the Oxford-IIIT Pet Dataset http://www.robots.ox.ac.uk/~vgg/data/pets/. The code example references the latest fastai library v1.0.x built on top of PyTorch. See the GitHub repo for more details: https://github.com/fastai.
So let’s get started! First, import the prerequisite libraries:
from fastai import *
from fastai.vision import *
Set the training batch size to 64:
bs = 64
Note: if your GPU is running out of memory, set a smaller batch size, e.g. 32 or 16.
Assuming path points to your dataset of pet images, where the image labels (representing the pet breed) are the folder names, we use the handy data preparation method ImageDataBunch. We set our validation set to 20% and transform all the images to size 224. (The size 224 is a multiple of 7, which is optimal for the ResNet-34 model used in this example.)
data = ImageDataBunch.from_folder(path, train=".", valid_pct=0.2, ds_tfms=get_transforms(), size=224)
Also, let’s normalize our dataset:
data.normalize(imagenet_stats)
And preview our training data:
data.show_batch(rows=3, figsize=(7,8))
Now we’re ready to train the model. We’re using a pre-trained convolutional neural net, ResNet-34. (To learn more about convolutional neural networks see https://cs231n.github.io/convolutional-networks/.)
learn = ConvLearner(data, models.resnet34, metrics=error_rate)
Now let’s train the last layer of the model over four epochs (or cycles):
learn.fit_one_cycle(4)
And here’s the model training output:
Total time: 02:14
epoch  train_loss  valid_loss  error_rate
1      1.169139    0.316307    0.097804   (00:34)
2      0.510523    0.229121    0.072522   (00:33)
3      0.337948    0.201726    0.065868   (00:33)
4      0.242196    0.189312    0.060546   (00:33)
As you can see, after four epochs and a total time of 2 min 14 sec we get a model with 94% accuracy (error rate 6.0546%). For comparison, the state-of-the-art classification accuracy on this dataset in 2012 was only 59%! Here's the 2012 paper: http://www.robots.ox.ac.uk/~vgg/publications/2012/parkhi12a/parkhi12a.pdf.
Final comments
Of course, we could further fine-tune this model and adjust the weights across all 34 layers (a quick sketch of that step follows at the end of this post). We could also replace ResNet-34 with a larger model, e.g. ResNet-50. You can check the Stanford deep learning benchmark site https://dawn.cs.stanford.edu/benchmark/ for the top-performing models. (As of Sep 2018 ResNet-50 was the top one.) If you do decide to use ResNet-50 for your training, make sure to set the image size to 320. Also, for ResNet-50 your GPU should have at least 11 GB of memory. If you want to learn more about the fast.ai course, here's the link: https://course.fast.ai/
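As a quick, hedged sketch of the fine-tuning step mentioned in the final comments (the learning-rate range here is illustrative, not tuned for this dataset):
learn.unfreeze()                                    # make all ResNet-34 layers trainable
learn.lr_find()                                     # probe learning rates
learn.recorder.plot()                               # inspect the loss vs. learning-rate curve
learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))    # lower rates for early layers, higher for later ones
Pick the max_lr range from the lr_find plot rather than copying these values; the early layers usually want a much smaller learning rate than the freshly trained head.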
03-01-2018
07:04 PM
2 Kudos
Our Q1 newsletter is out! Please check it out and let us know what you think. http://info.hortonworks.com/rs/549-QAL-086/images/hcc-newsletter-mar-2018.html Suggestions super appreciated 🙂 Thanks!