Which data warehousing engines are available in CDW?
CDW offers Hive for enterprise data warehouse (EDW) workloads, complex report building, and dashboarding; Impala for interactive SQL and ad hoc exploration; Kudu for time-series data; and Druid for log analytics.
How do you create data warehouses?
Step 1: create your CDP environment. Step 2: activate the CDW service. Step 3: create your virtual warehouse. Step 4: define tables, load data, run queries, and integrate your BI tool.
What’s the relationship between a Database Catalog and Virtual Warehouse?
Each Database Catalog can have one or more Virtual Warehouses. Each Virtual Warehouse runs in isolation from the others, but all Virtual Warehouses attached to the same Database Catalog share the same data and metadata.
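The one-to-many relationship can be pictured with a small Python model. This is only an illustrative sketch: the class names, fields, and the `run` method are hypothetical and are not part of any CDW API. It shows two warehouses reading the same table through one shared catalog while each keeps its own private query list.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseCatalog:
    """Hypothetical stand-in for a Database Catalog: shared data and metadata."""
    name: str
    tables: dict = field(default_factory=dict)  # table name -> list of rows


@dataclass
class VirtualWarehouse:
    """Hypothetical stand-in for a Virtual Warehouse: isolated query execution."""
    name: str
    engine: str  # e.g. "hive" or "impala"
    catalog: DatabaseCatalog
    running_queries: list = field(default_factory=list)  # private to this warehouse

    def run(self, table: str) -> list:
        # Record the work locally, then read from the shared catalog.
        self.running_queries.append(table)
        return self.catalog.tables[table]


catalog = DatabaseCatalog("default")
catalog.tables["sales"] = [("2024-01-01", 100)]

# Two warehouses attached to the same catalog.
etl = VirtualWarehouse("etl-vw", "hive", catalog)
bi = VirtualWarehouse("bi-vw", "impala", catalog)

etl_rows = etl.run("sales")
bi_rows = bi.run("sales")
```

Both warehouses return the same rows because they read from the same catalog object, while `etl.running_queries` and `bi.running_queries` remain separate lists.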
How does CDW query performance hold up in the cloud when using remote storage?
CDW uses several levels of caching to offset the latency of object storage access: a data cache on each query execution node, a query result cache (for Hive LLAP), and materialized views (for Hive LLAP).
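To make the idea of a query result cache concrete, here is a minimal Python sketch. It is not how Hive LLAP implements its cache; the `ResultCache` class, its capacity, and the `slow_fetch` function are assumptions made up for illustration. It shows that a repeated query is answered from memory instead of going back to slow storage.

```python
from collections import OrderedDict


class ResultCache:
    """Hypothetical least-recently-used cache keyed by the query text."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, sql, fetch):
        key = sql.strip()  # naive key: exact query text minus outer whitespace
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        result = fetch(sql)  # the slow path, e.g. a scan of object storage
        self._entries[key] = result
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return result


calls = []


def slow_fetch(sql):
    """Stand-in for a query that has to read remote storage."""
    calls.append(sql)
    return [("row",)]


cache = ResultCache()
first = cache.get_or_compute("SELECT 1", slow_fetch)
second = cache.get_or_compute("SELECT 1", slow_fetch)
```

The second call does not reach `slow_fetch`. A real result cache must also invalidate entries when the underlying data changes, which this sketch does not model.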
How does CDW handle workloads with high concurrency?
In three ways. First, the query engine is designed to be as efficient as possible, for example through runtime code generation, to maximize the number of queries that can run concurrently. Second, if you expect a lot of concurrent queries, you can choose a larger virtual warehouse size, such as medium or large. Third, auto-scaling spins up more nodes (by creating new executor groups) as query concurrency increases.
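The auto-scaling idea can be sketched as a simple sizing rule in Python. The function name, the queries-per-group figure, and the minimum and maximum limits below are all hypothetical; the actual CDW scaling policy and its thresholds are not described here. The sketch only shows the general shape: more concurrent queries lead to more executor groups, within configured bounds.

```python
import math


def executor_groups_needed(concurrent_queries, queries_per_group,
                           min_groups=1, max_groups=10):
    """Return how many executor groups a hypothetical auto-scaler would run.

    All parameters are illustrative assumptions, not CDW settings.
    """
    if queries_per_group <= 0:
        raise ValueError("queries_per_group must be positive")
    # Enough groups to serve every concurrent query...
    wanted = math.ceil(concurrent_queries / queries_per_group)
    # ...clamped to the configured minimum and maximum.
    return max(min_groups, min(max_groups, wanted))


idle = executor_groups_needed(0, queries_per_group=10)
busy = executor_groups_needed(25, queries_per_group=10)
capped = executor_groups_needed(500, queries_per_group=10)
```

Here `idle` stays at the minimum of one group, `busy` grows to three groups, and `capped` stops at the maximum of ten.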
What tools are available for tuning and troubleshooting CDW?
CDP Workload Manager collects telemetry from Hive and Impala queries and Spark jobs, then profiles them. It automatically identifies errors and inefficiencies and suggests how to resolve them, which helps with troubleshooting and data lifecycle optimization.
What’s the best way to run diagnostics in CDW?
All log files are stored in object storage, such as S3. Grafana is also available for viewing metrics, such as what is running on each executor.