01-03-2019
12:16 AM
2 Kudos
This is the second article in a two-part series (see Part 1) where we look at yet another architectural variation of the distributed pricing engine discussed here. This time we leverage a P2P cluster design through the Akka framework deployed on Kubernetes, with a native API Gateway using Ambassador, which in turn leverages Envoy, an L4/L7 proxy, and monitoring infrastructure powered by Prometheus.

The Cloud Native Journey

In the previous article, we just scratched the surface of one of the top-level CNCF projects – Kubernetes – the core container management,
orchestration and scheduling platform. In this article, we will take our cloud
native journey to the next level by exploring more nuanced aspects of
Kubernetes (support for stateful applications through StatefulSets, Service, Ingress)
including the use of Helm, the package manager for Kubernetes. We will also look at two other top-level CNCF projects – Prometheus and Envoy – and how these form the fundamental building blocks of any production-viable cloud native computing stack.

Application and Infrastructure Architecture

We employ a peer-to-peer cluster design for the distributed compute engine leveraging
the Akka Cluster
framework. Without going into much detail about how the framework works (see the documentation
for more), it essentially provides a fault-tolerant decentralized P2P cluster membership service
through a gossip protocol with no SPOF. The cluster nodes need to register
themselves with an ordered set of seed nodes (primary/secondaries) in order to
discover other nodes. This necessitates the seed nodes to have state semantics with
respect to leadership assignment and stickiness of addressable IP addresses/DNS
names.

A pair of nodes (for redundancy) is designated as seed nodes that act as the controller (ve-ctrl), which:

Provides contact points for the valengine (ve) worker nodes to join the cluster
Exposes a RESTful service (using Akka HTTP) for job submission from an HTTP client
Dispatches jobs to the registered ve workers

The other non-seed cluster nodes act as stateless workers to which compute job requests are routed and which perform the actual pricing task. This facilitates elasticity of the compute function as worker nodes join and leave the cluster depending on the compute load.
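To make the seed-node requirement concrete: a StatefulSet (discussed below) gives each controller pod a stable, ordinal DNS name, and those names can be handed to Akka as the ordered seed-node list. A minimal sketch, assuming a StatefulSet and headless Service named ve-ctrl, an actor system named ve-cluster, classic akka.tcp remoting on port 2551 and a jar named valengine.jar (all illustrative names, not taken from the repo):

# Stable DNS names provided by the StatefulSet + headless Service (illustrative):
#   ve-ctrl-0.ve-ctrl.default.svc.cluster.local
#   ve-ctrl-1.ve-ctrl.default.svc.cluster.local
$ java \
    -Dakka.cluster.seed-nodes.0=akka.tcp://ve-cluster@ve-ctrl-0.ve-ctrl:2551 \
    -Dakka.cluster.seed-nodes.1=akka.tcp://ve-cluster@ve-ctrl-1.ve-ctrl:2551 \
    -jar valengine.jar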
So, how do we deploy the cluster on Kubernetes?

We have the following 3 fundamental requirements:

The seed controller nodes need to have state semantics with respect to leadership assignment and stickiness of addressable IP addresses/DNS names
The non-seed worker nodes are stateless compute nodes that need to be elastic (be able to auto-scale in/out based on compute load)
A load-balanced API service to facilitate pricing job submissions via REST

The
application is containerized using Docker,
packaged using Helm and
the containers managed and orchestrated with Kubernetes leveraging Google Kubernetes Engine
(GKE), using the following core constructs:

StatefulSets - Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods
Deployment/ReplicaSets – Manages the deployment and scaling of a homogeneous set of stateless pods in parallel
Service - An abstraction which defines a logical set of Pods and a policy by which to access them, through the Endpoints API that reflects updates in the state of associated Pods

These requirements map onto the constructs quite naturally, as follows:

The controller (ve-ctrl) seed nodes are deployed as StatefulSets as an ordered set with stable (across pod (re)scheduling) network co-ordinates
The valengine (ve) workers are deployed as ReplicaSets managed through a Deployment decorated with semantics for autoscaling in parallel (as against sequentially, as is the case with StatefulSets) based on compute load
The controller pods are fronted by a K8s Service exposing the REST API to submit compute jobs
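Once deployed (see 'To Try this at Home' below), these objects can be inspected with kubectl. The label selectors here are assumptions for illustration; the actual labels are defined in the Helm chart:

$ kubectl get statefulsets,deployments,services
$ kubectl get pods -l app=ve-ctrl   # controller seed pods: ve-ctrl-0, ve-ctrl-1 (labels assumed)
$ kubectl get pods -l app=ve        # stateless valengine workers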
So far so good. Now how do we access the service and submit pricing and valuation jobs?

Diving a little deeper into the service aspects – our application's K8s Service
is of type ClusterIP making it reachable only from within the K8s cluster on a
cluster-internal private IP. Now
we can externally expose the service using GKE’s load balancer by using the
LoadBalancer service type instead and manage (load balancing, routing, SSL termination
etc.) external access through Ingress.
Ingress is a beta K8s resource and GKE provides an ingress
controller to realize the ingress semantics. Deploying
and managing a set of services benefits from the realization of a service mesh, as well articulated here, and necessitates a richer set of native features and responsibilities that can broadly be organized at two levels:

The Data Plane – Service co-located and service specific proxy responsible for service discovery, health checks, routing, rate-limiting, load balancing, retries, auth, observability/tracing of request/responses etc.
The Control Plane – Responsible for centralized policy, configuration definition and co-ordination of multiple stateless isolated service specific Data Planes and their abstraction as a singular distributed system

In
our application, we employ a K8s-native, provider (GKE, AKS, EKS etc.) agnostic API Gateway with these functions incorporated, using Envoy as the L7 sidecar proxy through an open source, K8s-native manifestation in the form of Ambassador.

Envoy abstracts the application service from the Data Plane networking functions described above and from the network infrastructure in general. Ambassador facilitates the Control Plane by providing constructs for service definition/configuration that speak the native K8s API language, and by generating a semantic intermediate representation to apply the configurations onto Envoy. Ambassador also acts as the ‘ingress controller’, thereby enabling external access to the service.

Some of the other technologies one might consider:

Data Plane - Traefik, NGINX, HAProxy, Linkerd
Control Plane - Istio, Contour (Ingress Controller)
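Coming back to Ambassador: with the 0.x releases, a route such as the 've-svc' prefix used later in this article is declared as a Mapping embedded in an annotation on a K8s Service. The sketch below is only an assumption of what such a spec might look like for this application (the Service name ve-ctrl-svc, port 9090 and selector are hypothetical; the actual definition lives in k8s/ambassador/ambassador-ve-svc.yaml in the repo):

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: ve-ctrl-svc
  annotations:
    getambassador.io/config: |
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: ve_svc_mapping
      prefix: /ve-svc/
      service: ve-ctrl-svc:9090
spec:
  selector:
    app: ve-ctrl
  ports:
  - port: 9090
EOF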
Finally, the application compute cluster and service mesh (Ambassador and Envoy) are monitored with Prometheus, deployed using an operator pattern - an application-specific controller that extends the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of a Kubernetes user - through the prometheus-operator, which makes the Prometheus configuration K8s native and helps manage the Prometheus cluster. We deploy the prometheus-operator using Helm, which in turn deploys the complete toolkit including AlertManager and Grafana with in-built dashboards for K8s. The metrics from Ambassador and Envoy are captured using the Prometheus StatsD Exporter, fronting it with a Service (StatsD Service) and leveraging the ServiceMonitor custom resource definition (CRD) provided by the prometheus-operator to help monitor and abstract configuration to the StatsD exporting service for Envoy.

Tech Stack

Kubernetes (GKE) 1.10
Helm 2.11
Prometheus 2.6
Envoy 1.8
Ambassador 0.40.2
Akka 2.x

To Try this at Home

All application and infrastructure code is in the following repo: https://github.com/amolthacker/hwx-pe-k8s-akka

Pre-requisites: GCP access with Compute and Kubernetes Engine, and the following client utilities set up: gcloud, kubectl, helm

Clone the repo
Update `scripts/env.sh` with GCP region and zone details
Bootstrap the infrastructure with the following script:

$ ./scripts/setup-cluster.sh

Once the script completes you should see the above stack up and running and ready to use.

Controller seed node (ve-ctrl) pods
Worker node (ve) pods - base state

Behind the Scenes

The script automates the following tasks:

Enables GCP Compute and Container APIs

$ gcloud services enable compute.googleapis.com
$ gcloud services enable container.googleapis.com

Creates a GKE cluster with auto-scaling enabled

$ gcloud container clusters create ${CLUSTER_NAME} --preemptible --zone ${GCLOUD_ZONE} --scopes cloud-platform --enable-autoscaling --min-nodes 2 --max-nodes 6 --num-nodes 2

Configures kubectl and defines bindings for RBAC

$ gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${GCLOUD_ZONE}
$ kubectl create clusterrolebinding ambassador --clusterrole=cluster-admin --user=${CLUSTER_ADMIN} --serviceaccount=default:ambassador

Sets up Helm

Deploys the application `hwxpe` using Helm

$ helm install --name hwxpe helm/hwxpe
$ kubectl autoscale deployment ve --cpu-percent=50 --min=2 --max=6

This uses the pre-built image amolthacker/ce-akka. Optionally one can rebuild the application with Maven:

$ mvn clean package

and use the new image regenerated as part of the build leveraging the docker-maven-plugin. To build the application w/o generating the Docker image:

$ mvn clean package -DskipDocker

Deploys the Prometheus infrastructure

$ helm install stable/prometheus-operator --name prometheus-operator --namespace monitoring

Deploys Ambassador (Envoy)

$ kubectl apply -f k8s/ambassador/ambassador.yaml
$ kubectl apply -f k8s/ambassador/ambassador-svc.yaml
$ kubectl apply -f k8s/ambassador/ambassador-monitor.yaml
$ kubectl apply -f k8s/prometheus/prometheus.yaml

Adds the application service route mapping to Ambassador

$ kubectl apply -f k8s/ambassador/ambassador-ve-svc.yaml

And finally displays the Ambassador-facilitated service's external IP and forwards ports for the Prometheus UI, Grafana and Ambassador to make them accessible from the local browser.

We are now ready to submit pricing jobs to the compute engine using its REST API.

Job Submission

To submit valuation batches, simply grab the external IP displayed and call the script specifying the batch size as follows:

As you can see, the script uses external IP 35.194.69.254 associated with the API Gateway (Ambassador/Envoy) fronting the application service and route prefix 've-svc' as provided in the Ambassador service spec above. The service routes the request to the backing controller seed node pods (ve-ctrl) which in turn divide and distribute the valuation batch across the set of worker pods (ve) as follows:

Controller Seed Node [ve-ctrl] logs
Worker Node [ve] logs

As the compute load increases, one would see the CPU usage spike beyond the set threshold and, as a result, the compute cluster autoscale out (eg: from 2 worker nodes to 4 worker nodes in this case) and then scale back in (eg: from 4 back to the desired min of 2 in this case).

Worker node [ve] pods - scaled out
Autoscaling event log

Concluding

Thank you for your patience and bearing with the lengthy article ... Hope the prototype walkthrough provided a glimpse of the exciting aspects of building cloud native services and applications !!

Credits and References

https://kubernetes.io/
https://www.getambassador.io/user-guide/getting-started/
https://www.getambassador.io/reference/statistics#prometheus
https://blog.envoyproxy.io/service-mesh-data-plane-vs-control-plane-2774e720f7fc
https://docs.helm.sh/using_helm
https://github.com/prometheus/statsd_exporter
12-11-2018
04:24 PM
6 Kudos
This is the first of two articles (see Part 2) where we will look at Kubernetes (K8s) as the container orchestration and scheduling platform for a grid-like architectural variation of the
distributed pricing engine discussed
here.
Kubernetes
@ Hortonworks
But wait … before we dive in … why are
we talking about K8s here?
Well, if you missed KubeCon
+ CloudNativeCon, find out more!
Hortonworks
joined CNCF with an increased commitment to cloud native solutions
for big data management.
Learn
more about the Open Hybrid Architecture Initiative
led by Hortonworks to
enable consistent data architectures on-prem and/or across cloud providers and
stay tuned to discover more about how K8s is used across the portfolio of
Hortonworks products – DPS, HDF and HDP to propel big data’s cloud native
journey!
The
Application Architecture
With that said, this prototype is a
stateless compute application – a variation of the
Dockerized
Spark on YARN version described here
- that comprises a set of
Dockerized microservices written in Go and AngularJS orchestrated with
Kubernetes as depicted below:
The application has two components:
compute-engine
A simple client-server application written
in Go. eodservice simulates the client that
submits valuation requests to a grid of remote compute engine (server)
instances - valengine. The valengine responds to requests by doing static
pricing compute using
QuantLib library
through a
Java
wrapper library
- computelib. The eodservice and valengine
communicate with each other using
Kite RPC (from Koding).
eodservice breaks
job submissions into batches with a max size of 100 (pricing requests) and
the computelib prices them in parallel.
compute-manager
A web-based management user interface for
the compute-engine that helps visualize the K8s cluster, pod
deployment and state and facilitates the following operational tasks:

Submission of valuation jobs to the compute engine
Scaling of compute
Aggregated compute engine log stream

The backend is written in Go and the frontend with AngularJS.
The
App on K8s
So how is this deployed, scheduled
and orchestrated with K8s?
To begin with, the compute-engine's
docker image builds
on top of a
pre-baked
image
of the computelib
FROM amolthacker/mockcompute-base
LABEL maintainer="amolthacker@gmail.com"
# Paths
ENV COMPUTE_BASE /hwx-pe/compute/
# Copy lib and scripts
RUN mkdir -p $COMPUTE_BASE
ADD . $COMPUTE_BASE
RUN chmod +x $COMPUTE_BASE/compute.sh
RUN cp $COMPUTE_BASE/compute.sh /usr/local/bin/.
RUN cp $COMPUTE_BASE/mockvalengine-0.1.0.jar /usr/local/lib/.
# Start Engine
ENTRYPOINT go run $COMPUTE_BASE/valengine.go
# Ports: 6000 (RPC) | 8000 (HTTP-Health)
EXPOSE 6000 8000
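To sanity-check the image locally before handing it over to K8s, one might build and run it directly with Docker (the image tag below is illustrative):

$ docker build -t amolthacker/compute-engine:latest .
$ docker run -d -p 6000:6000 -p 8000:8000 amolthacker/compute-engine:latest   # 6000 RPC, 8000 HTTP health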
The stateless valengine pods are compute-engine Docker
containers exposing ports 6000 for RPC comm and 8000 (HTTP) for health check that
are rolled out with the ReplicaSets controller through a Deployment and fronted by a Service (valengine-prod) of type LoadBalancer so as to expose the Service using the cloud platform’s load balancer – in this case Azure Container Service (ACS) with the Kubernetes orchestrator. This way the jobs are
load balanced across the set of available valengine pods.
Note: ACS has recently been deprecated and is pending retirement in favor of
aks-engine for IaaS Kubernetes
deployments and
Azure
Kubernetes Service (AKS)
for a managed PaaS option.
Additionally, the Deployment is
configured with a Horizontal Pod Autoscaler (HPA) for auto-scaling the valengine pods
(between 2 and 8 pods) once a certain CPU utilization threshold is reached.
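The same HPA can also be created imperatively; a sketch assuming the Deployment is named valengine and a 50% CPU target (the actual name and threshold are defined in the repo's manifests and may differ):

$ kubectl autoscale deployment valengine --cpu-percent=50 --min=2 --max=8
$ kubectl get hpa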
One can use the compute-manager’s web interface to:

submit compute batches
stream logs on the web console and
watch the engines auto-scale based on the compute workload
The Demo - https://youtu.be/OFkEZKlHIbg

The demo will:

First go through deployment specifics
Submit a bunch of compute jobs
Watch the pods auto-scale out
See the subsequent job submissions balanced across the scaled-out compute engine grid
See the compute engine grid scale back in after a period of reduced activity
The Details

Prerequisites

Azure CLI tool (az)
K8s CLI tool (kubectl)

First, we automate the provisioning of K8s infrastructure on Azure through an ARM template:

$ az group deployment create -n hwx-pe-k8s-grid-create -g k8s-pe-grid --template-file az-deploy.json --parameters @az-deploy.parameters.json

Point kubectl to this K8s cluster:

$ az acs kubernetes get-credentials --ssh-key-file ~/.ssh/az --resource-group=k8s-pe-grid --name=containerservice-k8s-pe-grid

Start the proxy:

$ nohup kubectl proxy 2>&1 < /dev/null &

Deploy the K8s cluster – pods, ReplicaSet, Deployment and Service using kubectl:

$ kubectl apply -f k8s/
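A quick way to verify the rollout and grab the external endpoint of the load-balanced service (valengine-prod being the Service mentioned above):

$ kubectl get deployments,replicasets,pods
$ kubectl get svc valengine-prod -o wide   # EXTERNAL-IP of the cloud load balancer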
Details on K8s configuration here.

We then employ the compute-manager and the tooling and library dependencies it uses to:

Stream logs from the valengine pods onto the web console
Talk to the K8s cluster using v1 APIs through the Go client
Submit compute jobs to the compute engine’s eodservice

The compute-manager uses Go’s concurrent constructs such as goroutines and channels

func handleSubmitJob(w http.ResponseWriter, r *http.Request) {
metric := mux.Vars(r)["metric"]
if !contains(metric, SUPPORTED_METRICS) {
http.Error(w, metric+" is not a supported metric", http.StatusInternalServerError)
return
}
numTrades := mux.Vars(r)["numTrades"]
id := rand.Intn(1000)
idStr := strconv.Itoa(id)
now := time.Now()
jobInfo := JobInfo{idStr, metric, numTrades, now, now, "Running"}
addJobInfo(&jobInfo)
nT, _ := strconv.Atoi(numTrades)
batchSize := 100
// fan out: one goroutine per full batch of 100 pricing requests; the remainder is submitted below
for nT > batchSize {
nT = nT - batchSize
go runJob(idStr, metric, strconv.Itoa(batchSize))
}
go runJob(idStr, metric, strconv.Itoa(nT))
w.Write([]byte("OK"))
}
// cMapIter walks the shared (mutex-protected) job map and streams each JobInfo over a channel
func cMapIter() <-chan *JobInfo {
c := make(chan *JobInfo)
f := func() {
cMap.Lock()
defer cMap.Unlock()
var jobs []*JobInfo
for _, job := range cMap.item {
jobs = append(jobs, job)
c <- job
}
close(c)
}
go f()
return c
}
for concurrent asynchronous job submission, job status updates and service health checks. For more, check out the repo.

Coming Up Next …

Hope you enjoyed a very basic introduction to stateless micro-services with Kubernetes and a celebration of Hortonworks’ participation in CNCF. Next, in Part 2, we will look at an alternative architecture for this compute function leveraging some of the more nuanced Kubernetes features and the broader cloud native stack to facilitate proxying, API management and autoscaling with Google Kubernetes Engine (GKE).
07-16-2018
01:54 PM
7 Kudos
A distributed compute engine for pricing financial derivatives using QuantLib with Spark running in Docker containers on YARN with HDP 3.0.
Introduction
Modern financial trading and risk platforms employ compute engines for pricing and risk analytics across different asset classes (equities, fixed income, FX etc.) to drive real-time trading decisions and quantitative risk management. Pricing financial instruments involves a range of algorithms from simple cashflow discounting to more analytical methods using stochastic processes such as Black-Scholes for options pricing to more computationally intensive numerical methods such as finite differences, Monte Carlo and Quasi Monte Carlo techniques depending on the instrument being priced – bonds, stocks or their derivatives – options, swaps etc. and the pricing (NPV, Rates etc) and risk (DV01, PV01, higher order greeks such as gamma, vega etc.) metrics being calculated. Quantitative finance libraries, typically written in low level programming languages such as C, C++ leverage efficient data structures and parallel programming constructs to realize the potential of modern multi-core CPU, GPU architectures, and even specialized hardware in the form of FPGAs and ASICs for high performance compute of pricing and risk metrics.
Quantitative and regulatory risk management and reporting imperatives such as valuation adjustment calculations XVA (CVA, DVA, FVA, KVA, MVA) for FRTB, CCAR and DFAST in the US or MiFID in Europe for instance, necessitate valuation of portfolios of millions of trades across tens of thousands of scenario simulations and aggregation of computed metrics across a vast number and combination of dimensions – a data-intensive distributed computing problem that can benefit from:
Distributed compute frameworks such as Apache Spark and Hadoop that offer scale-out, shared-nothing, fault-tolerant, data-parallel architectures that are more portable and have more palatable, easier-to-use APIs compared to HPC frameworks such as MPI, OpenMP etc.
Elasticity and operational efficiencies of cloud computing especially with burst compute semantics for these use cases augmented by the use of OS virtualization through containers and lean DevOps practices.
In this article, we will look to capture the very essence of the problem space discussed above through a trivial implementation of a compute engine for pricing financial derivatives. It combines the parallel programming facilities of QuantLib, an open source library for quantitative finance, embedded in the distributed computing framework Apache Spark, running in an OS-virtualized environment through Docker containers, with Apache Hadoop YARN as the resource scheduler, provisioned, orchestrated and managed in an OpenStack private cloud through Hortonworks Cloudbreak – all through a singular platform in the form of HDP 3.0.
Pricing Semantics
The engine leverages QuantLib to compute:
Spot Price of a Forward Rate Agreement (FRA) using the library’s yield term structure based on flat interpolation of forward rates
NPV of a vanilla fixed-float (6M/EURIBOR-6M) 5yr Interest Rate Swap (IRS)
NPV of a European Equity Put Option averaged over multiple algorithmic calculations (Black-Scholes, Binomial, Monte-Carlo)
The requisite inputs for these calculations – yield curves, fixings, instrument definitions etc. are statically incorporated for the purpose of demonstration.
Dimensional aggregation of these computed metrics for a portfolio of trades is trivially simulated by performing this calculation for a specified number of N times (portfolio size of N trades) and computing a mean.
Technical Details
The engine basically exploits the embarrassingly parallel nature of the problem around independent parallel pricing tasks leveraging Apache Spark in its map phase and computing mean as the trivial reduction operation.
A more real-world consideration of the pricing engine would also benefit from distributed in-memory facilities of Spark for shared ancillary datasets such as market (quotes), derived market (curves), reference (term structures, fixings etc.) and trade data as inputs for pricing, typically subscribed to from the corresponding data services using persistent structures leveraging data-locality in HDFS.
These pricing tasks essentially call QuantLib library functions that can range from scalars to parallel algorithms discussed above to price trades. Building, installing and managing the library and associated dependencies across a cluster of 100s of nodes can be cumbersome and challenging. What if now one wants to run regression tests on a newer version of the library in conjunction with the current?
OS virtualization through Docker is a great way to address these operational challenges around hardware and OS portability, build automation, multi-version dependency management, packaging and isolation and integration with server-side infrastructure typically running on JVMs and of course more efficient higher density utilization of the cluster resources.
Here, we use a Docker image with QuantLib over a layer of lightweight CentOS using Java language bindings through SWIG.
With Hadoop 3.1 in HDP 3.0, YARN’s support for LinuxContainerExecutor beyond the classic DefaultContainerExecutor facilitates running multiple container runtimes – the DefaultLinuxContainerRuntime and now the DockerLinuxContainerRuntime side by side. Docker containers running QuantLib with Spark executors, in this case, are scheduled and managed across the cluster by YARN using its DockerLinuxContainerRuntime.
This facilitates consistency of resource management and scheduling across container runtimes and lets Dockerized applications take full advantage of all of the YARN RM & scheduling aspects including Queues, ACLs, fine-grained sharing policies, powerful placement policies etc.
Also, when running Spark executors in Docker containers on YARN, YARN automatically mounts the base libraries and any other requested libraries into the containers.
For more details, I encourage you to read these awesome blogs on containerization in Apache Hadoop YARN 3.1 and containerized Apache Spark on YARN 3.1.
This entire infrastructure is provisioned on OpenStack private cloud using Cloudbreak 2.7.0 which first automates the installation of Docker CE 18.03 on CentOS 7.4 VMs and then the installation of HDP 3.0 cluster with Apache Hadoop 3.1 and Apache Spark 2.3 using the new shiny Apache Ambari 2.7.0
Tech Stack
Hortonworks Cloudbreak 2.7.0
Apache Ambari 2.7.0
Hortonworks Data Platform 3.0.0 (Apache Hadoop 3.1, Apache Spark 2.3)
Docker 18.03.1-ce on CentOS 7.4
QuantLib 1.9.2
Please Do Try This at Home
Ensure you have access to a Hortonworks Cloudbreak 2.7.0 instance. You can set one up locally by using this project: https://github.com/amolthacker/hwx-local-cloudbreak.
Please refer to the documentation to meet the prerequisites and setup credentials for the desired cloud provider.
Clone the repo: https://github.com/amolthacker/hwx-pricing-engine
Update the following as desired:
Infrastructure definition under cloudbreak/clusters/openstack/hwx-field-cloud/hwx-pe-hdp3.json. Ensure you refer to the right Cloudbreak base image
Ambari blueprint under cloudbreak/blueprints/hwx-pe-hdp3.json
Now upload the following to your Cloudbreak instance:
Ambari Blueprint: cloudbreak/blueprints/hwx-pe-hdp3.json
Recipe to install Docker CE with custom Docker daemon settings: cloudbreak/recipes/pre-ambari/install-docker.sh
Now execute the following using Cloudbreak CLI to provision the cluster:
cb cluster create --cli-input-json cloudbreak/clusters/openstack/hwx-field-cloud/hwx-pe-hdp3.json --name hwx-pe
This will first instantiate a cluster using the cluster definition JSON and the referenced base image, download packages for Ambari and HDP, install Docker (a pre-requisite to running Dockerized apps on YARN) and setup the DB for Ambari and Hive using the recipes and then install HDP 3 using the Ambari blueprint.
Once the cluster is built, you should be able to log into Ambari to verify
Now, we will configure the YARN NodeManager to run the LinuxContainerExecutor in non-secure mode, just for demonstration purposes, so that all Docker containers scheduled by YARN run as the ‘nobody’ user. A Kerberized cluster with cgroups enabled is recommended for production.
Enable Docker Runtime for YARN
Update yarn-site.xml and container-executor.cfg as follows:
A few configurations to note here:
Setting yarn.nodemanager.container-executor.class to use LinuxContainerExecutor
Setting min.user.id to a value (50) less than user id of user ‘nobody’ (99)
Mounting /etc/passwd in read-only mode into the Docker containers to expose the spark user
Adding requisite Docker registries to the trusted list
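Putting those notes together, the relevant fragments would look roughly like the following. Property values here are illustrative; the authoritative list is in the containerization blogs referenced at the end:

yarn-site.xml:
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>

container-executor.cfg:
min.user.id=50
[docker]
module.enabled=true
docker.trusted.registries=library,amolthacker
docker.allowed.ro-mounts=/etc/passwd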
Now restart YARN
SSH into the cluster gateway node and download the following from repo:
compute-engine-spark-1.0.0.jar
$ wget https://github.com/amolthacker/hwx-pricing-engine/raw/master/compute-engine-spark-1.0.0.jar
compute-price.sh
$ wget https://github.com/amolthacker/hwx-pricing-engine/blob/master/compute/scripts/compute-price.sh
Notice the directives around using Docker as the executor environment for Spark on YARN in client mode.
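Those directives boil down to spark-submit conf entries that select the Docker runtime for the YARN containers. The sketch below is an approximation of what compute-price.sh drives; the driver class, image tag and mounts are illustrative, the actual script is in the repo:

$ spark-submit --master yarn --deploy-mode client \
    --class com.amolthacker.compute.PricingEngine \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=amolthacker/mockcompute-base \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
    compute-engine-spark-1.0.0.jar OptionPV 5000 20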
You should now be ready to simulate a distributed pricing compute using the following command:
./compute-price.sh <metric> <numTrades> <numSplits>
where metric:
FwdRate: Spot Price of Forward Rate Agreement (FRA)
NPV: Net Present Value of a vanilla fixed-float Interest Rate Swap (IRS)
OptionPV: Net Present Value of a European Equity Put Option averaged over multiple algorithmic calcs (Black-Scholes, Binomial, Monte Carlo)
eg: ./compute-price.sh OptionPV 5000 20
And see the job execute as follows:
Wrapping up …
HDP 3.0 is pretty awesome right !!! It went GA on Friday and I can tell you, if a decade ago you thought Hadoop was exciting, this will blow your mind away!!
In this article here, we’ve just scratched the surface and looked at only one of the myriad compute centric aspects of innovation in the platform. For a more detailed read on platform capabilities, direction and unbound possibilities I urge you to read the blog series from folks behind this.
References
https://www.quantlib.org/
https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/
https://hortonworks.com/blog/containerized-apache-spark-yarn-apache-hadoop-3-1/
05-18-2018
12:57 AM
6 Kudos
Continuing with the theme of one-click installs from the last post, this time we will see how one can provision an HDP 2.6.4 cluster equipped with features for data governance and security through Apache Atlas and Apache Ranger using Cloudbreak in private/public cloud, again of course, with a single command!

To begin with, you’d need:

Access to a supported cloud provider (AWS, Azure, GCP, OpenStack) with the following resources configured / identified for the cluster deployment: Region / Availability zone, Virtual Private Network / Security Group, VM types / sizes identified for master and worker roles, SSH keypair
Access to a Hortonworks Cloudbreak instance setup with credentials for the cloud provider(s) of choice [eg: OpenStack, AWS]
Cloudbreak CLI installed on your machine, in path and configured pointing to the Cloudbreak instance above

You can use this
tool to setup Cloudbreak with CLI configured locally on your
workstation with just a couple of commands. Once this is done, you should have
your Cloudbreak instance setup with the credentials. You should also be able to use the Cloudbreak
CLI to talk to the Cloudbreak instance and in turn the cloud provider with
their respective credentials. Now, clone this repo. Update
cloud configurations under: cloudbreak/clusters/aws/hwx/hwx-aws-dm.json for AWS
cloudbreak/clusters/openstack/hwx-field-cloud/hwx-os-dm.json for OpenStack
especially: general.credentialName // credential to use as configured in Cloudbreak
tags // tags for billing, ops and audit
placement // region and availability zone
network // configured virtual private network
instanceGroups.template.instanceType // instance types
instanceGroups.template.securityGroup // configured security group associated with the network
stackAuthentication // configured SSH key details
Setup Cloudbreak and Create Cluster

If running for the first time, run:

./scripts/import-artifacts-n-create-cluster.sh <CLOUD>   [where CLOUD => 'openstack' or 'aws']

This will first import the blueprint (under cloudbreak/blueprints) and recipes (under cloudbreak/recipes) into Cloudbreak and then create the cluster.

If the blueprints and recipes have already been imported, run:

./scripts/create-cluster.sh <CLOUD>

You should now see the cluster being provisioned ... I also initiated provisioning of the cluster in AWS:

./scripts/create-cluster.sh aws

After about 15-20 mins … You
have HDP clusters setup with Atlas, Ranger and their dependencies including
HBase and Kafka, all equipped with features for data governance and security!!

A quick peek behind the scenes!

The bulk of the provisioning automation is done through the Ambari blueprint - in this case, cloudbreak/blueprints/hdp26-data-mgmnt.json. One of the first steps the provisioning script does is import this custom Ambari blueprint into Cloudbreak using the CLI - visible in the catalog alongside 3 built-in ones. Ambari blueprints are the declarative definition of your cluster, defining the host groups and
which components to install on which host group. Ambari uses them as a base for
your clusters.

In general, the blueprint to be used with Cloudbreak should include a few standard elements. The blueprint for this setup has 2 host_groups
– master and worker
for the respective service roles of all the components. The
configurations section of the blueprint defines mostly the non-default override
properties for various services especially Atlas and Ranger. Refer
to the docs for more on Ambari support for Ranger
and Atlas.
You can also look at sample
blueprints for shared services.

To automate provisioning of more nuanced clusters, you can manually build the clusters through the Ambari installation wizard and export the blueprint. However, the “blueprint_name” is not included in the export. You must add it before the blueprint can be used by Cloudbreak. You can use these Gists for exporting and adding blueprint_name to the exported blueprint.
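For reference, the blueprint of an existing cluster can be pulled straight off the Ambari REST API and then edited to add the blueprint_name (host and credentials below are placeholders):

$ curl -u admin:admin -H "X-Requested-By: ambari" \
    "http://<ambari-host>:8080/api/v1/clusters/<cluster-name>?format=blueprint" \
    -o exported-blueprint.json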
This setup uses Postgres as the RDBMS required to store metadata and state information for Ambari and other services like Hive, Ranger etc. (You can update the blueprint to use other supported RDBMSes.) In order for Ranger to use this Postgres instance, one needs to create a DB and a DB user for Ranger and add authentication information in pg_hba.conf. This is automated using a ‘pre-ambari-start’ recipe under recipes/ranger-db-setup.sh that runs on the Ambari server node group.
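The essence of such a recipe is a handful of psql commands plus a pg_hba.conf entry. A minimal sketch, assuming a local Postgres instance (DB name, user, password and paths are illustrative; the real script is in the repo under cloudbreak/recipes):

#!/bin/bash
# create a DB and a DB user for Ranger (names/credentials illustrative)
sudo -u postgres psql -c "CREATE USER rangerdba WITH PASSWORD 'rangerdba';"
sudo -u postgres psql -c "CREATE DATABASE ranger OWNER rangerdba;"
# allow password authentication for the ranger DB, then reload Postgres
echo "host ranger rangerdba 0.0.0.0/0 md5" >> /var/lib/pgsql/data/pg_hba.conf
systemctl reload postgresql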
Recipes are scripts that run on selected node groups before or after cluster installation to help automate scenarios not handled by Ambari blueprints. See here for more on Cloudbreak recipes.

The provisioning script imported
these recipes into Cloudbreak using the CLI, making them available for
automation of pre/post cluster install steps. The
2 other recipes you see are:

ranger-db-setup-aws: The AWS counterpart of the ranger-db-setup recipe with customizations for the Amazon Linux image
restart-atlas: Restarts Atlas post install for completion

The last set of artifacts that the provisioning script imports into Cloudbreak are
the cluster configurations, in this case, for OpenStack and AWS under: cloudbreak/clusters/aws/hwx/hwx-aws-dm.json for AWS
cloudbreak/clusters/openstack/hwx-field-cloud/hwx-os-dm.json for OpenStack with the following details: general.credentialName // credential to use as configured in Cloudbreak
tags // tags for billing, ops and audit
placement // region and availability zone
network // configured virtual private network
instanceGroups.template.instanceType // instance types
instanceGroups.template.securityGroup // configured security group associated with the network
stackAuthentication // configured SSH key details Once all the
artifacts are in place the provisioning script uses the following CLI command
to initiate cluster install:

cb cluster create --cli-input-json $config_file --name $cluster_name

To Wrap Up …

Hope
you enjoyed the article and see how Apache Ambari and Hortonworks Cloudbreak
facilitate devops and automation for provisioning complex Hadoop/Spark clusters
equipped with enterprise features for governance, security and operations in public/private
cloud environments.
05-17-2018
11:32 PM
5 Kudos
Disclaimer! Yes, despite some great
articles around this topic by the finest folks here at Hortonworks, this
is YALC [Yet Another article on spinning up Local HDP Clusters]. But before you switch off, this article will show how to
spin up HDP/HDF HA, secure clusters, even using customizable OS images, on your
local workstation by just executing a single command!! … Glad this caught your attention and you’re still around!

Quick Peek!

For starters, just to get a taste, once you have ensured you are on a workstation that has at least 4 cores and 8 GB RAM and open internet connectivity, and once you have the environment setup with the pre-requisites:

1. Execute the following in a terminal:

git clone https://github.com/amolthacker/hwx-local-cluster && cd hwx-local-cluster && ./scripts/create-n-provision.sh

2. Grab a coffee as this could take some time (unfortunately the script does not make coffee for you … yet)

3. Once the script’s completed, you should have a basic HDP 2.6.4 cluster on CentOS 7.4 running on 4 VirtualBox VMs (1 gateway, 1 master and 2 workers) managed by Vagrant, equipped with core Hadoop, Hive and Spark, all managed by Ambari!

A Look Behind the Scenes

So, how did we do this? First, you’d need to setup your machine environment with the following:

VirtualBox
Vagrant with hostmanager plugin
Packer
Ansible
Git client
python | pip | virtualenv

Environment Tested:

macOS High Sierra 10.13.4
python 2.7.14 | pip 9.0.3 | virtualenv 15.2.0
Git 2.16.3
VirtualBox 5.2.8
Vagrant 2.0.3
Packer v1.2.2
Ansible 2.3.0.0
At a very high level, the tool essentially uses just a bunch
of wrapper scripts that provide the plumbing and automate the process of
spinning up a configurable local cluster of virtual machines running over
either VirtualBox or VMWare Fusion type 2 hypervisors using a Vagrant base
box I’ve created and make these VMs available as a static inventory
to the powerful
ansible-hortonworks (AH) tool system that does the heavy lifting of
automating the provisioning of the cluster via Ambari. Shout out to the folks behind the AH project. It automates provisioning of HDP/HDF HA secure clusters using Ansible
on AWS, Azure, GCP, OpenStack or even on an existing inventory of physical /
virtual machines with support for multiple OS. Before proceeding, I would
encourage you to acquire a working knowledge of this project. The script create-n-provision runs the tool in its Basic mode where it starts
with a base box running CentOS 7.4 that has been baked and hardened with all
the pre-requisites needed for any node running HDP/HDF services. The script
then uses Vagrant to spin up a cluster driven by a configuration file that
helps define parameters such as the no. of gateway, master and worker VMs, CPU/memory configuration, network CIDR and VM name prefixes. The machines created would have names with the following convention: <prefix>-<role>-<num> where,

prefix: VM name prefix specified by the user
role: gateway, master or worker
num: 1 – N, N being the no. of machines in a given role

eg: hdp-master-1
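Once the VMs are up, Vagrant can confirm the naming convention; the output below is abridged and illustrative for the default 1 gateway / 1 master / 2 worker layout:

$ vagrant status
hdp-gateway-1    running (virtualbox)
hdp-master-1     running (virtualbox)
hdp-worker-1     running (virtualbox)
hdp-worker-2     running (virtualbox)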
The cluster of VMs thus created is then provisioned with my
fork of the ansible-hortonworks (AH) project that has minor
customizations to help with a static inventory of Vagrant boxes. AH under the
hood validates the readiness of the boxes with the pre-requisites and installs
any missing artifacts, bootstraps Ambari and then installs and starts rest of
the services across the cluster. The service assignment plan including the
Ambari server and agents is driven by the dynamic blueprint template under ansible-hortonworks-staging.

For the more adventurous!

One can also run the tool in its Advanced mode through the bake-create-n-provision script wherein, instead of using my Vagrant
base box, one can bake their own base box starting with a minimal OS ISO
(CentOS supported so far) and then continue with the process of cluster
creation and provisioning through the steps outlined above. The base box is
baked using Packer and Ansible provisioner. One can customize and extend the hwx-base
Ansible role to harden the image as per one’s needs to do things like adding
additional packages, tools and libraries, add certs especially if you are
looking to run this on a workstation from within an enterprise network, hooking
up with AD etc. The code to bake and provision the base box re-uses a lot of
the bits from the packer-rhel7
project. Take a look at the repo for
further details.

To Wrap Up …

This might seem like an overly complicated way of getting
multi-node HDP/HDF clusters running on your local machines. However, the tool essentially provides you with the necessary plumbing and automation to help create a lot more nuanced and customizable clusters with just a couple of commands. It allows you to create clusters with:

Services running in HA mode
Secure configurations for both HDP and HDF
Customizations that can be purely declarative and configuration driven
Custom baked OS images – images that could be either enterprise hardened or pre-baked with certain application-specific or environment-specific pre-requisites - for the choice of your local hypervisor – VirtualBox, VMWare Fusion

The image baking process can be extended to OS choices other than CentOS 7 and to public/private cloud hypervisors (AWS, Azure, GCP and OpenStack) as supported by the AH project. This now gives you the ability to locally provision miniaturized replicas of your prod/non-prod clusters in your datacenter or private/public clouds using a consistent provisioning mechanism and code base.