01-03-2019
12:16 AM
2 Kudos
This is the second article in a two-part series (see Part 1) where we look at yet another architectural variation of the distributed pricing engine discussed here, this time leveraging a P2P cluster design through the Akka framework, deployed on Kubernetes with a native API Gateway using Ambassador (which in turn leverages Envoy, an L4/L7 proxy) and monitoring infrastructure powered by Prometheus.

The Cloud Native Journey

In the previous article, we just scratched the surface of one of the top-level CNCF projects, Kubernetes, the core container management, orchestration and scheduling platform. In this article, we will take our cloud native journey to the next level by exploring more nuanced aspects of Kubernetes (support for stateful applications through StatefulSets, Services and Ingress), including the use of Helm, the package manager for Kubernetes. We will also look at two other top-level CNCF projects, Prometheus and Envoy, and how they form the fundamental building blocks of any production-viable cloud native computing stack.

Application and Infrastructure Architecture
We employ a peer-to-peer cluster design for the distributed compute engine, leveraging the Akka Cluster framework. Without going into much detail about how the framework works (see the documentation for more), it essentially provides a fault-tolerant, decentralized P2P cluster membership service through a gossip protocol, with no single point of failure. Cluster nodes need to register themselves with an ordered set of seed nodes (primary/secondaries) in order to discover the other nodes. This requires the seed nodes to have state semantics with respect to leadership assignment and stickiness of addressable IP addresses/DNS names.

A pair of nodes (for redundancy) is designated as seed nodes acting as the controller (ve-ctrl), which:

- Provides contact points for the valengine (ve) worker nodes to join the cluster
- Exposes a RESTful service (using Akka HTTP) for job submission from an HTTP client
- Dispatches jobs to the registered ve workers

The other, non-seed cluster nodes act as stateless workers that are routed compute job requests and perform the actual pricing work. This facilitates elasticity of the compute function, as worker nodes join and leave the cluster depending on the compute load.
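To make the cluster-membership mechanics a bit more concrete, here is a minimal, hypothetical sketch of a node joining such a cluster (assuming Akka 2.6 with Artery remoting); the actor-system name, seed addresses and port below are illustrative placeholders and not taken from the repo.

import akka.actor.{Actor, ActorSystem, Address, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, MemberEvent, MemberUp, UnreachableMember}
import com.typesafe.config.ConfigFactory

// Tracks membership events; the controller would use these to know which
// ve workers are currently available to receive pricing jobs.
class ClusterListener extends Actor {
  private val cluster = Cluster(context.system)
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[UnreachableMember])
  override def postStop(): Unit = cluster.unsubscribe(self)
  def receive: Receive = {
    case MemberUp(member) => println(s"Member up: ${member.address}")
    case event            => println(s"Cluster event: $event")
  }
}

object ClusterBootstrap {
  def main(args: Array[String]): Unit = {
    // Hostname/port are placeholders; on K8s these would come from the pod's
    // stable StatefulSet DNS name and the configured remoting port.
    val config = ConfigFactory.parseString(
      """
        |akka.actor.provider = cluster
        |akka.remote.artery.canonical.hostname = "127.0.0.1"
        |akka.remote.artery.canonical.port = 2551
      """.stripMargin).withFallback(ConfigFactory.load())

    val system = ActorSystem("ve-cluster", config)

    // Ordered seed-node list: the controller (ve-ctrl) pods act as the contact points.
    Cluster(system).joinSeedNodes(List(
      Address("akka", "ve-cluster", "ve-ctrl-0", 2551),
      Address("akka", "ve-cluster", "ve-ctrl-1", 2551)))

    system.actorOf(Props(new ClusterListener), "cluster-listener")
  }
}

In the actual deployment, the seed addresses resolve to the stable DNS names provided by the ve-ctrl StatefulSet, which is exactly why the controllers need sticky network identities.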
So, how do we deploy the cluster on Kubernetes? We have three fundamental requirements:

- The seed controller nodes need state semantics with respect to leadership assignment and stickiness of addressable IP addresses/DNS names
- The non-seed worker nodes are stateless compute nodes that need to be elastic (able to auto-scale in/out based on compute load)
- A load-balanced API service is needed to facilitate pricing job submissions via REST

The application is containerized using Docker, packaged using Helm, and the containers are managed and orchestrated with Kubernetes leveraging Google Kubernetes Engine (GKE), using the following core constructs:
- StatefulSets - Manage the deployment and scaling of a set of Pods, and provide guarantees about the ordering and uniqueness of these Pods
- Deployment/ReplicaSets - Manage the deployment and scaling of a homogeneous set of stateless Pods in parallel
- Service - An abstraction that defines a logical set of Pods and a policy by which to access them, through the Endpoints API, reflecting updates in the state of the associated Pods

These constructs address our requirements quite naturally, as follows:
- The controller (ve-ctrl) seed nodes are deployed as a StatefulSet: an ordered set of Pods with stable (across Pod (re)scheduling) network coordinates
- The valengine (ve) workers are deployed as ReplicaSets managed through a Deployment, decorated with semantics for autoscaling in parallel (as against sequentially, as is the case with StatefulSets) based on compute load
- The controller Pods are fronted by a K8s Service exposing the REST API used to submit compute jobs
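As an illustration of the job-submission endpoint such a Service might front, here is a hypothetical Akka HTTP sketch (assuming Akka HTTP 10.2+); the path, payload handling and the workerRouter actor are assumptions made for illustration, not the repo's actual API, which is exposed later through Ambassador under the 've-svc' route prefix.

import akka.actor.{ActorRef, ActorSystem}
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

object JobSubmissionApi {
  // 'workerRouter' stands in for a cluster-aware router (or the controller actor)
  // that dispatches pricing work to the registered ve worker nodes.
  def route(workerRouter: ActorRef): Route =
    pathPrefix("jobs") {
      post {
        entity(as[String]) { jobSpec =>
          workerRouter ! jobSpec // fire-and-forget dispatch to the workers
          complete("submitted")
        }
      }
    }

  def start(workerRouter: ActorRef)(implicit system: ActorSystem): Unit = {
    // Bind on all interfaces so the fronting K8s Service can reach the pod.
    Http().newServerAt("0.0.0.0", 8080).bind(route(workerRouter))
  }
}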
So far so good. Now, how do we access the service and submit pricing and valuation jobs? Diving a little deeper into the service aspects: our application's K8s Service is of type ClusterIP, making it reachable only from within the K8s cluster on a cluster-internal private IP. We could externally expose the service using GKE's load balancer by using the LoadBalancer service type instead, and manage external access (load balancing, routing, SSL termination etc.) through Ingress. Ingress is a beta K8s resource, and GKE provides an ingress controller to realize the ingress semantics.
Deploying and managing a set of services benefits from the realization of a service mesh (as well articulated here), and necessitates a richer set of native features and responsibilities that can broadly be organized at two levels:

- The Data Plane: a service-co-located, service-specific proxy responsible for service discovery, health checks, routing, rate limiting, load balancing, retries, auth, and observability/tracing of requests/responses
- The Control Plane: responsible for centralized policy and configuration definition, and for coordinating the multiple stateless, isolated, service-specific Data Planes and abstracting them as a singular distributed system

In our application, we employ a K8s-native, provider-agnostic (GKE, AKS, EKS etc.) API Gateway with these functions incorporated, using Envoy as the L7 sidecar proxy through an open source, K8s-native manifestation in the form of Ambassador. Envoy abstracts the application service from the Data Plane networking functions described above and from the network infrastructure in general. Ambassador facilitates the Control Plane by providing constructs for service definition/configuration that speak the native K8s API language, and by generating a semantic intermediate representation to apply those configurations onto Envoy. Ambassador also acts as the 'ingress controller', thereby enabling external access to the service.

Some of the other technologies one might consider:

- Data Plane: Traefik, NGINX, HAProxy, Linkerd
- Control Plane: Istio, Contour (ingress controller)
Finally, the application compute cluster and the service mesh (Ambassador and Envoy) are monitored with Prometheus, deployed using the operator pattern - an application-specific controller that extends the Kubernetes API to create, configure and manage instances of complex stateful applications on behalf of a Kubernetes user - through the prometheus-operator, which makes the Prometheus configuration K8s-native and helps manage the Prometheus cluster. We deploy the prometheus-operator using Helm, which in turn deploys the complete toolkit, including AlertManager and Grafana with built-in dashboards for K8s. The metrics from Ambassador and Envoy are captured by using the Prometheus StatsD Exporter, fronting it with a Service (StatsD Service) and leveraging the ServiceMonitor custom resource definition (CRD) provided by the prometheus-operator to monitor and abstract the configuration of the StatsD exporting service for Envoy.

Tech Stack

Kubernetes (GKE) 1.10
Helm 2.11
Prometheus 2.6
Envoy 1.8
Ambassador 0.40.2
Akka 2.x

To Try this at Home

All application and infrastructure code is in the following repo: https://github.com/amolthacker/hwx-pe-k8s-akka

Prerequisites: GCP access with Compute and Kubernetes Engine enabled, and the following client utilities set up: gcloud, kubectl, helm

1. Clone the repo
2. Update `scripts/env.sh` with GCP region and zone details
3. Bootstrap the infrastructure with the following script:

$ ./scripts/setup-cluster.sh

Once the script completes, you should have the above stack up and running and ready to use.

[Screenshot: Controller seed node (ve-ctrl) pods]
[Screenshot: Worker node (ve) pods - base state]

Behind the Scenes

The script automates the following tasks:

Enables the GCP Compute and Container APIs:

$ gcloud services enable compute.googleapis.com
$ gcloud services enable container.googleapis.com

Creates a GKE cluster with auto-scaling enabled:

$ gcloud container clusters create ${CLUSTER_NAME} --preemptible --zone ${GCLOUD_ZONE} --scopes cloud-platform --enable-autoscaling --min-nodes 2 --max-nodes 6 --num-nodes 2

Configures kubectl and defines bindings for RBAC:

$ gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${GCLOUD_ZONE}
$ kubectl create clusterrolebinding ambassador --clusterrole=cluster-admin --user=${CLUSTER_ADMIN} --serviceaccount=default:ambassador

Sets up Helm.

Deploys the application `hwxpe` using Helm:

$ helm install --name hwxpe helm/hwxpe
$ kubectl autoscale deployment ve --cpu-percent=50 --min=2 --max=6

This uses the pre-built image amolthacker/ce-akka. Optionally, one can rebuild the application with Maven:

$ mvn clean package

and use the new image regenerated as part of the build, leveraging the docker-maven-plugin. To build the application without generating a Docker image:

$ mvn clean package -DskipDocker

Deploys the Prometheus infrastructure:

$ helm install stable/prometheus-operator --name prometheus-operator --namespace monitoring

Deploys Ambassador (Envoy):

$ kubectl apply -f k8s/ambassador/ambassador.yaml
$ kubectl apply -f k8s/ambassador/ambassador-svc.yaml
$ kubectl apply -f k8s/ambassador/ambassador-monitor.yaml
$ kubectl apply -f k8s/prometheus/prometheus.yaml

Adds the application service route mapping to Ambassador:

$ kubectl apply -f k8s/ambassador/ambassador-ve-svc.yaml

And finally, displays the external IP of the Ambassador-fronted service and forwards ports for the Prometheus UI, Grafana and Ambassador to make them accessible from the local browser.

We are now ready to submit pricing jobs to the compute engine using its REST API.

Job Submission

To submit valuation batches, simply grab the external IP displayed and call the submission script, specifying the batch size as follows:

As you can see, the script uses the external IP 35.194.69.254 associated with the API Gateway (Ambassador/Envoy) fronting the application service, and the route prefix 've-svc' as provided in the Ambassador service spec above. The service routes the request to the backing controller seed node pods (ve-ctrl), which in turn divide and distribute the valuation batch across the set of worker pods (ve) as follows:

[Screenshot: Controller seed node (ve-ctrl) logs]
[Screenshot: Worker node (ve) logs]

As the compute load increases, you will see the CPU usage spike beyond the set threshold; as a result, the compute cluster autoscales out (e.g. from 2 worker nodes to 4 in this case) and then scales back in (e.g. from 4 back to the desired minimum of 2) after a period of reduced activity.

[Screenshot: Worker node (ve) pods - scaled out]
[Screenshot: Autoscaling event log]

Concluding

Thank you for your patience and for bearing with the lengthy article. Hope the prototype walkthrough provided a glimpse of the exciting aspects of building cloud native services and applications!!

Credits and References

https://kubernetes.io/
https://www.getambassador.io/user-guide/getting-started/
https://www.getambassador.io/reference/statistics#prometheus
https://blog.envoyproxy.io/service-mesh-data-plane-vs-control-plane-2774e720f7fc
https://docs.helm.sh/using_helm
https://github.com/prometheus/statsd_exporter
12-11-2018
04:24 PM
6 Kudos
This is the first of two articles (see Part 2) where we will look at Kubernetes (K8s) as the container orchestration and scheduling platform for a grid-like architectural variation of the distributed pricing engine discussed here.

Kubernetes @ Hortonworks

But wait ... before we dive in ... why are we talking about K8s here? Well, if you missed KubeCon + CloudNativeCon, find out more! Hortonworks joined CNCF with an increased commitment to cloud native solutions for big data management. Learn more about the Open Hybrid Architecture Initiative led by Hortonworks to enable consistent data architectures on-prem and/or across cloud providers, and stay tuned to discover more about how K8s is used across the portfolio of Hortonworks products (DPS, HDF and HDP) to propel big data's cloud native journey!

The Application Architecture

With that said, this prototype is a stateless compute application - a variation of the Dockerized Spark on YARN version described here - that comprises a set of Dockerized microservices written in Go and AngularJS, orchestrated with Kubernetes as depicted below:
The application has two components:

compute-engine

A simple client-server application written in Go. eodservice simulates the client that submits valuation requests to a grid of remote compute engine (server) instances - valengine. The valengine responds to requests by doing static pricing compute using the QuantLib library through a Java wrapper library - computelib. The eodservice and valengine communicate with each other using Kite RPC (from Koding). eodservice breaks job submissions into batches with a maximum size of 100 (pricing requests), and the computelib prices them in parallel.

compute-manager

A web-based management user interface for the compute-engine that helps visualize the K8s cluster, pod deployment and state, and facilitates the following operational tasks:

- Submission of valuation jobs to the compute engine
- Scaling of compute
- Aggregated compute engine log streaming

The backend is written in Go and the frontend with AngularJS.
The App on K8s

So how is this deployed, scheduled and orchestrated with K8s? To begin with, the compute-engine's Docker image builds on top of a pre-baked image of the computelib:
FROM amolthacker/mockcompute-base
LABEL maintainer="amolthacker@gmail.com"
# Paths
ENV COMPUTE_BASE /hwx-pe/compute/
# Copy lib and scripts
RUN mkdir -p $COMPUTE_BASE
ADD . $COMPUTE_BASE
RUN chmod +x $COMPUTE_BASE/compute.sh
RUN cp $COMPUTE_BASE/compute.sh /usr/local/bin/.
RUN cp $COMPUTE_BASE/mockvalengine-0.1.0.jar /usr/local/lib/.
# Start Engine
ENTRYPOINT go run $COMPUTE_BASE/valengine.go
# Ports: 6000 (RPC) | 8000 (HTTP-Health)
EXPOSE 6000 8000
The stateless valengine pods are compute-engine Docker containers exposing port 6000 for RPC and port 8000 (HTTP) for health checks. They are rolled out with the ReplicaSet controller through a Deployment and fronted by a Service (valengine-prod) of type LoadBalancer, so as to expose the Service using the cloud platform's load balancer - in this case Azure Container Service (ACS) with the Kubernetes orchestrator. This way, jobs are load balanced across the set of available valengine pods.

Note: ACS has recently been deprecated and is pending retirement in favor of aks-engine for IaaS Kubernetes deployments and Azure Kubernetes Service (AKS) for a managed PaaS option.

Additionally, the Deployment is configured with a Horizontal Pod Autoscaler (HPA) to auto-scale the valengine pods (between 2 and 8 pods) once a certain CPU utilization threshold is reached.
One can use the compute-manager's web interface to:

- submit compute batches
- stream logs on the web console, and
- watch the engines auto-scale based on the compute workload

The Demo - https://youtu.be/OFkEZKlHIbg

The demo will:
- First go through deployment specifics
- Submit a bunch of compute jobs
- Watch the pods auto-scale out
- See the subsequent job submissions balanced across the scaled-out compute engine grid
- See the compute engine grid scale back in after a period of reduced activity

The Details

Prerequisites:

- Azure CLI tool (az)
- K8s CLI tool (kubectl)

First, we automate the provisioning of K8s infrastructure on Azure through an ARM template:

$ az group deployment create -n hwx-pe-k8s-grid-create -g k8s-pe-grid --template-file az-deploy.json --parameters @az-deploy.parameters.json

Point kubectl to this K8s cluster:

$ az acs kubernetes get-credentials --ssh-key-file ~/.ssh/az --resource-group=k8s-pe-grid --name=containerservice-k8s-pe-grid

Start the proxy:

$ nohup kubectl proxy 2>&1 < /dev/null &

Deploy the K8s cluster - Pods, ReplicaSet, Deployment and Service - using kubectl:

$ kubectl apply -f k8s/
Details on the K8s configuration are here. We then employ the compute-manager and the tooling and library dependencies it uses to:

- stream logs from the valengine pods onto the web console
- talk to the K8s cluster using v1 APIs through the Go client
- submit compute jobs to the compute engine's eodservice

The compute-manager uses Go's concurrent constructs, such as goroutines and channels, for concurrent asynchronous job submission, job status updates and service health checks:

func handleSubmitJob(w http.ResponseWriter, r *http.Request) {
metric := mux.Vars(r)["metric"]
if !contains(metric, SUPPORTED_METRICS) {
http.Error(w, metric+" is not a supported metric", http.StatusInternalServerError)
return
}
numTrades := mux.Vars(r)["numTrades"]
id := rand.Intn(1000)
idStr := strconv.Itoa(id)
now := time.Now()
jobInfo := JobInfo{idStr, metric, numTrades, now, now, "Running"}
addJobInfo(&jobInfo)
nT, _ := strconv.Atoi(numTrades)
batchSize := 100
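// Fan out the request in batches of at most batchSize pricing tasks, one goroutine
// per full batch; the final runJob call below prices the remaining (partial) batch.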
for nT > batchSize {
nT = nT - batchSize
go runJob(idStr, metric, strconv.Itoa(batchSize))
}
go runJob(idStr, metric, strconv.Itoa(nT))
w.Write([]byte("OK"))
}
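// cMapIter streams the currently tracked jobs over a channel for concurrent consumers;
// note that the job-map lock is held until all jobs have been sent and the channel is closed.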
func cMapIter() <-chan *JobInfo {
c := make(chan *JobInfo)
f := func() {
cMap.Lock()
defer cMap.Unlock()
var jobs []*JobInfo
for _, job := range cMap.item {
jobs = append(jobs, job)
c <- job
}
close(c)
}
go f()
return c
}
For more, check out the repo.

Coming Up Next ...

Hope you enjoyed this very basic introduction to stateless microservices with Kubernetes and a celebration of Hortonworks' participation in CNCF. Next, in Part 2, we will look at an alternative architecture for this compute function, leveraging some of the more nuanced Kubernetes features and the broader cloud native stack to facilitate proxying, API management and autoscaling with Google Kubernetes Engine (GKE).
07-16-2018
01:54 PM
7 Kudos
A distributed compute engine for pricing financial derivatives using QuantLib with Spark running in Docker containers on YARN with HDP 3.0.
Introduction
Modern financial trading and risk platforms employ compute engines for pricing and risk analytics across different asset classes (equities, fixed income, FX etc.) to drive real-time trading decisions and quantitative risk management. Pricing financial instruments involves a range of algorithms from simple cashflow discounting to more analytical methods using stochastic processes such as Black-Scholes for options pricing to more computationally intensive numerical methods such as finite differences, Monte Carlo and Quasi Monte Carlo techniques depending on the instrument being priced – bonds, stocks or their derivatives – options, swaps etc. and the pricing (NPV, Rates etc) and risk (DV01, PV01, higher order greeks such as gamma, vega etc.) metrics being calculated. Quantitative finance libraries, typically written in low level programming languages such as C, C++ leverage efficient data structures and parallel programming constructs to realize the potential of modern multi-core CPU, GPU architectures, and even specialized hardware in the form of FPGAs and ASICs for high performance compute of pricing and risk metrics.
Quantitative and regulatory risk management and reporting imperatives such as valuation adjustment calculations XVA (CVA, DVA, FVA, KVA, MVA) for FRTB, CCAR and DFAST in the US or MiFID in Europe for instance, necessitate valuation of portfolios of millions of trades across tens of thousands of scenario simulations and aggregation of computed metrics across a vast number and combination of dimensions – a data-intensive distributed computing problem that can benefit from:
- Distributed compute frameworks such as Apache Spark and Hadoop, which offer scale-out, shared-nothing, fault-tolerant and data-parallel architectures that are more portable and have more palatable, easier-to-use APIs compared to HPC frameworks such as MPI, OpenMP etc.
- The elasticity and operational efficiencies of cloud computing, especially with burst-compute semantics for these use cases, augmented by the use of OS virtualization through containers and lean DevOps practices.
In this article, we look to capture the very essence of the problem space discussed above through a trivial implementation of a compute engine for pricing financial derivatives. It combines the parallel-programming facilities of QuantLib, an open source library for quantitative finance, embedded in a distributed computing framework, Apache Spark, running in an OS-virtualized environment through Docker containers, with Apache Hadoop YARN as the resource scheduler, provisioned, orchestrated and managed in an OpenStack private cloud through Hortonworks Cloudbreak - all through a singular platform in the form of HDP 3.0.
Pricing Semantics
The engine leverages QuantLib to compute:
Spot Price of a Forward Rate Agreement (FRA) using the library’s yield term structure based on flat interpolation of forward rates
NPV of a vanilla fixed-float (6M/EURIBOR-6M) 5yr Interest Rate Swap (IRS)
NPV of a European Equity Put Option averaged over multiple algorithmic calculations (Black-Scholes, Binomial, Monte-Carlo)
The requisite inputs for these calculations (yield curves, fixings, instrument definitions etc.) are statically incorporated for the purpose of demonstration.
Dimensional aggregation of these computed metrics for a portfolio of trades is trivially simulated by performing this calculation for a specified number of N times (portfolio size of N trades) and computing a mean.
Technical Details
The engine basically exploits the embarrassingly parallel nature of the problem around independent parallel pricing tasks leveraging Apache Spark in its map phase and computing mean as the trivial reduction operation.
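To make that shape concrete, here is a rough, illustrative Spark driver sketch (not the repo's code): priceTrade is a stand-in for the QuantLib-backed calculations, and the job simply maps N independent pricing tasks across partitions and reduces them to a mean, mirroring the metric/numTrades/numSplits arguments used later by compute-price.sh.

import org.apache.spark.sql.SparkSession

object PricingJobSketch {
  // Stand-in for a QuantLib-backed pricing call (FwdRate, NPV, OptionPV, ...);
  // the real engine invokes the library through its Java/SWIG bindings.
  def priceTrade(tradeId: Long, metric: String): Double =
    scala.util.Random.nextDouble()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compute-engine-sketch").getOrCreate()
    val sc = spark.sparkContext

    val metric    = if (args.length > 0) args(0) else "OptionPV"
    val numTrades = if (args.length > 1) args(1).toLong else 5000L
    val numSplits = if (args.length > 2) args(2).toInt else 20

    // Map phase: each partition prices its share of the portfolio independently.
    val prices = sc.range(0L, numTrades, step = 1L, numSlices = numSplits)
      .map(id => priceTrade(id, metric))

    // Trivial reduction: mean of the computed metric across the portfolio.
    println(s"$metric mean over $numTrades trades = ${prices.mean()}")

    spark.stop()
  }
}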
A more real-world consideration of the pricing engine would also benefit from distributed in-memory facilities of Spark for shared ancillary datasets such as market (quotes), derived market (curves), reference (term structures, fixings etc.) and trade data as inputs for pricing, typically subscribed to from the corresponding data services using persistent structures leveraging data-locality in HDFS.
These pricing tasks essentially call QuantLib library functions, ranging from simple scalar computations to the parallel algorithms discussed above, to price trades. Building, installing and managing the library and its associated dependencies across a cluster of hundreds of nodes can be cumbersome and challenging. What if one now wants to run regression tests on a newer version of the library alongside the current one?
OS virtualization through Docker is a great way to address these operational challenges around hardware and OS portability, build automation, multi-version dependency management, packaging and isolation and integration with server-side infrastructure typically running on JVMs and of course more efficient higher density utilization of the cluster resources.
Here, we use a Docker image with QuantLib over a layer of lightweight CentOS using Java language bindings through SWIG.
With Hadoop 3.1 in HDP 3.0, YARN’s support for LinuxContainerExecutor beyond the classic DefaultContainerExecutor facilitates running multiple container runtimes – the DefaultLinuxContainerRuntime and now the DockerLinuxContainerRuntime side by side. Docker containers running QuantLib with Spark executors, in this case, are scheduled and managed across the cluster by YARN using its DockerLinuxContainerRuntime.
This facilitates consistency of resource management and scheduling across container runtimes and lets Dockerized applications take full advantage of all of the YARN RM & scheduling aspects including Queues, ACLs, fine-grained sharing policies, powerful placement policies etc.
Also, when running Spark executors in Docker containers on YARN, YARN automatically mounts the base libraries and any other requested libraries into the containers.
For more details, I encourage you to read these awesome blogs on containerization in Apache Hadoop YARN 3.1 and containerized Apache Spark on YARN 3.1.
This entire infrastructure is provisioned on an OpenStack private cloud using Cloudbreak 2.7.0, which first automates the installation of Docker CE 18.03 on CentOS 7.4 VMs and then the installation of an HDP 3.0 cluster with Apache Hadoop 3.1 and Apache Spark 2.3 using the shiny new Apache Ambari 2.7.0.
Tech Stack
Hortonworks Cloudbreak 2.7.0
Apache Ambari 2.7.0
Hortonworks Data Platform 3.0.0 (Apache Hadoop 3.1, Apache Spark 2.3)
Docker 18.03.1-ce on CentOS 7.4
QuantLib 1.9.2
Please Do Try This at Home
Ensure you have access to a Hortonworks Cloudbreak 2.7.0 instance. You can set one up locally by using this project: https://github.com/amolthacker/hwx-local-cloudbreak.
Please refer to the documentation to meet the prerequisites and setup credentials for the desired cloud provider.
Clone the repo: https://github.com/amolthacker/hwx-pricing-engine
Update the following as desired:
Infrastructure definition under cloudbreak/clusters/openstack/hwx-field-cloud/hwx-pe-hdp3.json. Ensure you refer to the right Cloudbreak base image.
Ambari blueprint under cloudbreak/blueprints/hwx-pe-hdp3.json
Now upload the following to your Cloudbreak instance:
Ambari Blueprint: cloudbreak/blueprints/hwx-pe-hdp3.json
Recipe to install Docker CE with custom Docker daemon settings: cloudbreak/recipes/pre-ambari/install-docker.sh
Now execute the following using Cloudbreak CLI to provision the cluster:
cb cluster create --cli-input-json cloudbreak/clusters/openstack/hwx-field-cloud/hwx-pe-hdp3.json --name hwx-pe
This will first instantiate a cluster using the cluster definition JSON and the referenced base image, download packages for Ambari and HDP, install Docker (a prerequisite for running Dockerized apps on YARN) and set up the DB for Ambari and Hive using the recipes, and then install HDP 3 using the Ambari blueprint.
Once the cluster is built, you should be able to log into Ambari to verify
Now, we will configure the YARN NodeManager to run the LinuxContainerExecutor in non-secure mode, just for demonstration purposes, so that all Docker containers scheduled by YARN will run as the 'nobody' user. A Kerberized cluster with cgroups enabled is recommended for production.
Enable Docker Runtime for YARN
Update yarn-site.xml and container-executor.cfg accordingly. A few configurations to note here:

- Setting yarn.nodemanager.container-executor.class to use the LinuxContainerExecutor
- Setting min.user.id to a value (50) less than the user id of the 'nobody' user (99)
- Mounting /etc/passwd in read-only mode into the Docker containers to expose the spark user
- Adding the requisite Docker registries to the trusted list
Now restart YARN
SSH into the cluster gateway node and download the following from the repo:
compute-engine-spark-1.0.0.jar
$ wget https://github.com/amolthacker/hwx-pricing-engine/raw/master/compute-engine-spark-1.0.0.jar
compute-price.sh
$ wget https://github.com/amolthacker/hwx-pricing-engine/blob/master/compute/scripts/compute-price.sh
Notice the directives around using Docker as executor env for Spark on YARN in client mode.
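For reference, here is a hedged sketch of the kind of Spark settings such a script passes for the YARN Docker runtime; the image name is a placeholder, and the authoritative flags live in compute-price.sh in the repo.

import org.apache.spark.SparkConf

object DockerizedExecutorConf {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("compute-engine-spark")
      // Ask YARN to launch executors with its Docker container runtime.
      .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
      .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", "<quantlib-docker-image>")
      // /etc/passwd is mounted read-only to expose the spark user, as configured above.
      .set("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS", "/etc/passwd:/etc/passwd:ro")
    conf.getAll.foreach { case (k, v) => println(s"$k=$v") }
  }
}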
You should now be ready to simulate a distributed pricing compute using the following command:
./compute-price.sh <metric> <numTrades> <numSplits>
where metric is one of:

FwdRate: Spot Price of a Forward Rate Agreement (FRA)
NPV: Net Present Value of a vanilla fixed-float Interest Rate Swap (IRS)
OptionPV: Net Present Value of a European Equity Put Option averaged over multiple algorithmic calcs (Black-Scholes, Binomial, Monte Carlo)

eg: ./compute-price.sh OptionPV 5000 20
And see the job execute as follows:
Wrapping up …
HDP 3.0 is pretty awesome, right?! It went GA on Friday and I can tell you, if a decade ago you thought Hadoop was exciting, this will blow your mind away!!
In this article, we've just scratched the surface and looked at only one of the myriad compute-centric aspects of innovation in the platform. For a more detailed read on the platform's capabilities, direction and unbound possibilities, I urge you to read the blog series from the folks behind it.
References
https://www.quantlib.org/
https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/
https://hortonworks.com/blog/containerized-apache-spark-yarn-apache-hadoop-3-1/
05-18-2018
12:57 AM
6 Kudos
Continuing with the theme of one-click installs from the last post, this time we will see how one can provision an HDP 2.6.4 cluster equipped with features for data governance and security through Apache Atlas and Apache Ranger, using Cloudbreak in private/public cloud - again, of course, with a single command!

To begin with, you'd need:

- Access to a supported cloud provider (AWS, Azure, GCP, OpenStack) with the following resources configured/identified for the cluster deployment:
  - Region / availability zone
  - Virtual private network / security group
  - VM types/sizes identified for master and worker roles
  - SSH keypair
- Access to a Hortonworks Cloudbreak instance set up with credentials for the cloud provider(s) of choice [eg: OpenStack, AWS]
- Cloudbreak CLI installed on your machine, in the path, and configured to point to the Cloudbreak instance above

You can use this tool to set up Cloudbreak with the CLI configured locally on your workstation with just a couple of commands. Once this is done, you should have your Cloudbreak instance set up with the credentials as follows. You should also be able to use the Cloudbreak CLI to talk to the Cloudbreak instance and, in turn, the cloud provider with their respective credentials.

Now, clone this repo.
Update the cloud configurations under:

cloudbreak/clusters/aws/hwx/hwx-aws-dm.json for AWS
cloudbreak/clusters/openstack/hwx-field-cloud/hwx-os-dm.json for OpenStack

especially:

general.credentialName // credential to use as configured in Cloudbreak
tags // tags for billing, ops and audit
placement // region and availability zone
network // configured virtual private network
instanceGroups.template.instanceType // instance types
instanceGroups.template.securityGroup // configured security group associated with the network
stackAuthentication // configured SSH key details
Setup Cloudbreak and Create Cluster

If running for the first time, run:

./scripts/import-artifacts-n-create-cluster.sh <CLOUD>   [where CLOUD => 'openstack' or 'aws']

This will first import the blueprint (under cloudbreak/blueprints) and recipes (under cloudbreak/recipes) into Cloudbreak and then create the cluster.

If the blueprints and recipes have already been imported, run:

./scripts/create-cluster.sh <CLOUD>

You should now see the cluster being provisioned. I also initiated provisioning of the cluster in AWS:

./scripts/create-cluster.sh aws

After about 15-20 mins ... you have HDP clusters set up with Atlas, Ranger and their dependencies, including HBase and Kafka, all equipped with features for data governance and security!!

A quick peek behind the scenes!
The bulk of the provisioning automation is done through the Ambari blueprint - in this case, cloudbreak/blueprints/hdp26-data-mgmnt.json. One of the first things the provisioning script does is import this custom Ambari blueprint into Cloudbreak using the CLI, making it visible in the catalog alongside the 3 built-in ones. Ambari blueprints are the declarative definition of your cluster, defining the host groups and which components to install on which host group. Ambari uses them as a base for your clusters. In general, the blueprint to be used with Cloudbreak should include a few required elements.
The blueprint for this setup has 2 host_groups - master and worker - for the respective service roles of all the components. The configurations section of the blueprint defines mostly the non-default override properties for various services, especially Atlas and Ranger. Refer to the docs for more on Ambari support for Ranger and Atlas. You can also look at sample blueprints for shared services.

To automate provisioning of more nuanced clusters, you can manually build the clusters through the Ambari installation wizard and export the blueprint. However, the "blueprint_name" is not included in the export; you must add it before the blueprint can be used by Cloudbreak. You can use these Gists for exporting and adding blueprint_name to the exported blueprint.
This setup uses Postgres as the RDBMS required to store metadata and state information for Ambari and other services like Hive, Ranger etc. (you can update the blueprint to use other supported RDBMSes). In order for Ranger to use this Postgres instance, one needs to create a DB and a DB user for Ranger and add authentication information in pg_hba.conf. This is automated using a 'pre-ambari-start' recipe under recipes/ranger-db-setup.sh that runs on the Ambari server node group.

Recipes are scripts that run on selected node groups before or after cluster installation to help automate scenarios not handled by Ambari blueprints. See here for more on Cloudbreak recipes. The provisioning script imported these recipes into Cloudbreak using the CLI, making them available for automation of pre/post cluster install steps. The 2 other recipes you see are:

- ranger-db-setup-aws: the AWS counterpart of the ranger-db-setup recipe, with customizations for the Amazon Linux image
- restart-atlas: restarts Atlas post-install for completion
The last set of artifacts that the provisioning script imports into Cloudbreak are the cluster configurations, in this case for OpenStack and AWS under:

cloudbreak/clusters/aws/hwx/hwx-aws-dm.json for AWS
cloudbreak/clusters/openstack/hwx-field-cloud/hwx-os-dm.json for OpenStack

with the following details:

general.credentialName // credential to use as configured in Cloudbreak
tags // tags for billing, ops and audit
placement // region and availability zone
network // configured virtual private network
instanceGroups.template.instanceType // instance types
instanceGroups.template.securityGroup // configured security group associated with the network
stackAuthentication // configured SSH key details

Once all the artifacts are in place, the provisioning script uses the following CLI command to initiate the cluster install:

cb cluster create --cli-input-json $config_file --name $cluster_name

To Wrap Up ...

Hope you enjoyed the article and can see how Apache Ambari and Hortonworks Cloudbreak facilitate DevOps and automation for provisioning complex Hadoop/Spark clusters equipped with enterprise features for governance, security and operations in public/private cloud environments.
05-17-2018
11:32 PM
5 Kudos
Disclaimer! Yes, despite some great articles around this topic by the finest folks here at Hortonworks, this is YALC [Yet Another article on spinning up Local HDP Clusters]. But before you switch off: this article will show how to spin up HDP/HDF HA, secure clusters, even using customizable OS images, on your local workstation by just executing a single command!! ... Glad this caught your attention and you're still around!

Quick Peek!

For starters, just to get a taste, once you have ensured you are on a workstation that has at least 4 cores and 8 GB RAM and open internet connectivity, and once you have the environment set up with the prerequisites:

1. Execute the following in a terminal:

git clone https://github.com/amolthacker/hwx-local-cluster && cd hwx-local-cluster && ./scripts/create-n-provision.sh

2. Grab a coffee, as this could take some time (unfortunately the script does not make coffee for you ... yet)

3. Once the script has completed, you should have a basic HDP 2.6.4 cluster on CentOS 7.4 running on 4 VirtualBox VMs (1 gateway, 1 master and 2 workers) managed by Vagrant, equipped with core Hadoop, Hive and Spark, all managed by Ambari!

A Look Behind the Scenes

So, how did we do this? First, you'd need to set up your machine environment with the following:

- VirtualBox
- Vagrant with the hostmanager plugin
- Packer
- Ansible
- Git client
- python | pip | virtualenv

Environment tested:

macOS High Sierra 10.13.4
python 2.7.14 | pip 9.0.3 | virtualenv 15.2.0
Git 2.16.3
VirtualBox 5.2.8
Vagrant 2.0.3
Packer v1.2.2
Ansible 2.3.0.0
At a very high level, the tool essentially uses a bunch of wrapper scripts that provide the plumbing and automate the process of spinning up a configurable local cluster of virtual machines, running over either VirtualBox or VMware Fusion type-2 hypervisors, using a Vagrant base box I've created, and makes these VMs available as a static inventory to the powerful ansible-hortonworks (AH) tool system that does the heavy lifting of automating the provisioning of the cluster via Ambari.

Shout out to the folks behind the AH project. It automates provisioning of HDP/HDF HA secure clusters using Ansible on AWS, Azure, GCP, OpenStack, or even on an existing inventory of physical/virtual machines, with support for multiple OSes. Before proceeding, I would encourage you to acquire a working knowledge of this project.

The script create-n-provision runs the tool in its Basic mode, where it starts with a base box running CentOS 7.4 that has been baked and hardened with all the prerequisites needed for any node running HDP/HDF services. The script then uses Vagrant to spin up a cluster driven by a configuration file that helps define parameters such as the number of gateway, master and worker VMs, CPU/memory configuration, network CIDR and VM name prefixes. The machines created have names with the following convention:

<prefix>-<role>-<num>

where,
prefix: VM name prefix specified by the user
role: gateway, master or worker
num: 1 - N, N being the number of machines in a given role
eg: hdp-master-1
The cluster of VMs thus created is then provisioned with my fork of the ansible-hortonworks (AH) project, which has minor customizations to help with a static inventory of Vagrant boxes. Under the hood, AH validates the readiness of the boxes against the prerequisites and installs any missing artifacts, bootstraps Ambari, and then installs and starts the rest of the services across the cluster. The service assignment plan, including the Ambari server and agents, is driven by the dynamic blueprint template under ansible-hortonworks-staging.

For the more adventurous!

One can also run the tool in its Advanced mode through the bake-create-n-provision script, wherein, instead of using my Vagrant base box, one can bake one's own base box starting from a minimal OS ISO (CentOS supported so far) and then continue with the process of cluster creation and provisioning through the steps outlined above. The base box is baked using Packer and the Ansible provisioner. One can customize and extend the hwx-base Ansible role to harden the image as per one's needs: adding additional packages, tools and libraries, adding certs (especially if you are looking to run this on a workstation from within an enterprise network), hooking up with AD etc. The code to bake and provision the base box reuses a lot of the bits from the packer-rhel7 project. Take a look at the repo for further details.
To Wrap Up ...

This might seem like an overly complicated way of getting multi-node HDP/HDF clusters running on your local machine. However, the tool essentially provides you with the necessary plumbing and automation to help create much more nuanced and customizable clusters with just a couple of commands. It allows you to create clusters with:

- Services running in HA mode
- Secure configurations for both HDP and HDF
- Customizations that can be purely declarative and configuration driven
- Custom-baked OS images - images that could be either enterprise-hardened or pre-baked with certain application-specific or environment-specific prerequisites - for the local hypervisor of your choice (VirtualBox, VMware Fusion)

The image baking process can be extended to OS choices other than CentOS 7 and to public/private cloud hypervisors (AWS, Azure, GCP and OpenStack) as supported by the AH project. This now gives you the ability to locally provision miniaturized replicas of your prod/non-prod clusters in your datacenter or private/public clouds using a consistent provisioning mechanism and code base.
04-10-2018
07:51 PM
1 Kudo
In the most trivial manner,

val fileNameFromPath: String => String = _.split("/").takeRight(1).head
import org.apache.spark.sql.functions.{udf, input_file_name}
val fileNameUDF = udf(fileNameFromPath)
var df2 = df.withColumn("fileName", fileNameUDF(input_file_name()))
04-10-2018
04:35 PM
1 Kudo
Zack, I would use the input_file_name function to update df with the file name column:

import org.apache.spark.sql.functions.input_file_name
var df2 = df.withColumn("fileName", input_file_name())
04-10-2018
02:17 PM
Bill, LDAP: error code 49 / data 52e indicates an Active Directory (AD) AcceptSecurityContext error, which is returned when the username is valid but the password/credential combination is invalid. You might want to check that the admin credentials (password) you are using are as expected, and probably even the password policies for this user in AD. Since you don't seem to be binding anonymously, I'm assuming you are providing the right manager password. Also, since you are using SSL, I'm assuming the certs are imported fine into the JKS or default JDK keystore. You might also want to ensure the group entry in groups.txt is based off of groupMembershipAttr.