Created on 04-16-2019 03:13 PM
Steps on how to set up YARN to run Docker containers can be found in the Part 1 article.
In this article I will show how to run the Hive components (HiveServer2 and the Metastore) as Docker containers in YARN.
The Metastore will use a MySQL 8 database, also running as a Docker container on the local host.
Prerequisites:
1. Pull the mysql-server image from Docker Hub, run it as a Docker container, then create the hive database and grant privileges to the hive user:
```
docker pull mysql/mysql-server
docker run -d -p 3306:3306 -e MYSQL_ROOT_PASSWORD=admin --restart=always --name mysqld mysql/mysql-server
docker exec -it mysqld bash

bash-4.2# mysql -u root --password=admin
mysql> CREATE DATABASE hive;
mysql> CREATE USER 'hive' IDENTIFIED BY 'hive';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' WITH GRANT OPTION;
```
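As an optional sanity check (assuming the mysqld container above is running), verify that the hive user can log in and see its database:

```shell
# The hive user should be able to authenticate and list the "hive"
# database created in the previous step.
docker exec mysqld mysql -u hive --password=hive -e "SHOW DATABASES;"
```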
2. Create user 'hive' and assign to 'hadoop' group:
```
useradd hive
usermod -aG hadoop hive
```
Dockerize Hive:
1. Create a yum repo file "hdp.repo" that contains HDP-3.1.0.0 and HDP-UTILS-1.1.0.22 repositories:
```
[HDP-3.1.0.0]
name=HDP Version - HDP-3.1.0.0
baseurl=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0
gpgcheck=1
gpgkey=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

[HDP-UTILS-1.1.0.22]
name=HDP-UTILS Version - HDP-UTILS-1.1.0.22
baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/centos7
gpgcheck=1
gpgkey=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1
```
2. Create the Dockerfile:
```
FROM centos:7

ENV JAVA_HOME /usr/lib/jvm/jre-1.8.0-openjdk

COPY hdp.repo /etc/yum.repos.d/
COPY mysql-connector-java-8.0.14-1.el7.noarch.rpm /root/

RUN yum updateinfo \
    && yum install -y sudo java-1.8.0-openjdk-devel hadoop-yarn hadoop-mapreduce hive hive-metastore tez \
    && yum clean all

RUN yum localinstall -y /root/mysql-connector-java-8.0.14-1.el7.noarch.rpm
RUN cp /usr/share/java/mysql-connector-java-8.0.14.jar /usr/hdp/3.1.0.0-78/hive/lib/mysql-connector-java.jar
```
Note: For the Metastore to connect to our MySQL database we need a JDBC connector. I've downloaded it from here and copied the .rpm file to the same directory as my Dockerfile, so it gets installed into the image.
3. Build the image:
```
docker build -t hive .
```
4. Tag the image and push it to the docker local registry:
Tag the image as “<docker registry server>:5000/hive_local”. This creates an additional tag for the existing image. When the first part of the tag is a hostname and port, Docker interprets this as the location of a registry.
```
docker tag hive <docker registry server>:5000/hive_local
docker push <docker registry server>:5000/hive_local
```
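To confirm the push succeeded, the registry can be queried through the standard Docker Registry v2 HTTP API (an optional check, assuming the registry is reachable over plain HTTP):

```shell
# List the repositories the local registry knows about;
# "hive_local" should appear after a successful push.
curl http://<docker registry server>:5000/v2/_catalog
```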
Now that our hive image is created, we will create a YARN Service configuration file (Yarnfile) with all the details of our service.
Deployment:
1. Copy core-site.xml, hdfs-site.xml and yarn-site.xml to hive user dir in HDFS:
```
su - hive
hdfs dfs -copyFromLocal /etc/hadoop/conf/core-site.xml .
hdfs dfs -copyFromLocal /etc/hadoop/conf/hdfs-site.xml .
hdfs dfs -copyFromLocal /etc/hadoop/conf/yarn-site.xml .
```
2. Create the Yarnfile (hive.json):
```
{
  "name": "hive",
  "lifetime": "-1",
  "version": "3.1.0.3.1.0.0",
  "artifact": {
    "id": "<docker registry server>:5000/hive_local",
    "type": "DOCKER"
  },
  "configuration": {
    "env": {
      "HIVE_LOG_DIR": "/var/log/hive",
      "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS": "/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro",
      "HADOOP_HOME": "/usr/hdp/3.1.0.0-78/hadoop"
    },
    "properties": {
      "docker.network": "host"
    },
    "files": [
      {
        "type": "TEMPLATE",
        "dest_file": "/etc/hadoop/conf/core-site.xml",
        "src_file": "core-site.xml"
      },
      {
        "type": "TEMPLATE",
        "dest_file": "/etc/hadoop/conf/yarn-site.xml",
        "src_file": "yarn-site.xml"
      },
      {
        "type": "TEMPLATE",
        "dest_file": "/etc/hadoop/conf/hdfs-site.xml",
        "src_file": "hdfs-site.xml"
      },
      {
        "type": "XML",
        "dest_file": "/etc/hive/conf/hive-site.xml",
        "properties": {
          "hive.zookeeper.quorum": "${CLUSTER_ZK_QUORUM}",
          "hive.zookeeper.namespace": "hiveserver2",
          "hive.server2.zookeeper.publish.configs": "true",
          "hive.server2.support.dynamic.service.discovery": "true",
          "hive.support.concurrency": "true",
          "hive.metastore.warehouse.dir": "/user/${USER}/warehouse",
          "javax.jdo.option.ConnectionUserName": "hive",
          "javax.jdo.option.ConnectionPassword": "hive",
          "hive.server2.enable.doAs": "false",
          "hive.metastore.schema.verification": "true",
          "hive.metastore.db.type": "MYSQL",
          "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
          "javax.jdo.option.ConnectionURL": "jdbc:mysql://<mysql-server docker host>:3306/hive?createDatabaseIfNotExist=true",
          "hive.metastore.event.db.notification.api.auth": "false",
          "hive.metastore.uris": "thrift://hivemetastore-0.${SERVICE_NAME}.${USER}.${DOMAIN}:9083"
        }
      }
    ]
  },
  "components": [
    {
      "name": "hiveserver2",
      "number_of_containers": 1,
      "launch_command": "sleep 25; /usr/hdp/current/hive-server2/bin/hiveserver2",
      "resource": {
        "cpus": 1,
        "memory": "1024"
      },
      "configuration": {
        "files": [
          {
            "type": "XML",
            "dest_file": "/etc/hive/conf/hive-site.xml",
            "properties": {
              "hive.server2.thrift.bind.host": "${COMPONENT_INSTANCE_NAME}.${SERVICE_NAME}.${USER}.${DOMAIN}",
              "hive.server2.thrift.port": "10000",
              "hive.server2.thrift.http.port": "10001"
            }
          }
        ],
        "env": {
          "HADOOP_OPTS": "-Xmx1024m -Xms512m"
        }
      }
    },
    {
      "name": "hivemetastore",
      "number_of_containers": 1,
      "launch_command": "sleep 5; /usr/hdp/current/hive-metastore/bin/schematool -initSchema -dbType mysql; /usr/hdp/current/hive-metastore/bin/hive --service metastore",
      "resource": {
        "cpus": 1,
        "memory": "1024"
      },
      "configuration": {
        "files": [
          {
            "type": "XML",
            "dest_file": "/etc/hive/conf/hive-site.xml",
            "properties": {
              "hive.metastore.uris": "thrift://${COMPONENT_INSTANCE_NAME}.${SERVICE_NAME}.${USER}.${DOMAIN}:9083"
            }
          }
        ],
        "env": {
          "HADOOP_OPTS": "-Xmx1024m -Xms512m"
        }
      }
    }
  ]
}
```
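Since a malformed Yarnfile is a common source of launch failures, it can be worth validating hive.json before deploying (this assumes python is on the PATH):

```shell
# Parse hive.json; prints the error location and exits non-zero if invalid.
python -m json.tool hive.json > /dev/null && echo "hive.json is valid JSON"
```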
3. Deploy application using the YARN Services API:
```
yarn app -launch hive hive.json
```
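After launching, the service state can be followed from the CLI (a hedged example; both components should eventually report a READY state):

```shell
# Pretty-print the current state of the "hive" service.
yarn app -status hive | python -m json.tool
```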
Test access to Hive:
The Registry DNS service that runs on the cluster listens for inbound DNS requests. Those requests are standard DNS requests from users or other DNS servers (for example, DNS servers that have the RegistryDNS service configured as a forwarder). With this set up, we can connect via beeline to the "${COMPONENT_INSTANCE_NAME}.${SERVICE_NAME}.${USER}.${DOMAIN}" hostname, as long as our client is using the corporate DNS server.
More details here
Because in this test we don't have this configured, we need to manually find where the hiveserver2 Docker container is running:
```
curl -X GET 'http://<RM-host>:8088/app/v1/services/hive?user.name=hive' | python -m json.tool
```
In the container information for component hiveserver2-0 we will find:
```
"containers": [
    {
        "bare_host": "<hiveserver2-0 host hostname>",
```
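The lookup can also be scripted. Below is a sketch that extracts bare_host for the hiveserver2 component from the status response (field names as shown above; the RM host is a placeholder):

```shell
# Query the YARN Services API and print the host(s) running hiveserver2.
curl -s 'http://<RM-host>:8088/app/v1/services/hive?user.name=hive' \
  | python -c '
import json, sys
service = json.load(sys.stdin)
for comp in service.get("components", []):
    if comp["name"] == "hiveserver2":
        for container in comp.get("containers", []):
            print(container["bare_host"])
'
```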
On the host where the container is running, connect to Hive via beeline:
```
su - hive
beeline -u "jdbc:hive2://<hostname -I>:10000/default"
```
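Once connected, a quick smoke test confirms that HiveServer2 can reach the Metastore backed by MySQL (the table name is illustrative, and `$(hostname -I | awk '{print $1}')` is one way to pick the host's first IP):

```shell
# Create a throwaway table and list tables; both should succeed
# if the Metastore connection to MySQL is working.
beeline -u "jdbc:hive2://$(hostname -I | awk '{print $1}'):10000/default" \
  -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); SHOW TABLES;"
```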
References:
https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/Configurations.html
http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/RegistryDNS.html
https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html
Files are also available in the following GitHub repo: