Member since: 05-24-2018
Posts: 25
Kudos Received: 7
Solutions: 0
05-02-2019 08:58 PM
Short description: In this article I am going to create a simple Producer that publishes messages (tweets) to a Kafka topic, and a simple Consumer that subscribes to the topic and reads the messages back.

Create the Kafka topic:

./kafka-topics.sh --create --topic 'kafka-tweets' --partitions 3 --replication-factor 3 --zookeeper <zookeeper node:zk port>

Install the necessary packages in your Python project venv:

pip install kafka-python twython
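As an aside (my own sketch, not part of the original article): once kafka-python is installed, the same topic can also be created programmatically with its admin client, assuming the same placeholder broker address used in the code below:

from kafka.admin import KafkaAdminClient, NewTopic

# The admin client talks to a broker directly; no ZooKeeper needed client-side.
admin = KafkaAdminClient(bootstrap_servers='<kafka-broker>:6667')
admin.create_topics([NewTopic(name='kafka-tweets',
                              num_partitions=3,
                              replication_factor=3)])
admin.close()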
Producer:

import json
from kafka import KafkaProducer
from twython import Twython

def main():
    # Load credentials from json file
    with open("twitter_credentials.json", "r") as file:
        creds = json.load(file)
    # Instantiate the Twython client
    python_tweets = Twython(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])
    # Search query
    query = {'q': 'cloudera', 'result_type': 'mixed', 'count': 100}
    # result is a list of tweet dicts
    result = python_tweets.search(**query)['statuses']
    injest_data(result)

To get access to the Twitter API I need to use my credentials, which are stored in "twitter_credentials.json". I then use twython to search for 100 tweets that contain the word "cloudera". The result is a list of Python dicts, which will be the input of injest_data(), where I connect to Kafka and send the messages to the topic "kafka-tweets".

def injest_data(tweets):
    # Serialize each dict to a string via json and encode to bytes via utf-8
    p = KafkaProducer(bootstrap_servers='<kafka-broker>:6667',
                      acks='all',
                      value_serializer=lambda m: json.dumps(m).encode('utf-8'),
                      batch_size=1024)
    for item in tweets:
        p.send('kafka-tweets', value=item)
    # Block until all buffered messages are sent (up to 100s)
    p.flush(100)
    p.close()
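Note that send() is asynchronous and returns a future, so delivery can also be confirmed per message instead of relying only on flush(). A minimal sketch (my addition, using the same placeholder broker):

import json
from kafka import KafkaProducer

p = KafkaProducer(bootstrap_servers='<kafka-broker>:6667',
                  value_serializer=lambda m: json.dumps(m).encode('utf-8'))
# Block on the returned future; raises KafkaError if delivery fails.
metadata = p.send('kafka-tweets', value={'text': 'delivery check'}).get(timeout=10)
print(metadata.topic, metadata.partition, metadata.offset)
p.close()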
Consumer:

import json
from kafka import KafkaConsumer

def consume():
    # Consume latest messages, auto-commit offsets, and decode values from raw bytes to utf-8
    consumer = KafkaConsumer('kafka-tweets',
                             bootstrap_servers=['<kafka-broker>:6667'],
                             value_deserializer=lambda m: json.loads(m.decode('utf-8')),
                             consumer_timeout_ms=10000)
    for message in consumer:
        # The value is already deserialized to a dict; the key is raw bytes (None here)
        print("%s:%d:%d: key=%s value=%s" % (message.topic, message.partition,
                                             message.offset, message.key,
                                             message.value))
    consumer.close()

We are subscribing to the "kafka-tweets" topic and then reading the messages.

Output (1 message):

tweets:0:484: key=None value={u'contributors': None, u'truncated': True, u'text': u'Urgent Requirement for an Infrastructure & Platform Engineer to work with one of our top financial clients!!\nApply\u2026 https://t.co/3pGFbOASGj', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 1124041974875664390, u'favorite_count': 1, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'user_mentions': [], u'hashtags': [], u'urls': [{u'url': u'https://t.co/3pGFbOASGj', u'indices': [120, 143], u'expanded_url': u'https://twitter.com/i/web/status/1124041974875664390', u'display_url': u'twitter.com/i/web/status/1\u2026'}]}, u'in_reply_to_screen_name': None, u'id_str': u'1124041974875664390', u'retweet_count': 5, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'has_extended_profile': False, u'profile_use_background_image': False, u'time_zone': None, u'id': 89827370, u'default_profile': False, u'verified': False, u'profile_text_color': u'000000', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/644912966698037249/unhNPWuL_normal.png', u'profile_sidebar_fill_color': u'000000', u'is_translator': False, u'geo_enabled': True, u'entities': {u'url': {u'urls': [{u'url': u'http://t.co/OJFHBaiwWO', u'indices': [0, 22], u'expanded_url': u'http://www.beach-head.com', u'display_url': u'beach-head.com'}]}, u'description': {u'urls': []}}, u'followers_count': 82, u'protected': False, u'id_str': u'89827370', u'default_profile_image': False, u'listed_count': 8, u'lang': u'en', u'utc_offset': None, u'statuses_count': 2508, u'description': u'Beachhead is a Premier IT recruiting firm based in Toronto, Canada. Follow for exciting opportunities in Financial, Retail and Telecommunication sector.\U0001f600', u'friends_count': 59, u'profile_link_color': u'0570B3', u'profile_image_url': u'http://pbs.twimg.com/profile_images/644912966698037249/unhNPWuL_normal.png', u'notifications': None, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_color': u'000000', u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/89827370/1442594156', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'BeachHead', u'is_translation_enabled': False, u'profile_background_tile': False, u'favourites_count': 19, u'screen_name': u'BeachHeadINC', u'url': u'http://t.co/OJFHBaiwWO', u'created_at': u'Sat Nov 14 00:02:15 +0000 2009', u'contributors_enabled': False, u'location': u'Toronto, Canada', u'profile_sidebar_border_color': u'000000', u'translator_type': u'none', u'following': None}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Thu May 02 20:04:25 +0000 2019', u'in_reply_to_status_id_str': None, u'place': None, u'metadata': {u'iso_language_code': u'en', u'result_type': u'recent'}}

Code available at: https://github.com/PedroAndrade89/kafka_twitter
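Since the value_deserializer already turns each message back into a dict, individual tweet fields can be pulled out directly. A small sketch (my addition, same placeholder broker) that prints only the tweet text:

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer('kafka-tweets',
                         bootstrap_servers=['<kafka-broker>:6667'],
                         value_deserializer=lambda m: json.loads(m.decode('utf-8')),
                         consumer_timeout_ms=10000)
for message in consumer:
    # Each value is a full tweet dict; print just its text field.
    print(message.value.get('text'))
consumer.close()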
04-16-2019 03:13 PM
2 Kudos
Steps on how to set up YARN to run Docker containers can be found in Part 1 of this article series. In this article I will show how to run the Hive components (HiveServer2, Metastore) as Docker containers in YARN. The Metastore will use a MySQL 8 database, also running as a Docker container on a local host.

Pre-requisites:

1. Pull the mysql-server image from Docker Hub, run it as a container, then create the hive database and grant permissions to the hive user:

docker pull mysql/mysql-server
docker run -d -p 3306:3306 -e MYSQL_ROOT_PASSWORD=admin --restart=always --name mysqld mysql/mysql-server
docker exec -it mysqld bash
bash-4.2# mysql -u root --password=admin
mysql> CREATE DATABASE hive;
mysql> CREATE USER 'hive' IDENTIFIED BY 'hive';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' WITH GRANT OPTION;
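Before wiring the Metastore to this database, a quick reachability check can save debugging later. A minimal Python sketch (my addition; the host below is the same placeholder used later in hive.json):

import socket

def port_open(host, port, timeout=3):
    # True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder host: replace with the host running the MySQL container.
print(port_open('<mysql-server docker host>', 3306))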
2. Create the user 'hive' and assign it to the 'hadoop' group:

useradd hive
usermod -aG hadoop hive

Dockerize Hive:

1. Create a yum repo file "hdp.repo" that contains the HDP-3.1.0.0 and HDP-UTILS-1.1.0.22 repositories:

[HDP-3.1.0.0]
name=HDP Version - HDP-3.1.0.0
baseurl=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0
gpgcheck=1
gpgkey=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

[HDP-UTILS-1.1.0.22]
name=HDP-UTILS Version - HDP-UTILS-1.1.0.22
baseurl=http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.22/repos/centos7
gpgcheck=1
gpgkey=http://public-repo-1.hortonworks.com/HDP/centos7/3.x/updates/3.1.0.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

2. Create the Dockerfile:

FROM centos:7
ENV JAVA_HOME /usr/lib/jvm/jre-1.8.0-openjdk
COPY hdp.repo /etc/yum.repos.d/
COPY mysql-connector-java-8.0.14-1.el7.noarch.rpm /root/
RUN yum updateinfo \
&& yum install -y sudo java-1.8.0-openjdk-devel hadoop-yarn hadoop-mapreduce hive hive-metastore tez \
&& yum clean all
RUN yum localinstall -y /root/mysql-connector-java-8.0.14-1.el7.noarch.rpm
RUN cp /usr/share/java/mysql-connector-java-8.0.14.jar /usr/hdp/3.1.0.0-78/hive/lib/mysql-connector-java.jar

Note: For the Metastore to connect to our MySQL database we need a JDBC connector. I downloaded it from here and copied the .rpm file to the same directory as my Dockerfile, so it is installed on the image.

3. Build the image:

docker build -t hive .

4. Tag the image and push it to the local Docker registry:

Tag the image as "<docker registry server>:5000/hive_local". This creates an additional tag for the existing image. When the first part of the tag is a hostname and port, Docker interprets it as the location of a registry.

docker tag hive <docker registry server>:5000/hive_local
docker push <docker registry server>:5000/hive_local

Now that our Hive image is created, we will create a YARN Service configuration file (Yarnfile) with all the details of our service.

Deployment:

1. Copy core-site.xml, hdfs-site.xml and yarn-site.xml to the hive user's directory in HDFS:

su - hive
hdfs dfs -copyFromLocal /etc/hadoop/conf/core-site.xml .
hdfs dfs -copyFromLocal /etc/hadoop/conf/hdfs-site.xml .
hdfs dfs -copyFromLocal /etc/hadoop/conf/yarn-site.xml .

2. Create the Yarnfile (hive.json):

{
"name": "hive",
"lifetime": "-1",
"version": "3.1.0.3.1.0.0",
"artifact": {
"id": "<docker registry server>:5000/hive2",
"type": "DOCKER"
},
"configuration": {
"env": {
"HIVE_LOG_DIR": "var/log/hive",
"YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS": "/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro",
"HADOOP_HOME": "/usr/hdp/3.1.0.0-78/hadoop"
},
"properties": {
"docker.network": "host"
},
"files": [
{
"type": "TEMPLATE",
"dest_file": "/etc/hadoop/conf/core-site.xml",
"src_file": "core-site.xml"
},
{
"type": "TEMPLATE",
"dest_file": "/etc/hadoop/conf/yarn-site.xml",
"src_file": "yarn-site.xml"
},
{
"type": "TEMPLATE",
"dest_file": "/etc/hadoop/conf/hdfs-site.xml",
"src_file": "hdfs-site.xml"
},
{
"type": "XML",
"dest_file": "/etc/hive/conf/hive-site.xml",
"properties": {
"hive.zookeeper.quorum": "${CLUSTER_ZK_QUORUM}",
"hive.zookeeper.namespace": "hiveserver2",
"hive.server2.zookeeper.publish.configs": "true",
"hive.server2.support.dynamic.service.discovery": "true",
"hive.support.concurrency": "true",
"hive.metastore.warehouse.dir": "/user/${USER}/warehouse",
"javax.jdo.option.ConnectionUserName": "hive",
"javax.jdo.option.ConnectionPassword": "hive",
"hive.server2.enable.doAs": "false",
"hive.metastore.schema.verification": "true",
"hive.metastore.db.type": "MYSQL",
"javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
"javax.jdo.option.ConnectionURL": "jdbc:mysql://<mysql-server docker host>:3306/hive?createDatabaseIfNotExist=true",
"hive.metastore.event.db.notification.api.auth" : "false",
"hive.metastore.uris": "thrift://hivemetastore-0.${SERVICE_NAME}.${USER}.${DOMAIN}:9083"
}
}
]
},
"components": [
{
"name": "hiveserver2",
"number_of_containers": 1,
"launch_command": "sleep 25; /usr/hdp/current/hive-server2/bin/hiveserver2",
"resource": {
"cpus": 1,
"memory": "1024"
},
"configuration": {
"files": [
{
"type": "XML",
"dest_file": "/etc/hive/conf/hive-site.xml",
"properties": {
"hive.server2.thrift.bind.host": "${COMPONENT_INSTANCE_NAME}.${SERVICE_NAME}.${USER}.${DOMAIN}",
"hive.server2.thrift.port": "10000",
"hive.server2.thrift.http.port": "10001"
}
}
],
"env": {
"HADOOP_OPTS": "-Xmx1024m -Xms512m"
}
}
},
{
"name": "hivemetastore",
"number_of_containers": 1,
"launch_command": "sleep 5;/usr/hdp/current/hive-metastore/bin/schematool -initSchema -dbType mysql;/usr/hdp/current/hive-metastore/bin/hive --service metastore",
"resource": {
"cpus": 1,
"memory": "1024"
},
"configuration": {
"files": [
{
"type": "XML",
"dest_file": "/etc/hive/conf/hive-site.xml",
"properties": {
"hive.metastore.uris": "thrift://${COMPONENT_INSTANCE_NAME}.${SERVICE_NAME}.${USER}.${DOMAIN}:9083"
}
}
],
"env": {
"HADOOP_OPTS": "-Xmx1024m -Xms512m"
}
}
}
]
}

3. Deploy the application using the YARN Services API:

yarn app -launch hive hive.json

Test access to Hive:

The Registry DNS service that runs on the cluster listens for inbound DNS requests. Those requests are standard DNS requests from users or other DNS servers (for example, DNS servers that have the RegistryDNS service configured as a forwarder). With this setup we can connect via beeline to the "${COMPONENT_INSTANCE_NAME}.${SERVICE_NAME}.${USER}.${DOMAIN}" hostname, as long as our client is using the corporate DNS server. More details here.

Because this test environment does not have RegistryDNS configured, we need to manually find where the hiveserver2 Docker container is running:

curl -X GET 'http://<RM-host>:8088/app/v1/services/hive?user.name=hive' | python -m json.tool

In the container information for component hiveserver2-0 we will find:

"containers": [
    {
        "bare_host": "<hiveserver2-0 host hostname>",
On the host where the container is running, connect to Hive via beeline:

su - hive
beeline -u "jdbc:hive2://<hostname -I>:10000/default"

References:

https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/yarn-service/Configurations.html
http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/RegistryDNS.html
https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html

Files are also available in the following GitHub repo: https://github.com/PedroAndrade89/docker_hdp_services.git
03-22-2019 03:44 PM
Excellent article @Pedro Andrade!