Member since: 03-01-2017
Posts: 34
Kudos Received: 2
Solutions: 0
02-18-2021
02:32 AM
1 Kudo
Running 'pyspark' applications in CML for model generation and prediction, with data residing in COD
With the recent addition of the Cloudera Operational Database (COD) experience to CDP Public Cloud, we want to explore how it can be leveraged in a real-life data-flow end-user scenario. This article shows how to execute a Spark/pyspark job in Cloudera Machine Learning (CML) to run a modeling task using data residing in COD. We read a table present in COD and, once the prediction is done, write the score table back to COD.
Getting Started
CDP Runtime (supporting COD) >=7.2.2
We assume that the CDP environment, data lake, and Data Engineering data hub have been provisioned. We further assume that the COD and CML experiences have been provisioned for the target CDP environment.
Note: If you are just starting with CDP, refer to The world’s first enterprise data cloud to learn how to put all of these requirements in place with ease.
Some of the following steps are already documented in this blog (thanks @shlomi Tubul). On top of that, we further elaborate and expand on what needs to be done for the CML-COD use case.
Main components used in this demo:
Cloudera Operational Database (COD), as mentioned in my previous post, is a managed dbPaaS solution available as an experience in Cloudera Data Platform (CDP)
CML is designed for data scientists and ML engineers, enabling them to create and manage ML projects from code to production. Main features of CML:
Development Environment for Data Scientists – Isolated, Containerized, and Elastic
Production ML Toolkit – Deploying, Serving, Monitoring, and Governance of ML models
App Serving – Build and Serve Custom applications for ML use-cases
Setting Up the Environment
The first thing we need to do is to create a database in COD:
Log in to Cloudera Data Platform (CDP) Public Cloud 'Control Plane' (CP)
Select Operational Database and then click Create Database
Select the environment to which the COD will be attached and give a unique name for the COD, and then click Create Database
Once created, open the COD page and use the HBase Client Configuration URL to get the hbase-site.xml needed in CML
Next, Provision CML:
Log in to CDP Public Cloud CP
Select Machine Learning and click Provision Workspace
Select the environment for which the CML workspace will be provisioned, give it a unique name, and then click Provision Workspace
Create Project in CML: Model and Prediction
Once CML is provisioned, we create a project in the workspace. We will use the local template and upload the required files to it. create_model_and_score_phoenixTable.py is the pyspark script we will use for the task.
CML: Configuration for use in CML session
Upload the configuration files downloaded from COD (step A.4); we will need the hbase-site.xml file in the CML session to connect to COD.
We also need to configure the spark-defaults.conf file with the jars to be used. If any external cloud storage is in use (from where data is being read), we will need to configure that too, so that Spark can authenticate with IDBroker and get access. Note: Since our data is in an external S3 bucket, we added the appropriate IDBroker mapping to allow the user access to this external bucket.
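For illustration only, the project-level spark-defaults.conf could look roughly like the sketch below. The jar path and bucket name are placeholders (not the values from the original project), and the exact connector jars depend on which Spark-HBase/Phoenix connector your script uses:
# spark-defaults.conf (sketch; adjust the jar names and bucket to your setup)
# Connector jar(s) needed by the pyspark script
spark.jars=/home/cdsw/jars/<hbase-or-phoenix-spark-connector>.jar
# Let Spark request IDBroker credentials for the external bucket holding the data
spark.yarn.access.hadoopFileSystems=s3a://<external-bucket>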
Running the Task
The pyspark script we used can be found here.
Though the code in this file was written for the CDSW integration (on-prem setups), we modified it slightly to work on the cloud-native platform, i.e., CDP Public Cloud.
First, we added two lines at the start of the script. These lines are currently required to copy the hbase-site.xml configuration into Spark's default conf directory (so that the connection to COD works) and to make the file readable by all users. (There is no way to override this as of now, so this workaround is needed.)
We also modified the target_path for the temp files that will be generated by the Spark job, since the user executing this job (the user has been given the "MLUser" permission on the environment) needs access to the specified location.
!cp /home/cdsw/hbase-site.xml /etc/spark/conf/
!chmod 644 /etc/spark/conf/hbase-site.xml

# ... same code section as in the git file ...

target_path = "<path to the location (in our case, an external S3 bucket) where the data resides>"

# ... same code section as in the git file ...
Everything else in the file remains the same.
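To make the overall flow concrete, here is a minimal sketch of what a read–model–score–write script against COD might look like. This is not the contents of create_model_and_score_phoenixTable.py; it assumes the phoenix-spark connector jar is available to the session, and the table names, ZooKeeper quorum, and feature/label/ID columns are placeholders:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("cml-cod-demo").getOrCreate()

# Read the input table from COD through Phoenix (placeholder table name and quorum).
df = (spark.read.format("phoenix")
      .option("table", "INPUT_TABLE")
      .option("zkUrl", "<cod-zookeeper-quorum>:2181")
      .load())

# Assemble placeholder feature columns and fit a simple model.
assembler = VectorAssembler(inputCols=["FEATURE1", "FEATURE2"], outputCol="features")
features = assembler.transform(df)
model = LogisticRegression(labelCol="LABEL").fit(features)

# Score the data and write the score table back to COD (placeholder output table).
scored = model.transform(features).select("ID", "prediction")
(scored.write.format("phoenix")
       .option("table", "BATCHTABLE2")
       .option("zkUrl", "<cod-zookeeper-quorum>:2181")
       .mode("overwrite")
       .save())

spark.stop()
With the Phoenix connector, the overwrite save mode behaves as an upsert into the target table rather than a destructive overwrite.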
Start running the project
Click New Session
Give the session a name and click the Start Session button at the bottom (adjust Workbench, kernel, and Resource Profile if required for the project)
Once the session has started, select the pyspark script file and click the Run icon in the menu above the file contents. Once execution starts, the session logs and task logs tabs appear on the right half of the screen. The logs end when the script execution completes (Success or Failure). There we have it: on Success, the table (BatchTable2) gets created in COD. The session can be closed manually by clicking the Stop button at the top-right corner (or it will be killed by the auto timeout if not in use for a certain amount of time).
... View more
10-28-2018
06:30 PM
@Alexander Saip By clean-up, do you mean you just deleted the contents of the ZooKeeper logs, or that you cleaned up the "hiveserver2" znode? By the looks of it (from the log snippet you posted above), the "hiveserver2" znode might not have been created. Can you log in to the ZooKeeper CLI and check: /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server <zookeeper server host name>:2181, and then do a "ls /" in the ZooKeeper CLI? It should list a "hiveserver2" node there. If it is missing, try to create it [launch the ZooKeeper CLI as the hive user (do "sudo su - hive")], and then restart HiveServer2. If this is a secured cluster, you should also check for Kerberos-related errors in the log (it could be an auth token related issue).
... View more
10-28-2018
06:19 PM
@Mike Lok If you are running HDP 2.x, try the following URL:
jdbc:hive2://<zookeeper server host name>:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
For HDP 3.0, try:
jdbc:hive2://<zookeeper server host name>:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
Also, make sure HiveServer2 Interactive (HSI) is actually running (do a ps -ef | grep llap) on the host where HSI is installed.
... View more
10-25-2018
11:03 AM
@Cody kamat Can you please elaborate a bit more: what is the memory usage after enabling LLAP (used/total memory)? Also, which HDP version are you using and what is the cluster size? There are multiple parameters to configure, such as the number of nodes used by LLAP, the number of LLAP daemons, llap_heap_size, the memory cache per daemon, the number of threads, and a few more like the maximum memory for a YARN container and the Tez container size. The values of all of these depend on the cluster configuration: memory per node, CPU cores per node, and the number of nodes. So, please check all these parameters and let me know your cluster details if you want some recommendations from my side. Also, you can refer to this link, which will probably clarify things for you: https://community.hortonworks.com/articles/149486/llap-sizing-and-setup.html
... View more
08-01-2018
04:20 PM
@Ashnee Sharma Based on the little info you shared, below is my guess: the cache only comes into the picture when similar data is used by the queries being run. If there is no overlap of data between the executing queries, the cache has no impact at all (the cache is empty when the first query runs), and the cached data will keep changing for every new (and data-exclusive) query, based on the cache size. If the above is not the case, then please share the Hive Interactive Server logs to debug it further.
... View more
04-09-2018
06:29 AM
@Saurabh, There seem to be 2 different issues at hand. 1) Make sure the user ID of each user is the same across all nodes in the cluster (otherwise this will cause conflicts, as the NFS permissions configuration uses the username, the user ID, and the group ID). As you can see in your description above:
uid 0594903 // where 0 is the uid of root on the other machine, and 594903 is the uid of hdfs, which is the superuser on the DataNode machine where the NFS gateway is running. This is caused by the same mismatch, so it is better to keep 0 for root and update the user ID for hdfs (but once you do that, you will need to update a lot of directories to map to the new uid of the hdfs user). I am not sure how complicated this might get, but it has to be done. 2) Make sure the user you want to change the ownership to (chown) is part of the config files provided when you changed the default FS to NFS. These files (in my case, users.json and groups.json) list each user with its user ID, and each group we want to configure for NFS with its group ID. Example entry in users.json:
{
"userName":"root",
"userID":"0"
}
Example entry in groups.json:
{
"groupName":"root",
"groupID":"0"
}
Also, to run the 'chown' command, make sure you are doing it as the hdfs user (from your log above, it seems only the hdfs user can do this). Hope this helps.
... View more
03-22-2018
06:38 AM
@Vani Deeppak Have a look at this article: https://community.hortonworks.com/articles/53531/importing-data-from-teradata-into-hive.html It has the link(s) to the Sqoop documentation and also explains how to use it. Hope this helps.
... View more
02-13-2018
08:30 AM
@Ashnee Follow this link: http://eastcirclek.blogspot.in/2016/10/how-to-start-hive-llap-functionality.html It should solve your problem.
... View more
12-06-2017
10:10 AM
@Dmitro Vasilenko The error log above points to a memory issue: "[pid=15416,containerID=container_e119_1512480218177_0094_01_000002] is running beyond physical memory limits. Current usage: 27.0 GB of 26 GB physical memory used". I think the memory settings for the LLAP daemon exceed the physically available memory. Please check.
... View more
12-05-2017
05:18 AM
@yassine: Check the permissions on the entire path '/usr/hdp/current/hadoop-client/conf' on all cluster nodes, and make sure the hdfs user can access it.
... View more
12-05-2017
05:13 AM
@dmitro: It would be better if you could post the application logs or container-level logs; they will have the exact error. To me it seems like a memory issue and could be related to the YARN container size. You can get the application- and container-level logs this way:
yarn logs -applicationId application_1512395880314_0027
yarn logs -containerId container_e115_1512395880314_0027_01_000006
For a full trace for an application and container:
yarn logs -applicationId application_1512395880314_0027 -containerId container_e115_1512395880314_0027_01_000014
... View more
10-24-2017
05:25 AM
@Jasmin, From the metastore log, the line that seems to be of interest is: "[org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@7688c6c2]: common.JvmPauseMonitor (JvmPauseMonitor.java:run(193)) - Detected pause in JVM or host machine (eg GC): pause of approximately 8978ms". This means there was a memory issue in the Java heap for HiveServer and the JVM paused (due to GC), which probably means the query itself hung. Also, following this I only see the query being in progress but never completing (from the log shared). So, it would be a good idea to check the Java heap memory and other relevant parameters: -XX:NewRatio= ?
-XX:MaxHeapFreeRatio= ?
-XX:MinHeapFreeRatio= ?
... View more
10-23-2017
08:29 PM
@Sudheer, When you enable LLAP, the default queue should be changed to the LLAP queue, and the default percentage for the LLAP queue is 93% (you can see this in the YARN Queue Manager view after enabling LLAP).
... View more
10-23-2017
08:09 PM
@Jasim Can you post the metastore logs here for debugging?
... View more
10-23-2017
08:03 PM
@Sidharth, The most probable point to check would be whether iptables is enabled on the newly added host (it should be disabled for Ambari to be able to reach the host, or the required ports on the new host must be open and accessible). The next point would be the NTP server, which Ambari uses to keep all hosts in sync. The host check can usually take some time depending on the configuration, so you should wait and let it finish. Once finished (even with errors), post the result here for further debugging (if none of the above hints solve it). Also refer to this link if you manually install the agent for registration: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.0/bk_ambari_reference_guide/content/ch_amb_ref_installing_ambari_agents_manually.html Regards, Narendra
... View more
10-16-2017
06:36 AM
@Yair Ogen, Can you let me know whether you are trying to execute the query using HiveServer1 or HiveServer2? If it is HiveServer1, then please change this in the Hive config: hive.server2.enable.doAs=true The issue looks to be related to the user being used to execute the particular Hive query. Enabling the above config parameter makes the query execute as the end user rather than as the hive user.
... View more
10-10-2017
10:20 AM
@D Giri, Can you post the output of "ls /etc/security/keytabs" here, along with the component that is part of the cluster and fails to start? My suspicion is that we should not put anything in the "Principal Suffix" parameter field when the keytab is created for any service, as that adds the cluster name into the keytab principal, whereas the service only looks it up by the username of the respective service.
... View more
10-10-2017
07:16 AM
1: Check if hbase-master is running:
sudo /etc/init.d/hbase-master status
If not, then start it:
sudo /etc/init.d/hbase-master start
2: Check if hbase-regionserver is running:
sudo /etc/init.d/hbase-regionserver status
If not, then start it:
sudo /etc/init.d/hbase-regionserver start
3: Check if zookeeper-server is running:
sudo /etc/init.d/zookeeper-server status
If not, then start it:
sudo /etc/init.d/zookeeper-server start
4: Grep for an open port:
netstat -apn | grep <port to look for>
5: Process memory usage:
ps -eo pmem,pcpu,vsize,pid,cmd | sort -k 1 -nr | head -5
or
ps -A --sort -rss -o comm,pmem | head -n 11
6: Free memory on the system (CentOS):
free -g
7: Checking YARN application logs:
yarn logs -applicationId <application ID>
Dig deeper by using the container ID along with the application ID:
yarn logs -applicationId <application ID> -containerId <container id>
8: Starting a zkCli connection to the ZooKeeper server:
cd /grid/0/hdp/current/zookeeper-client/bin
./zkCli.sh -server <zookeeper server fqdn>:2181
9: Starting the NameNode from the CLI:
su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start namenode"
10: Spark history server - starting it when the HDFS directory is missing:
hdfs dfs -mkdir /spark-history
hdfs dfs -chown -R spark:hadoop /spark-history
hdfs dfs -chmod -R 777 /spark-history
su - spark -c "/usr/hdp/current/spark-historyserver/sbin/start-history-server.sh"
11: Starting the History Server (MapReduce) from the CLI:
su -l mapred -c "/usr/hdp/current/hadoop-mapreduce-historyserver/sbin/mr-jobhistory-daemon.sh start historyserver"
12: Starting the App Timeline Server (YARN) from the CLI:
su - yarn
/grid/0/hdp/2.5.3.0-37/hadoop-yarn/sbin/yarn-daemon.sh start timelineserver
13: Starting and stopping the Oozie server from the CLI (on the machine where the Oozie server is installed):
su oozie
/usr/hdp/current/oozie-server/bin/oozied.sh start
/usr/hdp/current/oozie-server/bin/oozied.sh stop
14: Starting the HBase Master (HBase) from the CLI:
Go to the cluster node where the HBase Master is installed, and then:
su -l hbase -c "/usr/hdp/current/hbase-master/bin/hbase-daemon.sh start master; sleep 25"
15: Starting HiveServer2 (Hive) from the CLI:
Go to the cluster node where HiveServer2 is installed, then:
su hive
nohup /usr/hdp/current/hive-server2/bin/hiveserver2 -hiveconf hive.metastore.uris=" " > /tmp/hiveserver2HD.out 2> /tmp/hiveserver2HD.log
Or
su - hive -l -c 'HIVE_CONF_DIR=/etc/hive/conf /usr/hdp/current/hive-server2/bin/hiveserver2 -hiveconf hive.metastore.uris="" -hiveconf hive.log.dir=/var/log/hive -hiveconf hive.log.file=hiveserver2.log 1>/var/log/hive/hiveserver2.log 2>/var/log/hive/hiveserver2.log &'
Or
sudo su - -c "export HIVE_CONF_DIR=/tmp/hiveConf;nohup /usr/hdp/current/hive-server2/bin/hiveserver2 -hiveconf hive.metastore.uris=' ' -hiveconf hive.log.file=hiveServer2.log -hiveconf hive.log.dir=/var/log/hive > /var/log/hive/hiveServer2.out 2>> /var/log/hive/hiveServer2.log &" hive
Then, as root, record the HiveServer2 PID (28627 in this example, found via the grep) and fix the PID-file permissions:
ps -ef | grep HiveSer
echo 28627 > /var/run/hive/hive-server.pid
echo 28627 > /var/run/hive/hive.pid
chmod 644 /var/run/hive/hive-server.pid
chmod 644 /var/run/hive/hive.pid
chown hive:hadoop /var/run/hive/hive-server.pid
chown hive:hadoop /var/run/hive/hive.pid
... View more
- Find more articles tagged with:
- app_timeline_server
- Hadoop Core
- HBase
- hiveserver2
- Issue Resolution
- issue-resolution
- namenode
- Oozie
10-10-2017
06:50 AM
1 Kudo
Manual cluster install:
1. Create VMs: OS = RHEL6/RHEL7 (depending on your HDP version), Java = OpenJDK 8. Set JAVA_HOME on all nodes of the cluster (make sure you have the correct Java path before setting it; the path value may differ based on which Java subversion gets installed):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk.x86_64
2. Set up passwordless SSH: pick one node as the Ambari host, create a public/private key pair using ssh-keygen, then copy the public key into the "/home/root/.ssh/authorized_keys" file on all nodes. Test that passwordless SSH works (by doing ssh from the Ambari node to all other nodes); if this fails, you need to resolve it first, as the entire installation depends on it.
3. NTP setup (on all nodes):
yum install -y ntp
Start NTP: "/etc/init.d/ntpd start" (RHEL6), or "systemctl enable ntpd" and "systemctl start ntpd" (RHEL7)
4. Check the "/etc/hosts" file on all hosts; it should have entries for all nodes in it:
vi /etc/hosts
5. Edit the network file (on all nodes):
vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=<fully.qualified.domain.name>
(the FQDN of the particular node where you are editing the /etc/sysconfig/network file)
6. Stop iptables (sometimes it may throw an error like "permission denied"; you can ignore it):
service iptables stop (RHEL6)
systemctl disable firewalld and service firewalld stop (RHEL7)
7. Disable SELinux (all nodes):
setenforce 0
8. Set umask to 0022 (all nodes):
umask 0022
echo umask 0022 >> /etc/profile
9. Get the Ambari repo file (on the node where Ambari will be installed):
wget -nv http://s3.amazonaws.com/dev.hortonworks.com/ambari/centos6/2.x/BUILDS/2.4.3.0-35/ambaribn.repo -O /etc/yum.repos.d/ambari.repo (RHEL6, HDP 2.6)
wget -nv http://public-repo-1.hortonworks.com/ambari/centos7-ppc/2.x/updates/2.5.0.1/ambari.repo -O /etc/yum.repos.d/ambari.repo (RHEL7, HDP 2.6)
10. Install the Ambari server (Ambari server node only):
yum install ambari-server
11. If you want to use MySQL for Ambari, set up MySQL. Check this page: https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.0.0/bk_ambari_reference_guide/content/_using_ambari_with_mysql.html
yum install mysql-connector-java
yum install mysql-server
Start the service once MySQL is installed: service mysqld start
Follow the steps to set up the DAT file required by the Ambari install. When you do 'mysql -u root -p' it will ask for a password; just hit 'enter' (i.e. a blank password).
12. Configure and start the Ambari server:
ambari-server setup -j $JAVA_HOME (assuming JAVA_HOME is already set; if not, set it)
Accept y to temporarily disable SELinux. Accept n for "Customize user account for ambari-server" (assuming Ambari runs under the "root" user). Accept y to temporarily disable iptables. Select n for advanced database configuration (select y if you want to set up Ambari with MySQL or any other DB, which should already be installed on the same node). At "Proceed with configuring remote database connection properties [y/n]" choose y. This completes the setup.
13. Start and check the Ambari server:
ambari-server start
ambari-server status
If you want to stop the Ambari server: ambari-server stop
When successful, you should be able to reach the Ambari UI and work from there on. Happy installing....
... View more
- Find more articles tagged with:
- Hadoop Core
- hdp-2.5.0
- hdp-2.6.0
- How-ToTutorial
- Installation
08-01-2017
06:22 PM
1 Kudo
@hwx: It would be a good idea to check the 'hosts' file on the machine running HiveServer2 and verify that it contains the entry for the DataNode host.
... View more
07-21-2017
05:43 AM
@JT Ng As the error itself says, the oozie user most probably is not able to access the file. So, check the permissions on the entire path to the file, not just the very last folder in the file path.
... View more
07-20-2017
07:15 AM
@dnyanesh kulkarni The issue seems to be due to a lack of permission to execute the query as the end user on the Hive server side. Please change 'Run as end user instead of Hive user' and set it to 'true' in hive-site.xml (hive.server2.enable.doAs=true).
... View more
07-18-2017
05:55 AM
Hi @Krishna S Let me know your email ID and I can email you there directly. Or, if you are from Hortonworks, you can surely find me via HipChat.
... View more
07-15-2017
04:59 PM
@Krishna S Yes, this can be done. You can install all the required components using Ambari with HDFS. The default storage (default file system) can later be changed to NFS. This is a doable configuration, but it can't be fully described here; you can email me about your specific requirement.
... View more
07-15-2017
04:19 PM
@srinivas The issue looks to be with the Java heap memory for the DataNodes (the JvmPauseMonitor error points to that), maybe due to the huge amount of data being handled on your systems. Try increasing the Java heap memory; hopefully that will resolve the issue.
... View more
07-15-2017
03:54 PM
@Bhavin Tandel The JIRA could be relevant, as a slow response usually points to Java heap memory issues. But it would be better if you could check the Hive server logs and see whether any 'OutOfMemory' errors show up there. Please post the log errors here (if any are present) for further analysis.
... View more
07-15-2017
03:39 PM
@Abhishek Kumar Please post hiveserver2.log, the history server log, the DataNode logs, etc., so we can understand the issue and assist you further. In the meantime, do check these log files yourself: it might be a memory-related issue (check for 'OutOfMemory' text in these files), it could be a folder permission issue, and so on. For a better-targeted answer, we need the log files.
... View more
07-11-2017
09:56 AM
Hi @Saurab Dahal Can you check the YARN application and container logs and post them here? Any YARN-specific issue will be recorded there and might give a clue as to what has gone wrong.
... View more