Member since
12-15-2015
15
Posts
1
Kudos Received
1
Solution
11-09-2021
09:25 PM
Installing and configuring Livy on CDH 6.x.x
Livy is a preferred way to run Spark jobs on several Hadoop installations, but not on CDH. While preparing for a CDP migration, one of our use cases switched to Apache Airflow so that jobs could run without an edge node (or "bastion node"). The team wanted to begin using Airflow before the CDP migration, so they asked me to install Livy on a CDH edge node.
A search online for Livy on CDH returned little helpful information, but I did find information on how to download and install it at https://livy.apache.org/
Step 1: Determine which account will be used to run Livy
Under standard Linux permissions, an application can access or execute any program or file that the account running it can access, unless you configure SELinux or other access-management software. Pick the account that Livy will run as with this in mind.
Step 2: Set up a keytab
You'll need a Kerberos principal. If you use Active Directory principals with your CDH deployment, then this account will be outside of your Hadoop platform. You can use a tool like ktutil to create a keytab for your Kerberos principal.
Step 3: Set up your server to run Livy
Livy requires the basic Hadoop and Spark environment variables.
export JAVA_HOME=/usr/java/default/jre
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
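Before starting Livy, it can help to confirm that these variables are actually set in the account's environment. The following is only a minimal sketch: the function name check_livy_env is made up for illustration, and it assumes that every one of the four variables should point at an existing directory.

```shell
# Report which of the variables Livy needs are unset or point at a missing directory.
# Prints one line per problem and returns non-zero if any problem was found.
# check_livy_env is a hypothetical helper name, not part of Livy or CDH.
check_livy_env() {
    status=0
    for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR SPARK_HOME; do
        # Indirect lookup: read the value of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        fi
    done
    return $status
}
```

Run check_livy_env as the account chosen in Step 1; no output and a zero exit status mean all four variables resolve to directories.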
Step 4: Download and install Livy
Download the Livy package zip file from https://livy.apache.org/download/ using wget. Example:
cd /var/tmp
wget https://dlcdn.apache.org/incubator/livy/0.7.1-incubating/apache-livy-0.7.1-incubating-bin.zip
Unzip the resulting zip file:
unzip apache-livy-0.7.1-incubating-bin.zip
Deploy the package and link the default symlink directory:
mkdir /opt/livy
mv /var/tmp/apache-livy-0.7.1-incubating-bin /opt/livy/
ln -s /opt/livy/apache-livy-0.7.1-incubating-bin /opt/livy/default
Step 5: Set up livy.conf
Livy ships with template files that you need to copy to "real" files. Start with livy.conf:
cp /opt/livy/default/conf/livy.conf.template /opt/livy/default/conf/livy.conf
Edit the newly created livy.conf file and add two lines in the commented-out Kerberos section:
livy.server.launch.kerberos.principal=${KERBEROS_PRINCIPAL}
livy.server.launch.kerberos.keytab=${KERBEROS_KEYTAB}
Replace the placeholders with the full Kerberos principal name and the full path to the keytab.
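If you prefer to generate the two Kerberos lines rather than type them, something like the sketch below could work. The function name livy_kerberos_settings is hypothetical, and the absolute-path check is my own assumption about what "full path" should mean here; the two property names are the ones from this step.

```shell
# Print the two Kerberos settings for livy.conf.
# $1: full Kerberos principal name, $2: absolute path to the keytab.
# livy_kerberos_settings is a hypothetical helper name, not part of Livy.
livy_kerberos_settings() {
    principal=$1
    keytab=$2
    case $keytab in
        /*) ;;  # absolute path, accepted
        *) echo "keytab path must be absolute: $keytab" >&2; return 1 ;;
    esac
    printf 'livy.server.launch.kerberos.principal=%s\n' "$principal"
    printf 'livy.server.launch.kerberos.keytab=%s\n' "$keytab"
}
```

For example, livy_kerberos_settings livy@EXAMPLE.COM /etc/security/keytabs/livy.keytab >> /opt/livy/default/conf/livy.conf would append both lines; the principal and keytab path shown are placeholders, not values from this article.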
Step 6: Run the Livy server
Livy server runs as a background process. This article doesn't discuss how to run it as a service that starts automatically.
/opt/livy/default/bin/livy-server start
Step 7: Test Livy
You can use one of the recommended test commands from another node:
curl -X POST --data '{"kind": "spark"}' -H "Content-Type: application/json" http://<LIVY_HOST>:8998/sessions
You can also test from a web browser:
http://<LIVY_HOST>:8998
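The POST above returns a JSON description of the new session, including an "id" and a "state" field. To watch the session come up from a script, a small helper like the one below can pull out the state. This is a rough sketch only: session_state is a made-up name, and extracting a field with sed is fragile compared to a real JSON parser (it reads the last "state" key on a line).

```shell
# Extract the value of the "state" field from a Livy session JSON document on stdin.
# session_state is a hypothetical helper name; sed-based extraction is an assumption
# made to avoid extra dependencies and is not a robust JSON parser.
session_state() {
    sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

For example, curl -s http://<LIVY_HOST>:8998/sessions/0 | session_state would print the state of session 0, assuming 0 is the id returned by the POST. A session is usually ready to accept statements once it reports idle.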
Disclaimer: This article is contributed by an external user. The steps may not be verified by Cloudera and may not be applicable for all use cases and may be very specific to a particular distribution. Please follow with caution and at your own risk. If needed, raise a support case to get the confirmation.
11-04-2019
05:52 AM
I'm running a cluster with 8 worker nodes configured with 160GB Impala Daemon Memory Limit. The worker nodes each have 370GB RAM and based on a look at the standard Host Memory Usage graph from Cloudera Manager for the nodes, it looks like I have capacity for additional query space.
My question: Does it look like I have room to increase my Impala values to meet my needs? From my viewpoint, I think I have at least another 100GB of headroom, but I don't want to impact Hive or Spark processing that may occur during the same time windows.
I'd like to accomplish the following:
Give queries that tend to overreach on Impala RAM the additional capacity they need. These queries read some big tables, sometimes with thousands of partitions, and they have a tendency to run out of RAM.
Reduce Impala usage of scratch directories on these large queries. My cluster is storage-constrained, so when Impala goes heavy into the scratch directories, it not only takes a long time for the queries to finish, but the cluster's health starts to show issues.
Currently, I don't have any admission control settings enabled. Any query can use all the available resources. I'd like to increase the available RAM for all of Impala while limiting the RAM for individual queries.
Over the past week, the nodes' host memory usage graph shows the following example peaks:
Peak 1:
Physical Memory Buffers: 2.6G
Physical Memory Caches: 203.5G
Physical Memory Capacity: 370G
Physical Memory Used: 172G
Swap Used: 0K
Peak 2:
Physical Memory Buffers: 2.6G
Physical Memory Caches: 135G
Physical Memory Capacity: 370G
Physical Memory Used: 232G
Swap Used: 768K
During a quiet time, the numbers look like:
Physical Memory Buffers: 70.7M
Physical Memory Caches: 2.8G
Physical Memory Capacity: 370G
Physical Memory Used: 19.3G
Swap Used: 768K
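For reference, the headroom arithmetic behind the "at least another 100GB" estimate can be sketched as below. This is only a rough sketch: the function name headroom_gb is made up for illustration, and it assumes that "Physical Memory Used" in the Cloudera Manager graph excludes page cache and that cached memory can be reclaimed under pressure, which may not hold for every workload.

```shell
# Memory not claimed by processes at a given moment, in whole GB:
# capacity minus "Physical Memory Used".
# Assumption: page cache is not counted in "used" and can be reclaimed by the kernel.
# headroom_gb is a hypothetical helper name.
headroom_gb() {
    capacity=$1
    used=$2
    echo $((capacity - used))
}
```

With the figures above, headroom_gb 370 172 gives 198 for Peak 1 and headroom_gb 370 232 gives 138 for Peak 2, so even the tighter peak leaves more than 100GB under these assumptions.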
Labels: Apache Impala
07-06-2017
08:28 AM
I'm reading through the installation instructions for CDSW at https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_install.html and found the section to set up a Wildcard DNS domain. In this section, it references a master IP address. I need to understand if this master IP address is the IP of the Cloudera Manager master node, or if it is the IP of the CDSW master. I'm assuming it's the CDSW master, but I don't want to set up a Wildcard DNS domain incorrectly. Thanks, David Webb
05-24-2016
07:59 AM
That fixed it. I used alternatives to register the JDK 1.7 javac as a new alternative, then ran it again to select that alternative for javac.
# alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_67-cloudera/bin/javac 1
# alternatives --config javac
There are 2 programs which provide 'javac'.
Selection Command
-----------------------------------------------
* 1 /usr/lib/jvm/java-1.6.0-openjdk.x86_64/bin/javac
+ 2 /usr/java/jdk1.7.0_67-cloudera/bin/javac
Enter to keep the current selection[+], or type selection number: 2
Thanks for your help!
DaveW
05-24-2016
06:00 AM
Sure.
# java -version
java version "1.7.0_95"
OpenJDK Runtime Environment (rhel-2.6.4.0.el6_7-x86_64 u95-b00)
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
05-23-2016
02:18 PM
I'm new to maven, so when I ran mvn install from /root, I got an error:
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.130s
[INFO] Finished at: Mon May 23 17:00:31 EDT 2016
[INFO] Final Memory: 7M/964M
[INFO] ------------------------------------------------------------------------
[ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/root). Please verify you invoked Maven from the correct directory. -> [Help 1]
The POM file was in the /tmp/cm_ext/validator directory.
[root ~]# cd /tmp/cm_ext/validator
[root validator]# ll
total 5048
-rw-r--r--. 1 root root 5144659 Feb 19 2013 apache-maven-3.0.5-bin.tar.gz
-rw-r--r--. 1 root root 9409 May 23 16:58 pom.xml
-rw-r--r--. 1 root root 476 May 19 14:07 README.md
drwxr-xr-x. 5 root root 4096 May 19 14:07 src
I was able to build the cm-schema package as suggested. Once I did, I got a new error when I tried to build validator:
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /tmp/cm_ext/validator/src/main/java/com/cloudera/cli/validator/ApplicationConfiguration.java:[30,20] package java.nio.file does not exist
[ERROR] /tmp/cm_ext/validator/src/main/java/com/cloudera/cli/validator/ApplicationConfiguration.java:[31,20] package java.nio.file does not exist
[ERROR] /tmp/cm_ext/validator/src/main/java/com/cloudera/cli/validator/ApplicationConfiguration.java:[72,18] cannot find symbol
symbol : variable Paths
location: class com.cloudera.cli.validator.ApplicationConfiguration
[ERROR] /tmp/cm_ext/validator/src/main/java/com/cloudera/cli/validator/ApplicationConfiguration.java:[71,14] cannot find symbol
symbol : variable Files
location: class com.cloudera.cli.validator.ApplicationConfiguration
[INFO] 4 errors
[
I assume that I need to install java.nio.file or to make sure that it's in a path that can be accessed. I'll look into this, but if anyone has any clues that I can follow, they will be greatly appreciated.
Thanks,
- DaveW
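A note on the error itself: the java.nio.file package was introduced in Java 7, so "package java.nio.file does not exist" usually means the build is compiling with an older javac, even when java -version reports 1.7. A quick way to check is to look at what javac reports. The helper below is a minimal sketch; the name javac_has_nio_file is made up for illustration, and it assumes the version string has the usual "javac <version>" form.

```shell
# Return success if a "javac -version" string reports Java 7 or later,
# the first release that includes java.nio.file.
# Handles both the legacy "1.x" scheme (javac 1.7.0_67) and the newer one (javac 11.0.2).
# javac_has_nio_file is a hypothetical helper name.
javac_has_nio_file() {
    version=${1#javac }      # drop the leading "javac "
    major=${version%%.*}     # text before the first dot
    if [ "$major" = "1" ]; then
        rest=${version#1.}   # legacy scheme: the real major follows "1."
        major=${rest%%.*}
    fi
    [ "$major" -ge 7 ]
}
```

Usage would look like javac_has_nio_file "$(javac -version 2>&1)" && echo ok; the 2>&1 is there because older javac releases print the version to stderr.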
05-19-2016
11:54 AM
I've installed Apache NiFi on one of my CDH 5.7 clusters (Linux version CentOS 6.7), but I'd like to manage it from within Cloudera Manager. I did some research on parcels and on CSDs. It looks like this is something I can do, and it doesn't look like it should be too difficult. I came across the GitHub page https://github.com/prateek/nifi-parcel, which gives step-by-step instructions for creating a NiFi parcel for Cloudera. Unfortunately, I'm running into errors. The steps instruct me to download cloudera/cm_ext and then build it.
cd /tmp
git clone https://github.com/cloudera/cm_ext
cd cm_ext/validator
mvn install
When I ran Maven to install the validator, I hit a build failure.
[WARNING] The POM for com.cloudera.cmf.schema:cloudera-manager-schema:jar:5.5.0 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
I assumed that 5.5.0 might stand for CDH 5.5.0, so I updated the pom.xml to 5.7.0.
Downloading: http://repo.maven.apache.org/maven2/com/cloudera/cmf/schema/cloudera-manager-schema/5.7.0/cloudera-manager-schema-5.7.0.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
...
[ERROR] Failed to execute goal on project schema-validator: Could not resolve dependencies for project com.cloudera.enterprise:schema-validator:jar:5.7.0: Could not find artifact com.cloudera.cmf.schema:cloudera-manager-schema:jar:5.7.0 in cloudera-external (https://repository.cloudera.com/artifactory/ext-release-local/) -> [Help 1]
I searched https://repository.cloudera.com/artifactory/ext-release-local/ and found that there's nothing there under the ./com/cloudera directory. Is there a better way to do this?
12-15-2015
08:35 AM
I was able to rebuild the Oozie job and make it work, although I really don't know what is different. I built the job in sequence this time, so that the steps are listed in-sequence in the XML file. I also built the job steps to reference the lib directory in the job's path. I had previously had success with explicit references, but these didn't seem necessary. I moved the prepare steps to a point right before they were needed instead of all on the first step. I eliminated the output directory definition for TeraValidate because it doesn't seem to be used. Finally, I let Hue/Oozie choose the defaults for Master and Mode. I played around with trying to use YARN and cluster, but these didn't work. My resulting XML (that works) looks like this:
<workflow-app name="TeraGen-TeraSort-TeraValidate" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-27f0"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="spark-27f0">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/davidw/terasort-benchmark.in"/>
            </prepare>
            <master>local[*]</master>
            <mode>client</mode>
            <name>TeraGen</name>
            <class>com.github.ehiggs.spark.terasort.TeraGen</class>
            <jar>lib/spark-terasort.jar</jar>
            <arg>1g</arg>
            <arg>/user/davidw/terasort-benchmark.in</arg>
        </spark>
        <ok to="spark-94fc"/>
        <error to="Kill"/>
    </action>
    <action name="spark-94fc">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/davidw/terasort-benchmark.out"/>
            </prepare>
            <master>local[*]</master>
            <mode>client</mode>
            <name>TeraSort</name>
            <class>com.github.ehiggs.spark.terasort.TeraSort</class>
            <jar>lib/spark-terasort.jar</jar>
            <arg>/user/davidw/terasort-benchmark.in</arg>
            <arg>/user/davidw/terasort-benchmark.out</arg>
        </spark>
        <ok to="spark-bcf9"/>
        <error to="Kill"/>
    </action>
    <action name="spark-bcf9">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>local[*]</master>
            <mode>client</mode>
            <name>TeraValidate</name>
            <class>com.github.ehiggs.spark.terasort.TeraValidate</class>
            <jar>lib/spark-terasort.jar</jar>
            <arg>/user/davidw/terasort-benchmark.out</arg>
        </spark>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>