Member since: 07-31-2013
Posts: 1924
Kudos Received: 462
Solutions: 311

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1978 | 07-09-2019 12:53 AM |
| | 11921 | 06-23-2019 08:37 PM |
| | 9178 | 06-18-2019 11:28 PM |
| | 10174 | 05-23-2019 08:46 PM |
| | 4600 | 05-20-2019 01:14 AM |
12-15-2015
05:51 PM
CDH5 is compiled with JDK7 and will not run on an older JDK (such as JDK6, which is no longer supported). Please install JDK7 from http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5/RPMS/x86_64/oracle-j2sdk1.7-1.7.0+update67-1.x86_64.rpm, or use JDK8 from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. Please also remove JDK6, to avoid any ambiguity over which JVM the daemons run with.
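On a RHEL/CentOS 6 host, that swap could look like the below (a sketch only; the JDK6 package name is an assumption to adjust for your system):

```bash
# Remove JDK6 so no daemon can accidentally pick it up (package name may differ)
sudo yum remove jdk

# Install the Cloudera-packaged Oracle JDK7 RPM linked above
wget http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5/RPMS/x86_64/oracle-j2sdk1.7-1.7.0+update67-1.x86_64.rpm
sudo yum localinstall oracle-j2sdk1.7-1.7.0+update67-1.x86_64.rpm
```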
12-15-2015
12:45 AM
On non-kerberized (insecure) clusters, you can do the below:

```bash
export HADOOP_USER_NAME=username
[command]
```

For example:

```bash
export HADOOP_USER_NAME=hdfs
yarn logs -applicationId $application_id
```
12-12-2015
12:29 PM
What version of CM are you using, and have you attempted recently to redeploy the Spark gateway client configs? The below is what I have out of the box in CM 5.5:

```bash
#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##

SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR="$SELF"
fi

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hadoop

### Path of Spark assembly jar in HDFS
export SPARK_JAR_HDFS_PATH=${SPARK_JAR_HDFS_PATH:-''}

### Some definitions needed by older versions of CDH.
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib

SPARK_PYTHON_PATH=""
if [ -n "$SPARK_PYTHON_PATH" ]; then
  export PYTHONPATH="$PYTHONPATH:$SPARK_PYTHON_PATH"
fi

export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}
if [ -n "$HADOOP_HOME" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

SPARK_EXTRA_LIB_PATH="/opt/cloudera/parcels/GPLEXTRAS-5.5.0-1.cdh5.5.0.p0.7/lib/hadoop/lib/native"
if [ -n "$SPARK_EXTRA_LIB_PATH" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SPARK_EXTRA_LIB_PATH
fi
export LD_LIBRARY_PATH

HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}
HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
if [ -d "$HIVE_CONF_DIR" ]; then
  HADOOP_CONF_DIR="$HADOOP_CONF_DIR:$HIVE_CONF_DIR"
fi
export HADOOP_CONF_DIR

PYLIB="$SPARK_HOME/python/lib"
if [ -f "$PYLIB/pyspark.zip" ]; then
  PYSPARK_ARCHIVES_PATH=
  for lib in "$PYLIB"/*.zip; do
    if [ -n "$PYSPARK_ARCHIVES_PATH" ]; then
      PYSPARK_ARCHIVES_PATH="$PYSPARK_ARCHIVES_PATH,local:$lib"
    else
      PYSPARK_ARCHIVES_PATH="local:$lib"
    fi
  done
  export PYSPARK_ARCHIVES_PATH
fi

# Set distribution classpath. This is only used in CDH 5.3 and later.
export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt")
```
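If your generated file diverges badly from the above, redeploying the client configuration should regenerate it. That can be done from the CM UI's cluster actions, or via the CM API; a sketch with curl (host, port, credentials, API version, and cluster name are all assumptions to adapt):

```bash
# Trigger a cluster-wide client configuration redeploy
curl -u admin:admin -X POST \
  'http://cm-host:7180/api/v10/clusters/cluster1/commands/deployClientConfig'
```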
12-12-2015
11:31 AM
Thank you for trying it out. Could you also post your /etc/spark/conf/spark-env.sh contents here, please? P.S. Pro tip: when using full paths to a file under the parcel, use its symlinks to stay upgrade-compatible, i.e. instead of /opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/, simply use /opt/cloudera/parcels/CDH/.
12-12-2015
10:45 AM
What is your NodeManager's yarn.nodemanager.resource.memory-mb value set to? It's possible that YARN is unable to allocate a container for the executors because that value is too low, in which case things could hang this way. You could raise that config by another 1 GB and restart the cluster, then re-run the shell to see if that resolves the issue. You can also check the Spark AM's log (visit your RM Web UI, click through to the RUNNING Spark application, and click the "logs" link for its ApplicationMaster). It may show what it is stuck on: whether it has yet to spawn an executor, or whether it's something else.
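One quick way to confirm the effective value on the NodeManager host (a sketch; the process directory layout assumed here is specific to CM-managed clusters):

```bash
# Print the configured NodeManager container memory from its live process config
sudo grep -A1 'yarn.nodemanager.resource.memory-mb' \
  /var/run/cloudera-scm-agent/process/*-NODEMANAGER/yarn-site.xml
```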
12-12-2015
10:29 AM
1 Kudo
> I guess that if I use --files I use the same log4j.properties for driver and executor.

Where are you expecting your logs to be visible, BTW? At the driver, or within the executors? Since you are using yarn-client mode, the custom logger passed via --files will be applied only to the executors. If you'd like it applied to the driver as well via just the use of --files, you will need to use yarn-cluster mode, as so:

```bash
spark-submit --name "CentralLog" \
  --master yarn-cluster \
  --class example.spark.CentralLog \
  --files /opt/centralLogs/conf/log4j.properties#log4j.properties \
  --jars $SPARK_CLASSPATH \
  --executor-memory 2g \
  /opt/centralLogs/libProject/produban-paas.jar
```

Otherwise, additionally pass an explicit -Dlog4j.configuration=file:/opt/centralLogs/conf/log4j.properties through spark.driver.extraJavaOptions to make it work, as so:

```bash
spark-submit --name "CentralLog" \
  --master yarn-client \
  --class example.spark.CentralLog \
  --files /opt/centralLogs/conf/log4j.properties#log4j.properties \
  --conf spark.driver.extraJavaOptions='-Dlog4j.configuration=file:/opt/centralLogs/conf/log4j.properties' \
  --jars $SPARK_CLASSPATH \
  --executor-memory 2g \
  /opt/centralLogs/libProject/produban-paas.jar
```
12-12-2015
08:21 AM
On a parcel installation, your PySpark should already be set up for use with spark-submit. Is there a reason you're looking to set the SPARK_HOME and PYTHONPATH variables manually? These are auto-handled for you by CM, via your /etc/spark/conf/spark-env.sh. Does "spark-submit TestPyEnv.py" in a clean default environment throw an error?
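To try it in a clean default environment, one option is to drop any manual overrides just for that invocation (a sketch; TestPyEnv.py is your own script):

```bash
# Run spark-submit with manually set SPARK_HOME/PYTHONPATH removed from the environment
env -u SPARK_HOME -u PYTHONPATH spark-submit TestPyEnv.py
```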
12-09-2015
07:32 AM
The role-level APIs carry the state, but you're querying the service level. Use the role IDs from the service-level response to then query the roles directly.
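A minimal sketch with curl (host, port, credentials, API version, and the cluster/service/role names are all assumptions to adapt):

```bash
# List a service's roles; each entry includes its roleState and its name (ID)
curl -u admin:admin \
  'http://cm-host:7180/api/v10/clusters/cluster1/services/hdfs1/roles'

# Then query one role directly by the name found above to read its state
curl -u admin:admin \
  'http://cm-host:7180/api/v10/clusters/cluster1/services/hdfs1/roles/hdfs1-NAMENODE-1'
```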
12-08-2015
10:00 PM
This would happen if any of your Java-based daemons (including the CM server) are running on JDK6. Forcing the daemons to run on JDK7 will resolve the issue: you can remove JDK6 to enforce this, or set JAVA_HOME explicitly in /etc/default/cloudera-scm-server to point to JDK7.
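For the CM server, that override could look like the below (a sketch; the JDK install path is an assumption, so point it at wherever your JDK7 actually lives):

```bash
# In /etc/default/cloudera-scm-server (sourced by the CM server init script)
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera  # assumed JDK7 location
```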
12-08-2015
09:50 PM
> As commands in shell scripts are only able to recognize hdfs directories

This is an incorrect assumption. The shell action merely executes the given script file (as it would normally be executed by any process), and does not care about what is within it. Does your script fail with an error? If so, please post the error.
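To illustrate, a hypothetical script such as the below runs fine as a shell action; it executes on whichever NodeManager host Oozie schedules it on:

```bash
#!/bin/bash
# Plain local-filesystem commands work inside a shell action...
ls /tmp
# ...and so do HDFS commands; the action just runs the script as a process.
hdfs dfs -ls /user
```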