Member since: 10-08-2015
Posts: 108
Kudos Received: 62
Solutions: 7
04-17-2018
01:33 PM
Sorry, the mongodb interpreter is not one of Zeppelin's built-in interpreters, so I don't know its mechanism.
05-26-2017
07:54 AM
5 Kudos
Introduction

For a simple PySpark application, you can use `--py-files` to specify its dependencies. A large PySpark application will have many dependencies, possibly including transitive dependencies. Sometimes a large application needs a Python package that has C code to compile before installation. And there are times when you might want to run different versions of Python for different applications. For such scenarios with large PySpark applications, `--py-files` is inconvenient. Fortunately, in the Python world you can create a virtual environment as an isolated Python runtime environment. We recently enabled virtual environments for PySpark in distributed environments; this eases the transition from a local environment to a distributed environment with PySpark. In this article, I will talk about how to use a virtual environment in PySpark. (This feature is currently only supported in yarn mode.)

Prerequisites
Hortonworks supports two approaches for setting up a virtual environment: virtualenv and conda. All nodes must have either virtualenv or conda installed, depending on which virtual environment tool you choose. Either virtualenv or conda should be installed in the same location on all nodes across the cluster.
To install virtualenv, see https://virtualenv.pypa.io/en/stable/installation/. Note that pip is required to run virtualenv; for pip installation instructions, see https://pip.pypa.io/en/stable/installing/.
To install conda, see https://docs.continuum.io/anaconda/install.
Each node must have internet access (for downloading packages).
Python 2.7 or Python 3.x must be installed (with pip).
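Before going further, you can sanity-check these prerequisites on each node with something like the following (a minimal sketch; the exact checks depend on which tool you chose):

which virtualenv && virtualenv --version   # if using the virtualenv approach
which conda && conda --version             # if using the conda approach
python --version && pip --version          # Python 2.7 or 3.x, with pip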
Now I will talk about how to set up a virtual environment in PySpark, using virtualenv and conda. There are two scenarios for using virtualenv in PySpark:

Batch mode, where you launch the PySpark app through spark-submit.
Interactive mode, using a shell or interpreter such as pyspark-shell or Zeppelin pyspark.

In HDP 2.6 we support batch mode, but this post also includes a preview of interactive mode.

Batch mode

For batch mode, I will follow the pattern of first developing the example in a local environment and then moving it to a distributed environment, so that you can follow the same pattern for your own development.

Using virtualenv

In this example we will use the following piece of code, which uses numpy in each map function. We save the code in a file named spark_virtualenv.py.

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="Simple App")
    import numpy as np
    sc.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()

Using virtualenv in the Local Environment

First we will create a virtual environment in the local environment. We highly recommend that you create an isolated virtual environment locally first, so that the move to a distributed virtualenv will be smoother. We use the following command to create and set up env_1 in the local environment. Folder env_1 will be created under the current working directory. You should specify the python version, in case you have multiple versions installed.

virtualenv env_1 -p /usr/local/bin/python3  # create virtual environment env_1

Next, activate the virtualenv:

source env_1/bin/activate  # activate virtualenv

After that you can run PySpark in local mode, where it will run under virtual environment env_1. You will see a "No module" error, because numpy is not installed in this virtual environment. So now let's install numpy through pip:

pip install numpy  # install numpy

After installing numpy, you can use numpy in PySpark apps launched by spark-submit in your local environment. Use the following command:

bin/spark-submit --master local spark_virtualenv.py
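At this point you can also verify the package inside the activated environment itself, independent of Spark (a quick sketch):

python -c "import numpy; print(numpy.__version__)"   # should succeed inside env_1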
Using virtualenv in a Distributed Environment

Now let's move this into a distributed environment. There are two steps for moving from a local development to a distributed environment.

1. Create a requirements file that contains the specifications of your third-party Python dependencies. The following command will put all of the installed Python package info in the current virtual environment into this file, so stay in the virtual environment you created above.

pip freeze > requirements.txt

Here's sample output from the requirements file:

numpy==1.12.0
2. Run the PySpark app through spark-submit. Use the following command to launch it in yarn-client mode:

spark-submit --master yarn-client \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.requirements=/Users/jzhang/github/spark/requirements.txt \
  --conf spark.pyspark.virtualenv.bin.path=/Users/jzhang/anaconda/bin/virtualenv \
  --conf spark.pyspark.python=/usr/local/bin/python3 \
  spark_virtualenv.py

You will see output showing that we have installed numpy on each executor successfully.

Using conda

Next, I will talk about how to create a virtual environment using conda. The process is very similar to virtualenv, but uses different commands. Here is the command to create virtual environment env_conda_1 with Python 2.7 in the local environment. Folder env_conda_1 will be created under the current working directory:

conda create --prefix env_conda_1 python=2.7

Use the following command to activate the virtual environment:

source activate env_conda_1  # activate this virtual environment

Next, install numpy using the conda install command:

conda install numpy

Use the following command to create the requirements file. This command will put all of the installed Python package info into this file, so stay in the virtual environment you created above.

conda list --export > requirements_conda.txt

Run the PySpark job in yarn-client mode:

bin/spark-submit --master yarn-client \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=conda \
  --conf spark.pyspark.virtualenv.requirements=/Users/jzhang/github/spark/requirements_conda.txt \
  --conf spark.pyspark.virtualenv.bin.path=/Users/jzhang/anaconda/bin/conda \
  spark_virtualenv.py

You will see output similar to the local example.

Interactive mode (Preview)

Interactive mode is not yet supported; the following information is a preview. Interactive mode means that you don't need to specify the requirements file when launching PySpark, and you can install packages in your virtualenv at runtime. Interactive mode is very useful for the pyspark shell and notebook environments.

Using Interactive Mode with virtualenv

The following command launches the pyspark shell with virtualenv enabled. The Spark driver and executor processes will create an isolated virtual environment instead of using the default python version running on the host.

bin/pyspark --master yarn-client \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.bin.path=/Users/jzhang/anaconda/bin/virtualenv \
  --conf spark.pyspark.python=/Users/jzhang/anaconda/bin/python

After you launch this pyspark shell, you will have a clean python runtime environment on both the driver and the executors. You can use sc.install_packages to install any python packages that can be installed by pip; for example:

sc.install_packages("numpy")  # install the latest numpy
sc.install_packages("numpy==1.11.0")  # install a specific version of numpy
sc.install_packages(["numpy", "pandas"])  # install multiple python packages

After that, you can use the packages that you just installed:

import numpy
sc.range(4).map(lambda x: numpy.__version__).collect()

Using Interactive Mode with conda

Interactive mode with conda is almost the same as with virtualenv. One exception is that you need to specify spark.pyspark.virtualenv.python_version, because conda needs a python version to create the virtual environment.

bin/pyspark --master yarn-client \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=conda \
  --conf spark.pyspark.virtualenv.bin.path=/Users/jzhang/anaconda/bin/conda \
  --conf spark.pyspark.virtualenv.python_version=3.5

PySpark VirtualEnv Configurations

Property                                  | Description
spark.pyspark.virtualenv.enabled          | Flag to enable virtualenv
spark.pyspark.virtualenv.type             | Type of virtualenv. Valid values are "native" and "conda"
spark.pyspark.virtualenv.requirements     | Requirements file (optional; not required for interactive mode)
spark.pyspark.virtualenv.bin.path         | Location of the virtualenv executable (type native) or the conda executable (type conda)
spark.pyspark.virtualenv.python_version   | Python version for conda (optional; only required when you use conda in interactive mode)
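Since these are regular Spark configuration properties, you could presumably also put the batch-mode settings in conf/spark-defaults.conf instead of repeating --conf flags each time. A minimal sketch under that assumption, reusing the paths from the examples above:

spark.pyspark.virtualenv.enabled       true
spark.pyspark.virtualenv.type          native
spark.pyspark.virtualenv.requirements  /Users/jzhang/github/spark/requirements.txt
spark.pyspark.virtualenv.bin.path      /Users/jzhang/anaconda/bin/virtualenv
spark.pyspark.python                   /usr/local/bin/python3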
Penalty of virtualenv

For each executor, it takes some time to set up the virtualenv (installing the packages). The first time may be very slow. For example, the first time I installed numpy on each node it took almost three minutes, because it needed to download files and compile them into wheel format. The next time it only took three seconds, because numpy was installed from the cached wheel file. (One way to mitigate the first-run cost is sketched after the JIRA links below.)

Related JIRA

https://issues.apache.org/jira/browse/SPARK-13587
https://issues.apache.org/jira/browse/ZEPPELIN-2233
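As mentioned above, one way to reduce the first-run penalty is to pre-build the wheels once and install from that cache afterwards; a minimal sketch (the cache directory is an assumption):

pip wheel -r requirements.txt -w /var/cache/pip-wheels                          # build wheels once
pip install --no-index --find-links=/var/cache/pip-wheels -r requirements.txt  # install from cached wheels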
05-24-2017
09:29 PM
For me, the sandbox file name is HDP_2.6_virtualbox_05_05_2017_14_46_00_hdp.ova.
05-23-2017
10:32 PM
I downloaded the 2.6 sandbox yesterday and hit this issue.
05-23-2017
09:53 AM
Why not update the sandbox image? It is a pretty bad user experience for sandbox users.
02-13-2017
01:38 AM
livy-env.sh is shared by all sessions, which means one Livy instance can only run one version of Python. I would recommend using the Spark configurations spark.pyspark.driver.python and spark.pyspark.python in Spark 2 (HDP 2.6) so that each session can set its own Python version, as sketched below. https://issues.apache.org/jira/browse/SPARK-13081
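For example, a minimal sketch of launching one application with its own Python (the paths and the application file my_app.py are assumptions):

spark-submit \
  --conf spark.pyspark.driver.python=/usr/local/bin/python3 \
  --conf spark.pyspark.python=/usr/local/bin/python3 \
  my_app.py

In Zeppelin's livy interpreter, the same configurations would be set through the interpreter/session configuration rather than on the spark-submit command line.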
01-23-2017
09:34 AM
4 Kudos
Introduction

Apache Zeppelin is a web-based notebook that enables interactive data analytics, while Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. Pig Latin is a very powerful language for data flow processing. One drawback the Pig community complains about is that Pig Latin is not a standard language like SQL, so very few BI tools integrate with it, and it is pretty hard to visualize results from Pig. The good news is that Pig is integrated into Zeppelin 0.7, where you can write Pig Latin and visualize the results.

Use the pig interpreter

The pig interpreter is supported from Zeppelin 0.7.0, so first you need to install Zeppelin; you can refer to this link for how to install and start Zeppelin. Zeppelin supports two kinds of pig interpreters for now:

%pig (default interpreter)
%pig.query

%pig is like the pig grunt shell. Anything you can run in the pig grunt shell can be run in the %pig interpreter; it is used for running pig scripts where you don't need to visualize the data, so it is suitable for data munging. %pig.query is a little different from %pig. It is used for exploratory data analysis in Pig Latin, where you can leverage Zeppelin's visualization ability. There are two minor differences in the last statement between %pig and %pig.query:

No pig alias in the last statement in %pig.query (see the examples below).
The last statement must be a single line in %pig.query.

Here I will give four simple examples to illustrate how to use these two interpreters. These four examples are another implementation of the Zeppelin tutorial, which uses Spark; we just do the same thing using Pig instead. This script does the data preprocessing:

%pig
bankText = load 'bank.csv' using PigStorage(';');
bank = foreach bankText generate $0 as age, $1 as job, $2 as marital, $3 as education, $5 as balance;
bank = filter bank by age != '"age"';
bank = foreach bank generate (int)age, REPLACE(job,'"','') as job, REPLACE(marital, '"', '') as marital, (int)(REPLACE(balance, '"', '')) as balance;
store bank into 'clean_bank.csv' using PigStorage(';'); -- this statement is optional; it just shows that most of the time %pig is used for data munging before querying the data.

Get the number of each age where age is less than 30:

%pig.query
bank_data = filter bank by age < 30;
b = group bank_data by age;
foreach b generate group, COUNT($1);

The same as above, but using a dynamic text form so that the user can specify the variable maxAge in a textbox (see screenshot below). Dynamic forms are a very cool feature of Zeppelin; you can refer to this link for details.

%pig.query
bank_data = filter bank by age < ${maxAge=40};
b = group bank_data by age;
foreach b generate group, COUNT($1);

Get the number of each age for a specific marital type, again using a dynamic form. The user can choose the marital type in the dropdown list (see screenshot below):

%pig.query
bank_data = filter bank by marital=='${marital=single,single|divorced|married}';
b = group bank_data by age;
foreach b generate group, COUNT($1);

The following is a screenshot of these four examples. You can also check the pig tutorial note, which contains all the code of this blog in Zeppelin.

Configuration

The pig interpreter in Zeppelin supports all the execution engines that Pig supports (a minimal zeppelin-env.sh sketch for the cluster modes appears at the end of this post):

Local Mode: nothing needs to be done for local mode.
MapReduce Mode: HADOOP_CONF_DIR needs to be specified in ZEPPELIN_CONF_DIR/zeppelin-env.sh.
Tez Local Mode: nothing needs to be done for tez local mode.
Tez Mode: HADOOP_CONF_DIR and TEZ_CONF_DIR need to be specified in ZEPPELIN_CONF_DIR/zeppelin-env.sh.

The default mode is mapreduce, but you can change that in the interpreter setting. You can also set any pig configuration in the interpreter setting page. Here's one screenshot of that.

Future work

This is the first phase of the work to integrate Pig into Zeppelin. There is a lot of work to do in the future. Here is my current to-do list:

Integrate the Spark engine so that we can use Spark SQL together with Pig Latin.
Integrate Spark MLlib so that we can use Pig Latin to do machine learning.
Add a new interpreter %pig.udf to allow users to write Java UDFs in Zeppelin.
Integrate more closely with DataFu.

If you have any other new ideas, please contact me at jzhang@hortonworks.com, or you can file a ticket in the Apache Zeppelin JIRA: https://issues.apache.org/jira/browse/ZEPPELIN
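As referenced in the mode list above, here is a minimal zeppelin-env.sh sketch for the MapReduce and Tez modes; the exact paths are assumptions and vary per cluster:

export HADOOP_CONF_DIR=/etc/hadoop/conf   # required for MapReduce and Tez modes (assumed path)
export TEZ_CONF_DIR=/etc/tez/conf         # additionally required for Tez mode (assumed path)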
12-09-2016
01:46 PM
8 Kudos
Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. The latest version of Zeppelin is 0.6.2 as this article is written. Although the community has made a lot of effort to improve it, you may still sometimes hit weird issues due to environment problems, wrong configuration, or bugs in Zeppelin itself. This article illustrates how to diagnose Zeppelin when you run into an issue and can't figure out what's wrong.

Zeppelin Architecture

Before going into details, I'd like to give an illustration of Zeppelin's architecture, so that we can understand where to diagnose. [Diagram: Zeppelin architecture.] Overall it has three layers:

Frontend
Zeppelin Server
Interpreter Process

I will not go into the details here; I just want you to have an overall picture of what components Zeppelin has. Usually we hit issues on the Zeppelin server and the interpreter process, so next I will talk about them one by one.

Diagnose Zeppelin Server

The most efficient tool for diagnosing a piece of software is the log, the log, and the log again. Usually you can figure out what's wrong from the log. The Zeppelin server's log is in the folder $ZEPPELIN_LOG_DIR; it is /var/log/zeppelin for HDP, and $ZEPPELIN_HOME/logs if you use the Apache Zeppelin distribution and haven't set ZEPPELIN_LOG_DIR. The log file name is zeppelin-<user>-<host>.log; there are other files under the log dir, which I will talk about in the next section. Zeppelin uses log4j, and its default log level is INFO. log4j.properties is located in /etc/zeppelin/conf for HDP, and in $ZEPPELIN_HOME/conf for the Apache Zeppelin distribution if you didn't specify ZEPPELIN_CONF_DIR. You can update log4j.properties to change the log level: first change log4j.appender.dailyfile.Threshold to DEBUG, then add package-level log settings. Here's my log4j.properties for your reference:

log4j.rootLogger = INFO, dailyfile
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p [%d] ({%t} %F[%M]:%L) - %m%n
log4j.appender.dailyfile.DatePattern=.yyyy-MM-dd
log4j.appender.dailyfile.Threshold = DEBUG
log4j.appender.dailyfile = org.apache.log4j.DailyRollingFileAppender
log4j.appender.dailyfile.File = ${zeppelin.log.file}
log4j.appender.dailyfile.layout = org.apache.log4j.PatternLayout
log4j.appender.dailyfile.layout.ConversionPattern=%5p [%d] ({%t} %F[%M]:%L) - %m%n
log4j.logger.org.apache.zeppelin.interpreter.InterpreterFactory=DEBUG
log4j.logger.org.apache.zeppelin.notebook.Paragraph=DEBUG
log4j.logger.org.apache.zeppelin.scheduler=DEBUG
log4j.logger.org.apache.zeppelin.livy=DEBUG
log4j.logger.org.apache.zeppelin.flink=DEBUG
log4j.logger.org.apache.zeppelin.spark=DEBUG
log4j.logger.org.apache.zeppelin.interpreter.util=DEBUG
log4j.logger.org.apache.zeppelin.interpreter.remote=DEBUG
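With DEBUG enabled, it is convenient to follow the server log while you reproduce the issue; a quick sketch assuming the HDP log location:

tail -f /var/log/zeppelin/zeppelin-*.log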
Diagnose Interpreter Process

According to my experience, most problems happen on the interpreter process side. There are two kinds of scenarios:

The interpreter process fails to launch.
The interpreter process launches but fails to run a paragraph.

Zeppelin launches the interpreter process by calling interpreter.sh, which is located in $ZEPPELIN_HOME/bin. Each interpreter process has one log file located in the $ZEPPELIN_LOG_DIR I mentioned before; the log file pattern is zeppelin-interpreter-<interpreter_name>-<user>-<host>.log. The interpreter process shares the same log4j.properties with the Zeppelin server, so you can change the log configuration as mentioned above. Usually you can check the interpreter log file to figure out what's wrong. But sometimes there is no such log file; usually this is because interpreter.sh failed to launch the interpreter process. For this case, you need to modify log4j.properties as above (change log4j.appender.dailyfile.Threshold to DEBUG and change the log level of log4j.logger.org.apache.zeppelin.interpreter.remote to DEBUG). It is also very useful to add the following line to zeppelin-env.sh so that you can see the spark-submit command in the log:

export SPARK_PRINT_LAUNCH_COMMAND=true

The following is the output on my machine. You can see the spark-submit command, which configuration we use, and what classpath we use; usually that gives you all the context you need to figure out what's wrong.

INFO [2016-12-09 11:50:31,640] ({pool-2-thread-2} RemoteInterpreterManagedProcess.java[start]:120) - Run interpreter process [/Users/jzhang/github/zeppelin/bin/interpreter.sh, -d, /Users/jzhang/github/zeppelin/interpreter/spark, -p, 56009, -l, /Users/jzhang/github/zeppelin/local-repo/2C4XVCNK1]
DEBUG [2016-12-09 11:50:31,642] ({pool-2-thread-2} RemoteInterpreterUtils.java[checkIfRemoteEndpointAccessible]:53) - Remote endpoint 'localhost:56009' is not accessible (might be initializing): Connection refused
DEBUG [2016-12-09 11:50:31,853] ({Exec Stream Pumper} RemoteInterpreterManagedProcess.java[processLine]:189) - Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/bin/java -cp /Users/jzhang/github/zeppelin/interpreter/spark/*:/Users/jzhang/github/zeppelin/zeppelin-interpreter/target/lib/*:/Users/jzhang/github/zeppelin/zeppelin-interpreter/target/classes/:/Users/jzhang/github/zeppelin/zeppelin-interpreter/target/test-classes/:/Users/jzhang/github/zeppelin/zeppelin-zengine/target/test-classes/:/Users/jzhang/github/zeppelin/interpreter/spark/zeppelin-spark_2.10-0.7.0-SNAPSHOT.jar:/Users/jzhang/Java/lib/spark-2.0.2/conf/:/Users/jzhang/Java/lib/spark-2.0.2/assembly/target/scala-2.11/jars/*:/Users/jzhang/Java/lib/hadoop-2.7.2/etc/hadoop/ -Xmx1g -Dlog4j.debug=true -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///Users/jzhang/github/zeppelin/conf/log4j.properties -Dzeppelin.log.file=/Users/jzhang/github/zeppelin/logs/zeppelin-interpreter-spark-jzhang-jzhangMBPr.local.log org.apache.spark.deploy.SparkSubmit --conf spark.driver.extraClassPath=::/Users/jzhang/github/zeppelin/interpreter/spark/*:/Users/jzhang/github/zeppelin/zeppelin-interpreter/target/lib/*::/Users/jzhang/github/zeppelin/zeppelin-interpreter/target/classes:/Users/jzhang/github/zeppelin/zeppelin-interpreter/target/test-classes:/Users/jzhang/github/zeppelin/zeppelin-zengine/target/test-classes:/Users/jzhang/github/zeppelin/interpreter/spark/zeppelin-spark_2.10-0.7.0-SNAPSHOT.jar --conf spark.driver.extraJavaOptions=-Dlog4j.debug=true -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///Users/jzhang/github/zeppelin/conf/log4j.properties -Dzeppelin.log.file=/Users/jzhang/github/zeppelin/logs/zeppelin-interpreter-spark-jzhang-jzhangMBPr.local.log --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer /Users/jzhang/github/zeppelin/interpreter/spark/zeppelin-spark_2.10-0.7.0-SNAPSHOT.jar 56009
DEBUG [2016-12-09 11:50:31,853] ({Exec Stream Pumper} RemoteInterpreterManagedProcess.java[processLine]:189) - ========================================
....

Advanced Diagnose Approach

Sometimes the logs may still not be sufficient, and you need to debug the Zeppelin server process and the interpreter process directly. In that case, configure the following environment variables in zeppelin-env.sh:

export ZEPPELIN_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
export ZEPPELIN_INTP_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=6006"

Now you can remote-debug the Zeppelin server and interpreter process like any other Java process (port 5005 for the Zeppelin server process, port 6006 for the interpreter process), for example as sketched below.
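A minimal way to attach from the command line (you can equally use an IDE's remote-debug configuration; note that with suspend=y the JVM waits until a debugger attaches):

jdb -attach localhost:5005   # attach to the Zeppelin server process
jdb -attach localhost:6006   # attach to the interpreter process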
09-06-2016
03:27 AM
Yes, you can install both Spark 1.6 and 2.0 on HDP 2.5.