Member since: 02-27-2020
Posts: 173
Kudos Received: 42
Solutions: 48
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1046 | 11-29-2023 01:16 PM |
| | 1145 | 10-27-2023 04:29 PM |
| | 1138 | 07-07-2023 10:20 AM |
| | 2489 | 03-21-2023 08:35 AM |
| | 886 | 01-25-2023 08:50 PM |
10-06-2020
10:05 AM
The error seems to indicate that the source JSON is malformed. Check where the data is stored and look at the JSON structure: each row should be one self-contained JSON object. Please post a screenshot here. Also, did you add the necessary jar to Hive (hive-serdes-1.0-SNAPSHOT.jar)? I assume you are following this example: https://github.com/cloudera/cdh-twitter-example Finally, you can try a different SerDe as shown in this topic: https://community.cloudera.com/t5/Support-Questions/hive-table-error/td-p/127271 Or try this solution on Stack Overflow: https://stackoverflow.com/questions/32416555/twitter-sentiment-analysis
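For reference, a minimal table definition in the style of the cdh-twitter-example; the jar path, columns, and HDFS location below are assumptions, so adjust them to your data:

ADD JAR hive-serdes-1.0-SNAPSHOT.jar;
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hive/warehouse/tweets';

Each line in the files under that location must contain exactly one JSON object for the SerDe to parse it.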
09-30-2020
10:20 PM
You can try doing the import with the --hs2-url parameter as described in this Cloudera doc. This is another approach to getting a sqoop import to connect to Hive, given that the standard way with HADOOP_HOME doesn't seem to work. Also note that the Sqoop parameters should be --hive-database default --hive-table jobs. This is related to HIVE-16907, and while your current syntax is still accepted by Sqoop, it may soon be rejected by validation. Hope this helps.
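As a sketch, the full import could look like this; the connection string, host names, and credentials are placeholders for your environment:

sqoop import \
  --connect jdbc:mysql://db-host/source_db \
  --username myuser -P \
  --table jobs \
  --hive-import \
  --hs2-url "jdbc:hive2://hs2-host:10000/default" \
  --hive-database default \
  --hive-table jobs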
09-29-2020
02:38 PM
3 Kudos
By default, Cloudera Machine Learning (CML) ships a Jupyter kernel as part of the base engine images. Data scientists often prefer a specialized custom kernel in Jupyter that makes their work more efficient. In this community post, we will walk through how to customize a Docker container image with a sparkmagic Jupyter kernel and how to deploy it to a CML workspace.
Prerequisites:
Admin privileges in a CML workspace
Local Docker client with access to Docker Hub or internal Docker registry
Step 1. Choose a custom Jupyter kernel.
Jupyter kernels are purpose-built add-ons to the basic Python notebook editor. For this tutorial, I chose sparkmagic, a kernel that provides convenient features for working with Spark, like keeping SQL syntax clean in a cell. Sparkmagic relies on Livy to communicate with the Spark cluster. As of this writing, Livy is not supported in CML when running Spark on Kubernetes. However, a classic Spark cluster (for example, on Data Hub) will work with Livy and therefore with sparkmagic. For now, you simply need to know that installing sparkmagic is done with the following sequence of commands:
pip3 install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
Note: The third command is executed after you cd into the directory created by the install. This location is platform dependent and can be determined by running pip3 show sparkmagic after the install. We’ll have to take care of this in the Docker image definition.
Step 2. Customize your Docker Image
To create a custom Docker image, we first create a text file (I called it magic-dockr) that specifies the base image (CML base engine on Ubuntu) along with the additional libraries we want to install. I will use CML to do the majority of the work.
First, create the below docker file in your CML project.
# Dockerfile
# Specify a Cloudera Machine Learning base image
FROM docker.repository.cloudera.com/cdsw/engine:9-cml1.1
# Update packages on the base image and install sparkmagic with its Jupyter kernel
RUN apt-get update
RUN pip3 install sparkmagic
RUN jupyter nbextension enable --py --sys-prefix widgetsnbextension
RUN jupyter-kernelspec install --user $(pip3 show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel
Now we use this image definition to build a deployable Docker container. Run the following commands in an environment where docker.io binaries are installed.
docker build -t <your-repository>/cml-sparkmagic:v1.0 . -f magic-dockr
docker push <your-repository>/cml-sparkmagic:v1.0
This will build and distribute your Docker image to a repository of your choosing.
Step 3. Add the custom image to CML
There are two steps to make the custom kernel available in your project: one to add the image to the CML workspace, and the other to enable the image for the project you are working on.
The first step requires Admin privileges. From the blade menu on the left, select Admin, then click the Engines tab. In the Engine Images section, enter the name of your custom image (e.g., Sparkmagic Kernel) and the repository tag you used in Step 2. Click Add.
Once the engine is added, we’ll need to tell CML how to launch a Jupyter notebook when this image is used to run a session. Click the Edit button next to the Sparkmagic Kernel you’ve added. Click + New Editor in the window that opens.
Enter the editor name as Jupyter Notebook and for the command use the following:
/usr/local/bin/jupyter-notebook --no-browser --ip=127.0.0.1 --port=8090 --NotebookApp.token= --NotebookApp.allow_remote_access=True --log-level=ERROR
Note that port 8090 is the default port, unless your administrator changed it.
Then click Save, and Save again. At this point, CML knows where to find your custom kernel and which editor to launch when a session starts.
Now we are ready to enable this custom engine inside a project.
Step 4. Enable the custom engine in your project
Open a project where you would like to use your custom kernel. For me, it’s a project called Custom Kernel Project (yes, I’m not very creative when it comes to names). In the left panel, click Project Settings, then go to the Engine tab. In the Engine Image section, select your custom engine image from the drop-down.
To test the engine, go to Sessions, and create a new session. You’ll see that Engine Image is the custom Docker image you’ve created in Step 2. Name your session and select Jupyter Notebook as your Editor.
When the session launches, in the Jupyter notebook interface you’ll be able to select PySpark when creating a new notebook.
You can start with the %%help magic and follow along with the sparkmagic documentation. Specifically, you’ll want to configure a connection to a Spark cluster using the provided JSON template, sketched below.
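For illustration, the endpoint section of sparkmagic's config file (typically ~/.sparkmagic/config.json) looks roughly like this; the Livy URL is a placeholder for your cluster's endpoint:

{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-host:8998",
    "auth": "None"
  }
}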
That’s it!
CML brings you the flexibility to run any third-party editor on the platform, making development more efficient for data scientists and data engineers. Note that while this article covered the sparkmagic custom kernel, the same procedure can be applied to any kernel you wish to run with Jupyter Notebook or JupyterLab.
Reference:
CML Docs: Creating a Customized Engine Image
Sparkmagic Docs
09-16-2020
12:21 PM
If you are doing this through Hue, please post the generated XML once you submit the job.
09-16-2020
12:20 PM
Hi James, Thanks for clarifying your question. It's true that there is no native functionality for this; however, it is possible to change the action name in a slightly hacky way:
1. In the edit mode of your Oozie workflow, click on the name of the node and note its ID.
2. Save and export your workflow. This gives you access to a JSON file that you can edit.
3. In that JSON file, search for -[NODE ID]\" and replace it with your desired name for the node, as sketched below. All references to the old node ID should be replaced to keep them consistent. Save the file.
4. Import the JSON back into Hue. This updates your existing workflow, and the generated Oozie XML will now have the node name you want.
Hope this helps. Regards, Alex
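For step 3, one way to do the replacement from a shell; the node ID shell-1234 and the new name my-action here are hypothetical, and -i.bak keeps a backup of the original file:

sed -i.bak 's/-shell-1234\\"/-my-action\\"/g' workflow.json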
09-15-2020
10:51 AM
Hi James, You should be able to do that for any action when you define it: <action name="firstparallejob"> If this solution works, please accept it as such. Regards, Alex
09-15-2020
10:43 AM
Hi James, Could you share your XML action script for that SSH action? In general, the following example shows how to pass arguments to your shell (from the Oozie docs): <workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myssjob">
        <ssh>
            <host>foo@bar.com</host>
            <command>YOURCOMMAND</command>
            <args>${wf:user()}</args>
            <args>${wf:lastErrorNode()}</args>
        </ssh>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app> Note the ${} syntax of the parameters. If this helps, please accept this as a solution. Regards, Alex
07-24-2020
10:29 PM
In your query, you have a typo in "DELIMITED". Please double-check and try again.
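For reference, the correctly spelled clause looks like this; the table name and columns are placeholders:

CREATE TABLE example (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;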
07-21-2020
11:16 PM
Did you follow the naming conventions for Python modules specified in the Python documentation? Is your module p1 in the same folder as module p2?
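As a quick sanity check, assuming modules named p1 and p2 as in your question:

# p2.py, placed in the same directory as p1.py
import p1            # raises ModuleNotFoundError if p1.py is not on sys.path
print(p1.__file__)   # shows which file was actually imported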