Member since: 01-05-2016
Posts: 55
Kudos Received: 37
Solutions: 6
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 806 | 10-21-2019 05:16 AM
 | 3766 | 01-29-2018 07:05 AM
 | 2581 | 06-27-2017 06:42 AM
 | 37063 | 05-26-2016 04:05 AM
 | 26721 | 05-17-2016 02:15 PM
05-17-2016
02:15 PM
24 Kudos
Hi, please perform the following actions:

1) Fill in "Source Brokers List" --> "nameOfClouderaKafkaBrokerServer.yourdomain.com:9092". This is the server (or servers) where you configured the Kafka Broker (NOT the MirrorMaker).
2) Fill in "Destination Brokers List" --> "nameOfRemoteBrokerServer.otherdomain.com:9092". This is supposed to be a remote cluster that will receive the topics sent over by your MirrorMaker. If you have one, put in that one; otherwise just put in another server in your network, whichever server. Please note that both these server names must be FQDNs and resolvable by your DNS (or hosts file), otherwise you'll get other errors. Also, the format with the trailing port number is mandatory!
3) Click "Continue". The service will NOT start (error). Do not navigate away from that screen.
4) Open another Cloudera Manager session in another browser pane. You should now see "Kafka" in the list of services (red, but it should be there). Click on the Kafka service and then "Configure".
5) Search for the "java heap space" configuration property. The default Java heap space you'll find already set should be 50 MB. Put in at least 256 MB; the original value is simply not enough.
6) Now search for the "whitelist" configuration property. In the field, put in "(?!x)x" (without the quotation marks). That's a regular expression that does not match anything (see the quick check after this list). Given that a whitelist is apparently mandatory for the MirrorMaker service to start, and assuming you don't want to replicate any topics remotely right now, just put in something that won't replicate anything, e.g. that regular expression.
7) Save the changes and go back to the original configuration screen in the other browser pane. Click "Retry" (or whatever), or even exit that screen and manually restart the Kafka service in Cloudera Manager.

That should work, at least it did for me! HTH
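Just to illustrate the claim that "(?!x)x" cannot match anything, here is a quick plain-Python check (nothing Kafka-specific, purely a sketch):

import re

# "(?!x)x" can never match: the lookahead demands that the next character
# is NOT "x", and then the pattern immediately tries to match "x" at that
# very position -- a contradiction, so no input ever matches.
pattern = re.compile(r"(?!x)x")

for text in ["", "x", "xx", "my.topic", "anything at all"]:
    assert pattern.search(text) is None

print("'(?!x)x' matched nothing, as expected")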
05-14-2016
03:39 PM
Update: if I use "spark-submit", the script runs successfully. Syntax used for "spark-submit":

spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--executor-memory 500M \
--total-executor-cores 1 \
hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py \
10

Excerpt from the output log:

16/05/15 00:30:57 INFO parse.ParseDriver: Parsing command: select * from sales_fact
16/05/15 00:30:58 INFO parse.ParseDriver: Parse Completed
16/05/15 00:30:58 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.1
16/05/15 00:30:58 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.1
16/05/15 00:30:59 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/05/15 00:30:59 INFO spark.SparkContext: Invoking stop() from shutdown hook
05-14-2016
08:53 AM
Hi all, my CDH test rig is as follows: CDH 5.5.1, Spark 1.5.0, Oozie 4.1.0. I have successfully created a simple Oozie Workflow that spawns a Spark Action using the HUE interface. My intention is to use YARN in cluster mode to run the Workflow/Action. It's a Python script, which is as follows (just a test):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
### (1) ALTERNATIVELY USE ONE OF THE FOLLOWING CONTEXT DEFINITIONS:
sqlCtx = SQLContext(sc)
#sqlCtx = HiveContext(sc)
### (2) IF USING HIVECONTEXT, OPTIONALLY SET THE DATABASE IN USE (SHOULDN'T BE NECESSARY):
#sqlCtx.sql("use default")
### (3) CREATE MAIN DATAFRAME. TRY EITHER SYNTAX, COMBINED WITH THE DIFFERENT CONTEXT DEFINITIONS FROM (1):
#cronologico_DF = sqlCtx.table("sales_fact")
cronologico_DF = sqlCtx.sql("select * from sales_fact")
### (4) ANOTHER DATAFRAME
extraction_cronologico_DF = cronologico_DF.select("PRODUCT_KEY")
### (5) USELESS PRINT STATEMENT:
print 'a'

When I run the Workflow, a MapReduce job is started. Shortly after, a Spark job is spawned (I can see that from the Job Browser). The Spark job fails with the following error (excerpt from the log file of the Spark action):

py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: java.lang.RuntimeException: Table Not Found: sales_fact

This is my "workflow.xml":

<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5">
<global>
<job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
</property>
</configuration>
</global>
<start to="spark-3ca0"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-3ca0">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
</configuration>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

This is my "job.properties":

oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020

Please note that:

1) I've also uploaded "hive-site.xml" to the same directory as the 2 files described above. As you can see from "workflow.xml", it should also be picked up.
2) The "test.py" script is under a "lib" directory in the Workspace created by HUE, and it gets picked up. In that directory I also took care of uploading several JARs belonging to some Derby DB connector, probably required to collect stats, to avoid other exceptions being thrown.
3) I've tried adding the workflow property "oozie.action.sharelib.for.spark" with the value "hcatalog,hive,hive2", with no success.
4) As you can see in the Python script above, I've alternately used an SQLContext or a HiveContext object inside the script. The results are the same (though the error message is slightly different).
5) The ShareLib should be OK too:

oozie admin -shareliblist
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig

I suspect the Hive Metastore (where the table definitions live) is not being read; that's probably the issue. But I've run out of ideas and I'm not able to get it working... Thanks in advance for any feedback!
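For reference, a minimal diagnostic script along these lines (just a sketch; the app name is made up) should show whether the Oozie-launched job actually sees the cluster Metastore or only an empty local Derby one:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Diagnostic only: list what the Oozie-launched job can actually see.
sconf = SparkConf().setAppName("MetastoreCheck")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)

# If this prints no databases/tables beyond an empty "default", the job is
# most likely talking to a fresh local Derby metastore instead of the
# cluster metastore configured in hive-site.xml.
for row in sqlCtx.sql("show databases").collect():
    print(row)
for row in sqlCtx.sql("show tables").collect():
    print(row)

sc.stop()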
Labels:
- Apache Oozie
05-13-2016
03:03 AM
2 Kudos
I got past this! Still no cigar, though. Now I have another error, but I'm going to work on it. It's something different now...

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist
java.io.FileNotFoundException: File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist

Many thanks for your help. I'd never have been able to figure this out by myself!
05-13-2016
02:42 AM
Hi Ben, thanks a whole lot for your reply. May I ask where exactly you specified that setting?

- In the GUI, in some particular field?
- In "workflow.xml", in the job's directory in HDFS? If yes: as an "arg", as a "property", or..?
- In "job.properties", in the job's directory in HDFS? If yes: how?
- In some other file, e.g. "/etc/alternatives/spark-conf/spark-defaults.conf"? If yes: how?

A snippet of your code would be extremely appreciated! I'm asking because I've tried all of the above with your suggestion but did not succeed. Thanks again for your help.
04-16-2016
09:52 AM
1 Kudo
Hello, I'm using CDH 5.5.1 with Spark 1.5.0. I'm unsuccessfully trying to execute a simple Spark action (Python script) via Oozie. For now I just want to be able to run something at all; the script is still a silly example and doesn't really do anything. It is as follows:

## IMPORT FUNCTIONS
from pyspark.sql.functions import *
## CREATE MAIN DATAFRAME
eventi_DF = sqlContext.table("eventi")

I created a simple Oozie Workflow from the Hue GUI. I used the following settings for the Spark action:

SPARK MASTER: yarn-cluster
MODE: cluster
APP NAME: MySpark
JARS / PY FILES: lib/test.py
MAIN CLASS: org.apache.spark.examples.mllib.JavaALS
ARGUMENTS: <No Arguments Defined>

I've uploaded the script to HDFS under the workspace directory "/user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib", and I'm sure it gets picked up (just to understand that it was meant to be put in this directory I had to work a little, fighting a ""test.py" not found" exception, which is now gone). As of now, when I try to run the Workflow by pressing the "Play" button in the GUI, this is what I get in the action log file:

>>> Invoking Spark class now >>>
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, key not found: SPARK_HOME
java.util.NoSuchElementException: key not found: SPARK_HOME
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at org.apache.spark.deploy.yarn.Client$$anonfun$findPySparkArchives$2.apply(Client.scala:943)
at org.apache.spark.deploy.yarn.Client$$anonfun$findPySparkArchives$2.apply(Client.scala:942)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.deploy.yarn.Client.findPySparkArchives(Client.scala:942)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:630)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:124)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:914)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:973)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:185)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:176)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:49)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:236)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:378)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:296)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:181)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://mnethdp01.glistencube.com:8020/user/admin/oozie-oozi/0000000-160416120358569-oozie-oozi-W/spark-3ba6--spark/action-data.seq
Oozie Launcher ends

Now, I guess the problem is: "Failing Oozie Launcher, ... key not found: SPARK_HOME". I've been trying hard to set this SPARK_HOME key in different places. Things I've tried include the following:

- Modified the Spark config in Cloudera Manager and reloaded the configuration, via "Spark Service Environment Advanced Configuration Snippet (Safety Valve)":
SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark

- Modified the Oozie config in Cloudera Manager and reloaded the configuration, via "Oozie Service Environment Advanced Configuration Snippet (Safety Valve)":
SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark

- Used different ways of invoking the Spark action inside the Spark action definition:

SPARK MASTER: local[*]
SPARK MODE: client
SPARK MASTER: yarn-cluster
SPARK MODE: cluster
SPARK MASTER: yarn-client
SPARK MODE: client

- Modified the "/etc/alternatives/spark-conf/spark-defaults.conf" file manually, adding the following to it (on just one node, by the way):

spark.yarn.appMasterEnv.SPARK_HOME /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark
spark.executorEnv.SPARK_HOME /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark

All of the above with no success. Apparently I'm not able to set the required key anywhere. What am I doing wrong? Isn't this meant to be pretty straightforward? Thanks in advance for any insights.
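For completeness, the same two properties from the spark-defaults.conf attempt can also be set per application from inside the script via SparkConf. This is just a sketch for reference: as far as I know these properties only set SPARK_HOME inside the YARN ApplicationMaster and executor containers, not in the process that submits the job, so they are not necessarily a fix for the launcher error above:

from pyspark import SparkConf, SparkContext

# Per-application equivalent of the two spark-defaults.conf lines above.
# These only affect the environment of the AM and executor containers.
sconf = (SparkConf()
         .setAppName("MySpark")
         .set("spark.yarn.appMasterEnv.SPARK_HOME",
              "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark")
         .set("spark.executorEnv.SPARK_HOME",
              "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark"))
sc = SparkContext(conf=sconf)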
Labels:
- Apache Oozie
03-06-2016
11:13 AM
Thanks for the suggestion about registering the classes and for the additional info. I have to say that I had already read the link you sent me, but I didn't really get the meaning of "you have to register the classes first". In fact, I gave up on my attempts at the time and stuck with the Java serializer for my testing purposes. Maybe I'll need to get back to this in the future, and I'll do it with additional knowledge now. Thanks a lot. Also, I have to say that now, after reading all this, I find it a bit strange that Cloudera sets Kryo as the default serializer. Anyway.
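For anyone reading this later, my understanding of "registering the classes first" boils down to something like the sketch below (the class name is only an example of a JVM class that ends up being serialized, not a verified fix for the FPGrowth problem):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("KryoRegistrationExample")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Tell Kryo up front which JVM classes it will have to serialize,
        # instead of letting it write out full class names at runtime.
        .set("spark.kryo.classesToRegister",
             "org.apache.spark.mllib.fpm.FPGrowth$FreqItemset"))
sc = SparkContext(conf=conf)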
02-07-2016
02:13 AM
Hi GeoffD, I don't know if what follows is going to help somehow, but I've had the chance to try a few different scenarios lately, and I've noticed that a few inconsistencies can also occur with the Parquet file format. For instance: if you have data coming in from 2 different sources, e.g. a Sqoop job creating a Parquet file and a Pig job creating another Parquet output, you have to take care to define the names of the output fields as all uppercase. Also, when you create the Managed Table, you have to refer to a "LIKE PARQUET" file with uppercase field names to be inferred.

If you don't take care of all that, what I have experienced is that if you take the output files from several Pig and Sqoop procedures and put them under the table's directory, the "NULL field" issue will happen, even if the schemas are apparently identical. What I have found out is that under some conditions (e.g. when you rename fields in a Sqoop or Pig job), the resulting Parquet files will differ: the Sqoop job will ALWAYS create uppercase field names, whereas the corresponding Pig job keeps the exact case you specified inside the Pig script.

Now, I haven't dug deeper into all this, as getting to this result is enough for me. But I've started to think that in the new Spark, at some stage, something goes wrong when parsing the field names, probably due to a case issue. So maybe you'd want to experiment around this, using your CSV and plain text files too (a quick way to compare the schemas is sketched below). Maybe you'll come up with a final solution for plain text files too 🙂 HTH
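For reference, a quick way to compare the field-name casing of two Parquet outputs from pyspark (the paths are hypothetical placeholders for wherever Sqoop and Pig wrote their files):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="ParquetSchemaCompare")
sqlContext = SQLContext(sc)

# Placeholder paths: point these at the actual Sqoop and Pig output directories.
df_sqoop = sqlContext.read.parquet("hdfs:///data/sqoop_output")
df_pig = sqlContext.read.parquet("hdfs:///data/pig_output")

print(df_sqoop.columns)   # Sqoop output: field names tend to come out uppercase
print(df_pig.columns)     # Pig output: keeps whatever casing the script used

sc.stop()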
01-21-2016
02:35 AM
Hi, I'm experimenting with a small CDH virtual cluster and I'm facing issues with serializers. More specifically, I'm trying things with the "pyspark.mllib.fpm.FPGrowth" class (Machine Learning). This is what I'm trying to do:

from pyspark.mllib.fpm import FPGrowth
fp_model = FPGrowth.train(my_words_DF,0.3)
freqset = sorted(fp_model.freqItemsets().collect())

At this point, depending on which serializer I have configured in Spark, I get 2 different outcomes.

If in "Cloudera Manager --> Spark --> Configuration --> Spark Data Serializer" I configure "org.apache.spark.serializer.JavaSerializer", then everything works FINE.

If in "Cloudera Manager --> Spark --> Configuration --> Spark Data Serializer" I configure "org.apache.spark.serializer.KryoSerializer" (which is the DEFAULT setting, by the way), when I collect the "freqItemsets" I get the following exception:

com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set final scala.collection.mutable.ListBuffer field org.apache.spark.mllib.fpm.FPTree$Summary.nodes to scala.collection.mutable.ArrayBuffer

This exception is confirmed to be a consequence of an unresolved bug with using Kryo with FPGrowth, tracked in the following thread: https://issues.apache.org/jira/browse/SPARK-7483

Now my questions:

1) What is really the difference between the Kryo and Java serializers?
2) Is one serializer definitely better in most use cases, and if yes, which one?
3) Is it possible to dynamically switch between the 2 serializers without exiting my Spark session and/or changing/redeploying the Spark configuration in Cloudera Manager? How?

Thanks in advance for any insight.
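Regarding question 3, my understanding so far is that "spark.serializer" is read when the SparkContext is created, so it cannot be switched inside a running session; at best it can be chosen per application at startup, along these lines (just a sketch):

from pyspark import SparkConf, SparkContext

# Choose the serializer for this application only, instead of cluster-wide
# in Cloudera Manager. This has to happen before the SparkContext exists.
conf = (SparkConf()
        .setAppName("FPGrowthJavaSer")
        .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer"))
sc = SparkContext(conf=conf)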
01-06-2016
02:28 AM
1 Kudo
Finally found a solution (or a workaround, I don't know exactly what to call it, since apparently what I described above was not my fault in the end, but was due to something that probably broke after Spark 1.3.0). Using the "Parquet" storage file format apparently does the trick. It works smoothly and with no warnings. I think I'm going to use this file format from now on, at least until they fix the partitioned Avro thing. Hope this helps somebody in the future 🙂
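Just to make "using Parquet" concrete, this is roughly what I mean (a sketch only; the table name and partition columns are made up for illustration):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="ParquetPartitionedWrite")
sqlCtx = HiveContext(sc)

# Made-up source table and partition columns: write the data out as Parquet,
# partitioned on disk by the two columns, instead of using partitioned Avro.
df = sqlCtx.table("sales_fact")
df.write.partitionBy("YEAR", "MONTH").parquet("hdfs:///user/hive/warehouse/sales_fact_parquet")

sc.stop()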