Member since: 01-05-2016
Posts: 55
Kudos Received: 37
Solutions: 6
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 806 | 10-21-2019 05:16 AM
 | 3766 | 01-29-2018 07:05 AM
 | 2581 | 06-27-2017 06:42 AM
 | 37063 | 05-26-2016 04:05 AM
 | 26721 | 05-17-2016 02:15 PM
05-17-2016
02:15 PM
24 Kudos
Hi, please perform the following actions:

1) Fill in "Source Brokers List" --> "nameOfClouderaKafkaBrokerServer.yourdomain.com:9092". This is the server (or servers) where you configured the Kafka Broker (NOT the MirrorMaker).
2) Fill in "Destination Brokers List" --> "nameOfRemoteBrokerServer.otherdomain.com:9092". This is supposed to be a remote cluster that will receive the topics sent over by your MirrorMaker. If you have one, put in that one; otherwise just put in another server in your network, whichever server. Please note that both these server names must be FQDNs and resolvable by your DNS (or hosts file), otherwise you'll get other errors. Also, the format with the trailing port number is mandatory!
3) Click "Continue". The service will NOT start (error). Do not navigate away from that screen.
4) Open another Cloudera Manager session in another browser pane. You should now see "Kafka" in the list of services (red, but it should be there). Click on the Kafka service and then "Configure".
5) Search for the "java heap space" configuration property. The default Java heap space you'll find already set should be 50 MB. Put in at least 256 MB; the original value is simply not enough.
6) Now search for the "whitelist" configuration property. In the field, put in "(?!x)x" (without the quotation marks). That's a regular expression that does not match anything (see the quick check after this list). Given that a whitelist is apparently mandatory for the MirrorMaker service to start, and assuming you don't want to replicate any topics remotely right now, just put in something that won't replicate anything, e.g. that regular expression.
7) Save the changes and go back to the original configuration screen in the other browser pane. Click "Retry" (or whatever), or even exit that screen and manually restart the Kafka service in Cloudera Manager.

That should work, at least it did for me! HTH
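Just to illustrate the claim that "(?!x)x" cannot match anything, here is a quick plain-Python check (nothing Kafka-specific, purely a sketch):

import re

# "(?!x)x" can never match: the lookahead demands that the next character
# is NOT "x", and then the pattern immediately tries to match "x" at that
# very position -- a contradiction, so no input ever matches.
pattern = re.compile(r"(?!x)x")

for text in ["", "x", "xx", "my.topic", "anything at all"]:
    assert pattern.search(text) is None

print("'(?!x)x' matched nothing, as expected")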
05-14-2016
03:39 PM
Update: if I use "spark-submit", the script runs successfully. Syntax used for "spark-submit":

spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--executor-memory 500M \
--total-executor-cores 1 \
hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py \
10

Excerpt from the output log:

16/05/15 00:30:57 INFO parse.ParseDriver: Parsing command: select * from sales_fact
16/05/15 00:30:58 INFO parse.ParseDriver: Parse Completed
16/05/15 00:30:58 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.1
16/05/15 00:30:58 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.1
16/05/15 00:30:59 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/05/15 00:30:59 INFO spark.SparkContext: Invoking stop() from shutdown hook
05-14-2016
08:53 AM
Hi all, my CDH test rig is as follows: CDH 5.5.1, Spark 1.5.0, Oozie 4.1.0. I have successfully created a simple Oozie Workflow that spawns a Spark Action using the HUE interface. My intention is to use YARN in cluster mode to run the Workflow/Action. It's a Python script, which is as follows (just a test):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
sconf = SparkConf().setAppName("MySpark").set("spark.driver.memory", "1g").setMaster("yarn-cluster")
sc = SparkContext(conf=sconf)
### (1) ALTERNATIVELY USE ONE OF THE FOLLOWING CONTEXT DEFINITIONS:
sqlCtx = SQLContext(sc)
#sqlCtx = HiveContext(sc)
### (2) IF USING HIVECONTEXT, OPTIONALLY SET THE DATABASE IN USE (SHOULDN'T BE NECESSARY):
#sqlCtx.sql("use default")
### (3) CREATE MAIN DATAFRAME. TRY EITHER SYNTAX, COMBINED WITH THE DIFFERENT CONTEXT DEFINITIONS FROM (1):
#cronologico_DF = sqlCtx.table("sales_fact")
cronologico_DF = sqlCtx.sql("select * from sales_fact")
### (4) ANOTHER DATAFRAME
extraction_cronologico_DF = cronologico_DF.select("PRODUCT_KEY")
### (5) USELESS PRINT STATEMENT:
print 'a'

When I run the Workflow, a MapReduce job is started. Shortly after, a Spark job is spawned (I can see that from the Job Browser). The Spark job fails with the following error (excerpt from the log file of the Spark action):

py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
: java.lang.RuntimeException: Table Not Found: sales_fact

This is my "workflow.xml":

<workflow-app name="Churn_2015" xmlns="uri:oozie:workflow:0.5">
<global>
<job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.env</name>
<value>SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark</value>
</property>
</configuration>
</global>
<start to="spark-3ca0"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-3ca0">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/hive-site.xml</job-xml>
<configuration>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
</configuration>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>hdfs:///user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib/test.py</jar>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

This is my "job.properties":

oozie.use.system.libpath=True
security_enabled=False
dryrun=False
jobTracker=<MY_SERVER_FQDN_HERE>:8032
nameNode=hdfs://<MY_SERVER_FQDN_HERE>:8020

Please note that:

1) I've also uploaded "hive-site.xml" to the same directory as the 2 files described above. As you can see from "workflow.xml", it should also be picked up.
2) The "test.py" script is under a "lib" directory in the Workspace created by HUE, and it gets picked up. In that directory I also took care of uploading several JARs belonging to some Derby DB connector, probably required to collect stats, to avoid other exceptions being thrown.
3) I've tried adding the workflow property "oozie.action.sharelib.for.spark" with the value "hcatalog,hive,hive2", with no success.
4) As you can see in the Python script above, I've alternately used an SQLContext or a HiveContext object inside the script. The results are the same (though the error message is slightly different).
5) The ShareLib should be OK too:

oozie admin -shareliblist
[Available ShareLib]
oozie
hive
distcp
hcatalog
sqoop
mapreduce-streaming
spark
hive2
pig

I suspect the Hive Metastore (where the table definitions live) is not being read; that's probably the issue. But I've run out of ideas and I'm not able to get it working... Thanks in advance for any feedback!
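For reference, a minimal diagnostic script along these lines (just a sketch; the app name is made up) should show whether the Oozie-launched job actually sees the cluster Metastore or only an empty local Derby one:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Diagnostic only: list what the Oozie-launched job can actually see.
sconf = SparkConf().setAppName("MetastoreCheck")
sc = SparkContext(conf=sconf)
sqlCtx = HiveContext(sc)

# If this prints no databases/tables beyond an empty "default", the job is
# most likely talking to a fresh local Derby metastore instead of the
# cluster metastore configured in hive-site.xml.
for row in sqlCtx.sql("show databases").collect():
    print(row)
for row in sqlCtx.sql("show tables").collect():
    print(row)

sc.stop()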
Labels:
- Apache Oozie
05-13-2016
03:03 AM
2 Kudos
I got past this! Still no cigar, though. Now I have another error, but I'm going to work on it. It's something different now...

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist
java.io.FileNotFoundException: File file:/hdp01/yarn/nm/usercache/admin/appcache/application_1463068686660_0013/container_1463068686660_0013_01_000001/lib/test.py does not exist

Many thanks for your help. I'd never have been able to figure this out by myself!
05-13-2016
02:42 AM
Hi Ben, thanks a whole lot for your reply. May I ask where exactly you specified that setting?

- In the GUI, in some particular field?
- In "workflow.xml", in the job's directory in HDFS? If yes: as an "arg", as a "property", or..?
- In "job.properties", in the job's directory in HDFS? If yes: how?
- In some other file, e.g. "/etc/alternatives/spark-conf/spark-defaults.conf"? If yes: how?

A snippet of your code would be extremely appreciated! I'm asking because I've tried all of the above with your suggestion but did not succeed. Thanks again for your help.
04-16-2016
09:52 AM
1 Kudo
Hello, I'm using CDH 5.5.1 with Spark 1.5.0. I'm unsuccessfully trying to execute a simple Spark action (Python script) via Oozie. For now I just want to be able to run something at all; the script is still a silly example and doesn't really do anything. It is as follows:

## IMPORT FUNCTIONS
from pyspark.sql.functions import *
## CREATE MAIN DATAFRAME
eventi_DF = sqlContext.table("eventi")

I created a simple Oozie Workflow from the Hue GUI. I used the following settings for the Spark action:

SPARK MASTER: yarn-cluster
MODE: cluster
APP NAME: MySpark
JARS / PY FILES: lib/test.py
MAIN CLASS: org.apache.spark.examples.mllib.JavaALS
ARGUMENTS: <No Arguments Defined>

I've uploaded the script to HDFS under the workspace directory "/user/hue/oozie/workspaces/hue-oozie-1460736691.98/lib", and I'm sure it gets picked up (just to understand that it was meant to be put in this directory I had to work a little, fighting a ""test.py" not found" exception, which is now gone). As of now, when I try to run the Workflow by pressing the "Play" button in the GUI, this is what I get in the action log file:

>>> Invoking Spark class now >>>
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, key not found: SPARK_HOME
java.util.NoSuchElementException: key not found: SPARK_HOME
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at org.apache.spark.deploy.yarn.Client$$anonfun$findPySparkArchives$2.apply(Client.scala:943)
at org.apache.spark.deploy.yarn.Client$$anonfun$findPySparkArchives$2.apply(Client.scala:942)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.deploy.yarn.Client.findPySparkArchives(Client.scala:942)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:630)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:124)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:914)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:973)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:185)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:176)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:49)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:236)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:378)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:296)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:181)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://mnethdp01.glistencube.com:8020/user/admin/oozie-oozi/0000000-160416120358569-oozie-oozi-W/spark-3ba6--spark/action-data.seq
Oozie Launcher ends

Now, I guess the problem is: "Failing Oozie Launcher, ... key not found: SPARK_HOME". I've been trying hard to set this SPARK_HOME key in different places. Things I've tried include the following:

- Modified the Spark config in Cloudera Manager and reloaded the configuration, via "Spark Service Environment Advanced Configuration Snippet (Safety Valve)":
SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark

- Modified the Oozie config in Cloudera Manager and reloaded the configuration, via "Oozie Service Environment Advanced Configuration Snippet (Safety Valve)":
SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark

- Used different ways of invoking the Spark action inside the Spark action definition:

SPARK MASTER: local[*]
SPARK MODE: client
SPARK MASTER: yarn-cluster
SPARK MODE: cluster
SPARK MASTER: yarn-client
SPARK MODE: client

- Modified the "/etc/alternatives/spark-conf/spark-defaults.conf" file manually, adding the following to it (on just one node, by the way):

spark.yarn.appMasterEnv.SPARK_HOME /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark
spark.executorEnv.SPARK_HOME /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark

All of the above with no success. Apparently I'm not able to set the required key anywhere. What am I doing wrong? Isn't this meant to be pretty straightforward? Thanks in advance for any insights.
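For completeness, the same two properties from the spark-defaults.conf attempt can also be set per application from inside the script via SparkConf. This is just a sketch for reference: as far as I know these properties only set SPARK_HOME inside the YARN ApplicationMaster and executor containers, not in the process that submits the job, so they are not necessarily a fix for the launcher error above:

from pyspark import SparkConf, SparkContext

# Per-application equivalent of the two spark-defaults.conf lines above.
# These only affect the environment of the AM and executor containers.
sconf = (SparkConf()
         .setAppName("MySpark")
         .set("spark.yarn.appMasterEnv.SPARK_HOME",
              "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark")
         .set("spark.executorEnv.SPARK_HOME",
              "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark"))
sc = SparkContext(conf=sconf)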
Labels:
- Apache Oozie
03-06-2016
11:13 AM
Thanks for the suggestion about registering the classes and for the additional info. I have to say that I had already read the link you sent me, but I didn't really get the meaning of "you have to register the classes first". In fact, I gave up on my attempts at the time and stuck with the Java serializer for my testing purposes. Maybe I'll need to get back to this in the future, and I'll do it with additional knowledge now. Thanks a lot. Also, I have to say that now, after reading all this, I find it a bit strange that Cloudera sets Kryo as the default serializer. Anyway.
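For anyone reading this later, my understanding of "registering the classes first" boils down to something like the sketch below (the class name is only an example of a JVM class that ends up being serialized, not a verified fix for the FPGrowth problem):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("KryoRegistrationExample")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Tell Kryo up front which JVM classes it will have to serialize,
        # instead of letting it write out full class names at runtime.
        .set("spark.kryo.classesToRegister",
             "org.apache.spark.mllib.fpm.FPGrowth$FreqItemset"))
sc = SparkContext(conf=conf)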
02-07-2016
02:13 AM
Hi GeoffD, I don't know if what follows is going to help somehow, but I've had the chance to try a few different scenarios lately, and I've noticed that a few inconsistencies can also occur with the Parquet file format. For instance: if you have data coming in from 2 different sources, e.g. a Sqoop job creating a Parquet file and a Pig job creating another Parquet output, you have to take care to define the names of the output fields as all uppercase. Also, when you create the Managed Table, you have to refer to a "LIKE PARQUET" file with uppercase field names to be inferred.

If you don't take care of all that, what I have experienced is that if you take the output files from several Pig and Sqoop procedures and put them under the table's directory, the "NULL field" issue will happen, even if the schemas are apparently identical. What I have found out is that under some conditions (e.g. when you rename fields in a Sqoop or Pig job), the resulting Parquet files will differ: the Sqoop job will ALWAYS create uppercase field names, whereas the corresponding Pig job keeps the exact case you specified inside the Pig script.

Now, I haven't dug deeper into all this, as getting to this result is enough for me. But I've started to think that in the new Spark, at some stage, something goes wrong when parsing the field names, probably due to a case issue. So maybe you'd want to experiment around this, using your CSV and plain text files too (a quick way to compare the schemas is sketched below). Maybe you'll come up with a final solution for plain text files too 🙂 HTH
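For reference, a quick way to compare the field-name casing of two Parquet outputs from pyspark (the paths are hypothetical placeholders for wherever Sqoop and Pig wrote their files):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="ParquetSchemaCompare")
sqlContext = SQLContext(sc)

# Placeholder paths: point these at the actual Sqoop and Pig output directories.
df_sqoop = sqlContext.read.parquet("hdfs:///data/sqoop_output")
df_pig = sqlContext.read.parquet("hdfs:///data/pig_output")

print(df_sqoop.columns)   # Sqoop output: field names tend to come out uppercase
print(df_pig.columns)     # Pig output: keeps whatever casing the script used

sc.stop()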
01-21-2016
02:35 AM
Hi, I'm experimenting with a small CDH virtual cluster and I'm facing issues with serializers. More specifically, I'm trying things with the "pyspark.mllib.fpm.FPGrowth" class (Machine Learning). This is what I'm trying to do:

from pyspark.mllib.fpm import FPGrowth
fp_model = FPGrowth.train(my_words_DF,0.3)
freqset = sorted(fp_model.freqItemsets().collect())

At this point, depending on which serializer I have configured in Spark, I get 2 different outcomes.

If in "Cloudera Manager --> Spark --> Configuration --> Spark Data Serializer" I configure "org.apache.spark.serializer.JavaSerializer", then everything works FINE.

If in "Cloudera Manager --> Spark --> Configuration --> Spark Data Serializer" I configure "org.apache.spark.serializer.KryoSerializer" (which is the DEFAULT setting, by the way), when I collect the "freqItemsets" I get the following exception:

com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set final scala.collection.mutable.ListBuffer field org.apache.spark.mllib.fpm.FPTree$Summary.nodes to scala.collection.mutable.ArrayBuffer

This exception is confirmed to be a consequence of an unresolved bug with using Kryo with FPGrowth, tracked in the following thread: https://issues.apache.org/jira/browse/SPARK-7483

Now my questions:

1) What is really the difference between the Kryo and Java serializers?
2) Is one serializer definitely better in most use cases, and if yes, which one?
3) Is it possible to dynamically switch between the 2 serializers without exiting my Spark session and/or changing/redeploying the Spark configuration in Cloudera Manager? How?

Thanks in advance for any insight.
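Regarding question 3, my understanding so far is that "spark.serializer" is read when the SparkContext is created, so it cannot be switched inside a running session; at best it can be chosen per application at startup, along these lines (just a sketch):

from pyspark import SparkConf, SparkContext

# Choose the serializer for this application only, instead of cluster-wide
# in Cloudera Manager. This has to happen before the SparkContext exists.
conf = (SparkConf()
        .setAppName("FPGrowthJavaSer")
        .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer"))
sc = SparkContext(conf=conf)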
01-06-2016
02:28 AM
1 Kudo
Finally found a solution (or a workaround, I don't know exactly what to call it, since apparently what I described above was not my fault in the end, but was due to something that probably broke after Spark 1.3.0). Using the "Parquet" storage file format apparently does the trick. It works smoothly and with no warnings. I think I'm going to use this file format from now on, at least until they fix the partitioned Avro thing. Hope this helps somebody in the future 🙂
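Just to make "using Parquet" concrete, this is roughly what I mean (a sketch only; the table name and partition columns are made up for illustration):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="ParquetPartitionedWrite")
sqlCtx = HiveContext(sc)

# Made-up source table and partition columns: write the data out as Parquet,
# partitioned on disk by the two columns, instead of using partitioned Avro.
df = sqlCtx.table("sales_fact")
df.write.partitionBy("YEAR", "MONTH").parquet("hdfs:///user/hive/warehouse/sales_fact_parquet")

sc.stop()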