
Run Oryx on a machine that is not part of the cluster

Explorer

Hi,

I am trying to run Oryx on a machine that is not part of the cluster...

 

 

My settings in oryx.conf (the Hadoop/HDFS part) are below. Is this the right setup?

Is there anything else I need to set in oryx.conf?

 

model=${als-model}
model.instance-dir=hdfs://name_node:8020/oryx_data
model.local-computation=false
model.local-data=false
 
 
 
Thanks.
 
 
 
1 ACCEPTED SOLUTION

Master Collaborator

That's fine. The machine needs to be able to communicate with the cluster of course. Usually you would make the Hadoop configuration visible as well and point to it with HADOOP_CONF_DIR. I think that will be required to get MapReduce to work.
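For example, a minimal sketch of launching the Computation Layer from an off-cluster machine, assuming the cluster's client configs have been copied to /etc/hadoop/conf on that machine (the path and jar name/version are illustrative):

# Point at the cluster's client configs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml)
export HADOOP_CONF_DIR=/etc/hadoop/conf
# Launch the Computation Layer against that configuration
java -Dconfig.file=oryx.conf -jar oryx-computation-1.1.0.jar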


23 REPLIES

Master Collaborator

It's pretty likely. It would not be in the logs but in the error shown on the (dead) attempt's container info screen in the history server. At least, I saw exactly the same thing, this resolved it, and I can sort of see why it is now a problem in Java 7.

Explorer

Sean,

 

I applied your changes to our code base and am still seeing a similar error (below).

I checked the job using the job tracking URL (e.g., http://server105:8088/proxy/application_1432750221048_0525/) and there is actually no failed attempt.

 

/// Logs ////

Thu May 28 07:27:57 PDT 2015 INFO Running job "Oryx-/user/xyz/int/def-1-122-Y-RowStep: Avro(hdfs://server105:8020/u... ID=1 (1/1)"
Thu May 28 07:27:57 PDT 2015 INFO Job status available at: http://server105:8088/proxy/application_1432750221048_0525/
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3

Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3

Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 2 time(s); maxRetries=3
...
Thu May 28 07:34:15 PDT 2015 INFO Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
Thu May 28 07:34:16 PDT 2015 INFO Finished Oryx-/user/xyz/int/def-1-122-Y-RowStep
Thu May 28 07:34:16 PDT 2015 INFO Completed RowStep in 379s

 

 

Master Collaborator

Just to check, you have this commit right?

 

https://github.com/cloudera/oryx/commit/4b5e557a36f3d666bab0befc21b79efdf1fcd52d

 

The symptom here is that the App Master for the MR job dies straight away and can't be contacted. The important thing is to know why. For example, when I looked at the AM app screen (i.e. http://[host]:8088/cluster/app/application_1435553713675_0018) I saw something like ...

 

Application application_1435553713675_0018 failed 2 times due to AM Container for appattempt_1435553713675_0018_000002 exited with exitCode: -104
For more detailed output, check application tracking page:http://[host]:8088/proxy/application_1435553713675_0018/Then, click on links to logs of each attempt.
Diagnostics: Container [pid=13840,containerID=container_1435553713675_0018_02_000001] is running beyond physical memory limits. Current usage: 421.5 MB of 384 MB physical memory used; 2.7 GB of 806.4 MB virtual memory used. Killing container.
...
 
Do you see anything like that that says why the AM stopped?
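If the diagnostics do show a memory kill like the example above, one common remedy (a sketch, not specific to Oryx; values are illustrative and would go in mapred-site.xml or the job configuration) is to give the MR Application Master more room:

<!-- Size of the AM container requested from YARN, in MB (illustrative value) -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<!-- Keep the AM's Java heap below the container size -->
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx768m</value>
</property>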

Explorer


Yes, I applied your commit...

I went to an example:
http://[host]:8088/cluster/app/application_1435263631757_19721
But I am still not seeing the error.

As I mentioned, the job/task does not really get killed or stopped. It just logged some retry messages (below), but it continued:

Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3

Master Collaborator

Yes, but the question is why. This is just a message from the driver program saying the master can't be found. The question is what happened to the Application Master. If you find it in YARN, can you see what happened to that container? It almost surely failed to start, but why?
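One quick way to pull the container diagnostics from the command line (a sketch; the application ID is the one from the log above, and yarn logs requires log aggregation to be enabled):

# Overall state and diagnostics string for the application
yarn application -status application_1432750221048_0525
# Aggregated container logs, including the AM's stderr/stdout
yarn logs -applicationId application_1432750221048_0525 | less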

Explorer

Sean,

 

I am not sure why.

But it seems related to the firewall.

Our Oryx server runs in one virtual LAN and talks to another, firewalled, virtual LAN.

It looks like the dynamic port is an ephemeral port, and there is a related bug:

https://issues.apache.org/jira/browse/MAPREDUCE-6338

 

Still digging into this issue.
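If the firewall is the cause, one workaround worth trying (a sketch; the range is illustrative and would also need to be opened in the firewall) is to pin the MR Application Master's job-client port to a fixed range instead of an ephemeral one, in mapred-site.xml:

<!-- Restrict the port the MR AM listens on for job-client connections (illustrative range) -->
<property>
  <name>yarn.app.mapreduce.am.job.client.port-range</name>
  <value>50100-50200</value>
</property>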

 

 

Master Collaborator

Yes, that could also be a cause. Is it possible to run the process inside the firewall? Certainly the MapReduce jobs are intended to be managed by the Computation Layer from within the cluster.

Explorer

You have talked about many issues above, but I find they are more related to Oryx 1 and MR2.

I wonder whether it is possible to run Oryx 2 outside a CDH cluster?

 

I deployed a Hadoop 2.6.0 (CDH 5.4.4) cluster with ZooKeeper, Kafka, Spark on YARN, and HDFS.

When I tried to run Oryx 2 on my laptop outside that cluster (with the same CDH version installed, but not running), the batch layer did not print what I expected:

2015-08-20 23:45:39,278 INFO  BatchLayer:82 Creating message stream from topic
2015-08-20 23:45:39,531 INFO  AbstractSparkLayer:224 Initial offsets: {[OryxInput,0]=21642186}
2015-08-20 23:45:39,610 INFO  BatchLayer:117 Starting Spark Streaming
2015-08-20 23:45:39,677 INFO  BatchLayer:124 Spark Streaming is running

 

and at the end it printed this exception:

Exception in thread "main" java.net.ConnectException: Call From m4040/192.168.88.46 to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

 

The batch and speed layer web pages showed this:

[screenshot: batch.png]

 

 

I guess my laptop could not communicate with Kafka on the cluster, and this Oryx job was rejected by YARN?
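As a side note: a connection attempt to 0.0.0.0:8032 usually means the YARN client fell back to the default ResourceManager address because yarn-site.xml was not found on the classpath. A minimal sketch of what the laptop would need, either via HADOOP_CONF_DIR or a local yarn-site.xml (hostname is illustrative):

<!-- Point the YARN client at the cluster's ResourceManager (illustrative hostname) -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm-host.example.com</value>
</property>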

 

Master Collaborator

You can run the binaries on any machine that can see the Hadoop configuration on the classpath, and which can access all of the services it needs to in the cluster. There are a number of services to talk to: HDFS, YARN, Kafka, Spark and the app's executors. So in general you'd have to have a lot of ports open, and at that point your machine is effectively a gateway node in the cluster. Certainly it's meant to be run within the cluster.

 

The serving layer only needs access to Kafka, and that's by design, so it might more easily run outside the cluster.

Explorer

Sean,

 

I tried to run Oryx on a node that is in the same LAN as the Hadoop cluster.

We tested Oryx 1 without problems (we used to have a firewall issue; after moving the node to the same LAN as the Hadoop cluster, it runs fine).

 

We just started testing Oryx 2 on the same network (that is, no firewall issues).

I do have /etc/hadoop/conf on the node where I am running Oryx 2.

However, I got the following errors when starting the Oryx 2 batch layer.

It looks like it is looking for Cloudera CDH jar files... Any thoughts? Do I need to copy the jar files over?

 

errors:

ls: cannot access /opt/cloudera/parcels/CDH/jars/zookeeper-*.jar: No such file or directory
ls: cannot access /opt/cloudera/parcels/CDH/jars/spark-assembly-*.jar: No such file or directory
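One possible workaround, purely as a sketch: those paths suggest the launch script expects the CDH parcel layout, so if the same jars are already installed elsewhere on this node you could try linking them into the expected location (paths below are illustrative; installing the actual CDH parcel or gateway packages on the node is the cleaner fix):

# Create the directory layout the script looks for and link in locally installed jars (illustrative paths)
sudo mkdir -p /opt/cloudera/parcels/CDH/jars
sudo ln -s /usr/lib/zookeeper/zookeeper-*.jar /opt/cloudera/parcels/CDH/jars/
sudo ln -s /usr/lib/spark/lib/spark-assembly-*.jar /opt/cloudera/parcels/CDH/jars/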

 

Thanks.

Jason