Run Oryx on a machine that is not part of the cluster

Jason.Chen — Fri, 16 Sep 2022 15:39:34 GMT

Hi,

I am trying to run Oryx on a machine that is not part of the cluster...

My settings in oryx.conf (the Hadoop/HDFS part) are below. Are these the right settings?

Is there anything else I need to set in oryx.conf?

model=${als-model}

model.instance-dir=hdfs://name_node:8020/oryx_data

model.local-computation=false

model.local-data=false

Thanks.

Re: Run Oryx on a machine that is not part of the cluster

srowen — Tue, 23 Dec 2014 08:47:11 GMT

That's fine. The machine needs to be able to communicate with the cluster of course. Usually you would make the Hadoop configuration visible as well and point to it with HADOOP_CONF_DIR. I think that will be required to get MapReduce to work.
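
As a rough illustration of "making the Hadoop configuration visible", the sketch below reads a couple of client-side settings from the directory that HADOOP_CONF_DIR points to. It is only a sanity check under assumptions: the helper names `read_hadoop_property` and `check_conf_dir` are hypothetical, and it assumes the directory holds the usual `core-site.xml` and `yarn-site.xml` files copied from the cluster. It is not part of Oryx.

```python
import os
import xml.etree.ElementTree as ET
from pathlib import Path


def read_hadoop_property(conf_dir, file_name, prop_name):
    """Return the value of prop_name from a Hadoop *-site.xml file, or None.

    Hypothetical helper: returns None if the file or the property is missing.
    """
    site_file = Path(conf_dir) / file_name
    if not site_file.is_file():
        return None
    root = ET.parse(site_file).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == prop_name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def check_conf_dir(conf_dir):
    """Collect two settings a client outside the cluster depends on."""
    return {
        "fs.defaultFS": read_hadoop_property(
            conf_dir, "core-site.xml", "fs.defaultFS"),
        "yarn.resourcemanager.address": read_hadoop_property(
            conf_dir, "yarn-site.xml", "yarn.resourcemanager.address"),
    }


if __name__ == "__main__":
    conf_dir = os.environ.get("HADOOP_CONF_DIR")
    if not conf_dir:
        print("HADOOP_CONF_DIR is not set")
    else:
        for name, value in check_conf_dir(conf_dir).items():
            print(f"{name} = {value}")
```

A value of `None` only means the property is not set explicitly in that file; Hadoop may still fall back to a built-in default.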

Re: Run Oryx on a machine that is not part of the cluster

Jason.Chen — Fri, 01 May 2015 05:58:13 GMT

Sean,

A follow up question:

I want to know when Oryx picks up changes to the Hadoop configuration files.

I mean, when the Oryx computation and serving layers start, the Hadoop config files are read.

Then, if the Hadoop configuration files change, should I restart the Oryx computation and serving layers in order to pick up the updated files?

In other words, when do the Oryx computation and serving layers read the Hadoop configuration files?

Thanks.

Re: Run Oryx on a machine that is not part of the cluster

srowen — Fri, 01 May 2015 07:33:25 GMT

Yes, it reads them at startup, so you would need to restart the processes.

Re: Run Oryx on a machine that is not part of the cluster

Jason.Chen — Sun, 28 Jun 2015 18:23:55 GMT

Sean,

I have some follow-up questions regarding this topic ("Run Oryx on a machine that is not part of the cluster").

We started to test the case where the Oryx 1.0 computation/serving layers run on VMs that are in a different virtual LAN from the Hadoop cluster. There are firewall port issues for the communication between the two virtual LANs.

Therefore, we opened all the ports used by Hadoop on the Hadoop cluster's virtual LAN, so that the Oryx VMs can talk to it.

We got the list of ports used by Hadoop from both the Hadoop configuration files and some online Cloudera CDH port information.

After doing that, Oryx is able to submit jobs to the Hadoop cluster at some level. However, it still reports some communication issues.

For example, in the Oryx log I see something like this:

Retrying connect to server: server-name/10.190.36.113:40651. Already tried 0 time(s); maxRetries=3

Retrying connect to server: server-name/10.190.36.113:40651. Already tried 1 time(s); maxRetries=3

....

Retrying connect to server: server-name/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3

Retrying connect to server: server-name/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3

....

I dug into the code and I do not understand why this happens. My questions:

(1) Is the communication between Oryx and Hadoop bidirectional or unidirectional?

My understanding is that Oryx uses the Hadoop configuration files to find out where (server and port) it should submit jobs.

After Oryx submits a job, how does Oryx know the job has completed? Does Oryx check with Hadoop to get the status, or does Hadoop communicate back to the Oryx VM with the status?

(2) Related to (1) and the log output I posted above: are "dynamic" ports used during Oryx-Hadoop communication? In the log messages I see ports 40651 and 40915. They do not seem to be standard Hadoop ports, and the port numbers even change dynamically.

This is confusing.

Thanks.

Re: Run Oryx on a machine that is not part of the cluster

srowen — Sun, 28 Jun 2015 18:33:22 GMT

It doesn't do any communication of its own; this is all traffic to/from the Hadoop cluster for HDFS and YARN. Hadoop has no idea about the Oryx process. It should be dead simple in this regard. I don't think those are well-known ports, so maybe this is it trying to talk to the YARN app that runs the MapReduce? What is failing at that point?

I would expect the serving layer to be more predictable, as it only needs to talk to HDFS and those daemons should be on well-known ports. In any event it's "just" standard Hadoop mechanisms here, which may mean you can ask support for assistance with constraining the ports that are used. But in general the computation layer needs to be close to the cluster and is intended to be inside its firewall.
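
To see which well-known endpoints are reachable through a firewall from the Oryx machine, a plain TCP probe is often enough. The following is a minimal sketch, not anything Oryx provides: the function names are hypothetical, and the endpoints in the trailing comment are just the NameNode address from the oryx.conf earlier in this thread and the ResourceManager web port seen in the logs. It cannot cover the dynamically assigned Application Master ports, since those are not known in advance.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe(endpoints, timeout=3.0):
    """Map each "host:port" string to whether it accepted a TCP connection."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in endpoints
    }


# Example call, using host names taken from this thread:
# probe([("name_node", 8020), ("server105", 8088)])
```

A successful connect only shows that the port is open from this machine; it does not prove the service behind it is healthy.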

Re: Run Oryx on a machine that is not part of the cluster

Jason.Chen — Mon, 29 Jun 2015 06:03:47 GMT

It reports the connection issue in RowStep.

One example of the detailed log is below (I slightly modified it to hide some sensitive server info, but it keeps the main messages):

One odd thing is that even though it reports that it cannot reach the server:port (say 10.190.36.114:40915) as shown below, it still eventually completes the job. I am thinking maybe it completes with other nodes on a "standard" port?

However, seeing that it cannot connect to a server is still not a good sign, because it adds unnecessary running time.

/// Logs ////

Thu May 28 07:27:57 PDT 2015 INFO Running job "Oryx-/user/xyz/int/def-1-122-Y-RowStep: Avro(hdfs://server105:8020/u... ID=1 (1/1)"
Thu May 28 07:27:57 PDT 2015 INFO Job status available at: http://server105:8088/proxy/application_1432750221048_0525/
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 2 time(s); maxRetries=3
...
Thu May 28 07:34:15 PDT 2015 INFO Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
Thu May 28 07:34:16 PDT 2015 INFO Finished Oryx-/user/xyz/int/def-1-122-Y-RowStep
Thu May 28 07:34:16 PDT 2015 INFO Completed RowStep in 379s
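
When many of these retry messages accumulate, it can help to summarize which host and port pairs the client failed to reach. Below is a small sketch that does this for lines in the format shown in the log above. The regular expression and the function name `count_retry_endpoints` are assumptions made for illustration, based only on the message format quoted in this thread.

```python
import re
from collections import Counter

# Matches e.g. "Retrying connect to server: server104/10.190.36.114:40915.
# Already tried 0 time(s); maxRetries=3"
RETRY_PATTERN = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)\. "
    r"Already tried (?P<tried>\d+) time\(s\)"
)


def count_retry_endpoints(lines):
    """Count retry messages per (host, ip, port) endpoint."""
    counts = Counter()
    for line in lines:
        match = RETRY_PATTERN.search(line)
        if match:
            endpoint = (match["host"], match["ip"], int(match["port"]))
            counts[endpoint] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: "
        "server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3",
        "Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: "
        "server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3",
    ]
    for endpoint, count in count_retry_endpoints(sample).items():
        print(endpoint, count)
```

Ports that keep changing between jobs in such a summary are a hint that they are assigned dynamically rather than configured.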

Re: Run Oryx on a machine that is not part of the cluster

srowen — Mon, 29 Jun 2015 06:29:40 GMT

I see the same thing now. I bet that if you click through to the failed container you see an error like

Diagnostics: Container [pid=13840,containerID=container_1435553713675_0018_02_000001] is running beyond physical memory limits. Current usage: 421.5 MB of 384 MB physical memory used; 2.7 GB of 806.4 MB virtual memory used. Killing container.

If so then at least we have the cause. I see what is failing but not yet why as there's not a good reason the AM would only be allowed 384MB. It's a YARN config thing somewhere.

Re: Run Oryx on a machine that is not part of the cluster

srowen — Mon, 29 Jun 2015 07:18:03 GMT

This is the problem; fix coming momentarily:

https://github.com/cloudera/oryx/issues/114

I never saw a Snappy issue. I'm on CDH 5.4.2. Right now it seems to be running OK after the above.

Re: Run Oryx on a machine that is not part of the cluster

Jason.Chen — Mon, 29 Jun 2015 07:37:14 GMT

Interesting.

Is that actually the source of the problem?

I checked my log and there is no container error info.

As I mentioned previously, the job did complete, but it complains that it cannot reach some servers during the process.

Re: Run Oryx on a machine that is not part of the cluster

srowen — Mon, 29 Jun 2015 07:41:04 GMT

It's pretty likely. It would not be in the logs but in the error shown on the attempt's (dead) container's info screen in the history server. At least, I saw the same thing exactly and this resolved it, and I can sort of see why this is now a problem in Java 7.

Re: Run Oryx on a machine that is not part of the cluster

Jason.Chen — Wed, 01 Jul 2015 15:34:55 GMT

Sean,

I applied your changes to our code base and am still seeing a similar error (as below).

I checked the job using the job tracking URL (e.g., http://server105:8088/proxy/application_1432750221048_0525/) and there is actually no failed attempt.

/// Logs ////

Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3

Re: Run Oryx on a machine that is not part of the cluster

srowen — Wed, 01 Jul 2015 16:34:40 GMT

Just to check, you have this commit right?

https://github.com/cloudera/oryx/commit/4b5e557a36f3d666bab0befc21b79efdf1fcd52d

The symptom here is that the App Master for the MR job dies straight away, and can't be contacted. The important thing is to know why. For example when I looked at the AM app screen (i.e. http://[host]:8088/cluster/app/application_1435553713675_0018) I saw something like ...

Application application_1435553713675_0018 failed 2 times due to AM Container for appattempt_1435553713675_0018_000002 exited with exitCode: -104

For more detailed output, check application tracking page:http://[host]:8088/proxy/application_1435553713675_0018/Then, click on links to logs of each attempt.

...

Do you see anything like that that says why the AM stopped?

Re: Run Oryx on a machine that is not part of the cluster

Jason.Chen — Thu, 02 Jul 2015 00:37:48 GMT

Yes, I applied your commit...

I went to an example:
http://[host]:8088/cluster/app/application_1435263631757_19721
But I am still not seeing the error.

As I mentioned, the job/task is not actually killed or stopped. It just logs some retry messages (as below), but it continues:

Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3

Re: Run Oryx on a machine that is not part of the cluster

srowen — Thu, 02 Jul 2015 07:14:47 GMT

Yes, but the question is why. This is just a message from the driver program saying the master can't be found. The question is what happened to the Application Master. If you find it in YARN, can you see what happened to that container? It almost surely failed to start, but why?

Re: Run Oryx on a machine that is not part of the cluster

Jason.Chen — Mon, 06 Jul 2015 05:19:23 GMT

Sean,

I am not sure why.

But it seems related to the firewall.

Our Oryx server runs in one virtual LAN and talks to another virtual LAN that is behind a firewall.

It looks like the dynamic ports are due to ephemeral port use and this bug:

https://issues.apache.org/jira/browse/MAPREDUCE-6338

I am still digging into this issue.

Re: Run Oryx on a machine that is not part of the cluster

srowen — Mon, 06 Jul 2015 07:51:25 GMT

Yes, that could also be a cause. Is it possible to run the process inside the firewall? Certainly the MapReduce jobs are intended to be managed by the Computation Layer from within the cluster.

Re: Run Oryx on a machine that is not part of the cluster

horatio — Fri, 21 Aug 2015 02:54:02 GMT

You have discussed many issues above, but they seem mostly related to Oryx 1 and MR2.

I wonder whether it is possible to run Oryx 2 outside a CDH cluster?

I deployed a hadoop2.6.0-CDH-5.4.4 cluster with ZooKeeper, Kafka, Spark on YARN, and HDFS.

I then tried to run Oryx 2 on my laptop, outside the cluster above (the same CDH version is deployed on the laptop but not running).

The batch layer's output was not as expected:

2015-08-20 23:45:39,278 INFO BatchLayer:82 Creating message stream from topic
2015-08-20 23:45:39,531 INFO AbstractSparkLayer:224 Initial offsets: {[OryxInput,0]=21642186}
2015-08-20 23:45:39,610 INFO BatchLayer:117 Starting Spark Streaming
2015-08-20 23:45:39,677 INFO BatchLayer:124 Spark Streaming is running

and it finally printed an exception:

Exception in thread "main" java.net.ConnectException: Call From m4040/192.168.88.46 to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

On the batch and speed web pages, it showed this:

I guess my laptop could not communicate with Kafka on the cluster, and this Oryx job was rejected by YARN?

Re: Run Oryx on a machine that is not part of the cluster

srowen — Fri, 21 Aug 2015 08:58:27 GMT

You can run the binaries on any machine that can see the Hadoop configuration on the classpath, and which can access all of the services it needs to in the cluster. There are a number of services to talk to: HDFS, YARN, Kafka, Spark and the app's executors. So in general you'd have to have a lot of ports open, and at that point your machine is effectively a gateway node in the cluster. Certainly it's meant to be run within the cluster.

The serving layer only needs access to Kafka, and that's by design, so it might more easily run outside the cluster.
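
One detail in the exception quoted above may be worth checking: `0.0.0.0:8032` is the address a Hadoop client falls back to for the YARN ResourceManager when `yarn-site.xml` does not set `yarn.resourcemanager.hostname` or `yarn.resourcemanager.address`. That suggests, though it does not prove, that the laptop was not reading the cluster's YARN configuration. The sketch below shows the fallback logic in simplified form. The function names are hypothetical, `rm.example.com` in the usage note is a placeholder, and ResourceManager HA settings and general variable substitution are deliberately ignored.

```python
import xml.etree.ElementTree as ET

# Hadoop's built-in defaults for the ResourceManager client address.
DEFAULT_RM_HOSTNAME = "0.0.0.0"
DEFAULT_RM_PORT = 8032


def load_properties(xml_text):
    """Parse the text of a Hadoop *-site.xml file into a name -> value dict."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def resource_manager_address(props):
    """Resolve the ResourceManager address roughly the way a client would.

    Simplified: only the ${yarn.resourcemanager.hostname} placeholder is
    substituted, and HA configurations are not handled.
    """
    hostname = props.get("yarn.resourcemanager.hostname", DEFAULT_RM_HOSTNAME)
    address = props.get(
        "yarn.resourcemanager.address",
        "${yarn.resourcemanager.hostname}:%d" % DEFAULT_RM_PORT,
    )
    return address.replace("${yarn.resourcemanager.hostname}", hostname)


def uses_default_rm_address(props):
    """True if the client would try to reach the RM at 0.0.0.0."""
    return resource_manager_address(props).startswith(DEFAULT_RM_HOSTNAME + ":")
```

For example, passing the contents of the `yarn-site.xml` found under HADOOP_CONF_DIR to `load_properties` and then calling `uses_default_rm_address` would flag a configuration that never names the ResourceManager, whereas one that sets the hostname to something like `rm.example.com` would not be flagged.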

Re: Run Oryx on a machine that is not part of the cluster

JasonChen — Sun, 23 Aug 2015 02:19:16 GMT

Sean,

I tried running Oryx on a node that is in the same LAN as the Hadoop cluster.

We tested Oryx 1 without problems (we used to have a firewall issue; after moving the node to the same LAN as the Hadoop cluster, it runs fine).

We have just started to test Oryx 2, using the same network (that is, no firewall issues).

I do have /etc/hadoop/config on the node where I am running Oryx 2.

However, I got the following errors when starting the Oryx 2 batch layer.

It looks like it is looking for Cloudera CDH jar files. Any thoughts? Do I need to copy the jar files over?

errors:

ls: cannot access /opt/cloudera/parcels/CDH/jars/zookeeper-*.jar: No such file or directory
ls: cannot access /opt/cloudera/parcels/CDH/jars/spark-assembly-*.jar: No such file or directory

Thanks.

Jason