Support Questions

Jason.Chen · ‎12-22-2014

Hi,

I am trying to run Oryx on a machine that is not part of the cluster...

My oryx.conf settings for Hadoop/HDFS are below. Are they correct?

Is there anything else I need to set in oryx.conf?

model=${als-model}

model.instance-dir=hdfs://name_node:8020/oryx_data

model.local-computation=false

model.local-data=false

Thanks.

srowen · ‎12-23-2014

That's fine. The machine needs to be able to communicate with the cluster of course. Usually you would make the Hadoop configuration visible as well and point to it with HADOOP_CONF_DIR. I think that will be required to get MapReduce to work.
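One way to confirm that the Hadoop client configuration is actually visible on the Oryx machine is to read a property such as `fs.defaultFS` from `core-site.xml` under `HADOOP_CONF_DIR`. The sketch below is only an illustration, not part of Oryx: the helper name `read_hadoop_property` is hypothetical, and it assumes the usual Hadoop XML layout of `<property>` entries with `<name>` and `<value>` children.

```python
import os
import xml.etree.ElementTree as ET
from typing import Optional


def read_hadoop_property(conf_dir: str, filename: str, key: str) -> Optional[str]:
    """Return the value of `key` from a Hadoop-style XML config file, or None.

    Hypothetical helper for illustration; returns None if the file or key is missing.
    """
    path = os.path.join(conf_dir, filename)
    if not os.path.isfile(path):
        return None
    root = ET.parse(path).getroot()
    # Hadoop config files are a flat list of <property><name/><value/></property>.
    for prop in root.findall("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value")
    return None


if __name__ == "__main__":
    conf_dir = os.environ.get("HADOOP_CONF_DIR")
    if not conf_dir:
        print("HADOOP_CONF_DIR is not set")
    else:
        value = read_hadoop_property(conf_dir, "core-site.xml", "fs.defaultFS")
        print("fs.defaultFS =", value)
```

If the printed value does not match the NameNode address used in `model.instance-dir`, the machine is probably picking up a different or incomplete configuration directory.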

Jason.Chen · ‎04-30-2015

Sean,

A follow-up question:

I would like to know when Oryx picks up changes to the Hadoop configuration files.

As I understand it, the Hadoop config files are read when the Oryx computation and serving layers start.

So if the Hadoop configuration files change, do I need to restart the Oryx computation and serving layers to pick up the updated files?

In other words, when do the Oryx computation and serving layers read the Hadoop configuration files?

Thanks.

srowen · ‎05-01-2015

Yes, it reads them at startup, so you would need to restart the processes.

Jason.Chen · ‎06-28-2015

Sean,

I have some follow-up questions on this topic ("Run Oryx on a machine that is not part of the cluster").

We have started testing the Oryx 1.0 computation/serving layers running on VMs that are in a different virtual LAN from the Hadoop cluster. There are firewall port issues for communication between the two virtual LANs.

Therefore, we opened all the ports used by Hadoop on the Hadoop cluster's virtual LAN, so that the Oryx VMs can talk to it.

We got the list of ports used by Hadoop from the Hadoop configuration files and from some online Cloudera CDH port information.

After doing that, Oryx is able to submit jobs to the Hadoop cluster to some extent. However, it still runs into some communication issues.

For example, in the Oryx log I see something like this:

Retrying connect to server: server-name/10.190.36.113:40651. Already tried 0 time(s); maxRetries=3

Retrying connect to server: server-name/10.190.36.113:40651. Already tried 1 time(s); maxRetries=3

....

Retrying connect to server: server-name/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3

Retrying connect to server: server-name/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3

....

I dug into the code and I do not understand why this happens. My questions:

(1) Is the communication between Oryx and Hadoop bidirectional or unidirectional?

My understanding is that Oryx uses the Hadoop configuration files to find out where (server and port) it should submit jobs.

After Oryx submits a job, how does Oryx know the job is completed? Does Oryx check with Hadoop to get the status, or does Hadoop communicate back to the Oryx VM with the status?

(2) Related to (1) and the log output I posted above: are "dynamic" ports used during Oryx-Hadoop communication? In the log messages I see ports 40651 and 40915. They do not seem to be standard Hadoop ports, and the port numbers even change dynamically.

This is confusing.

Thanks.

srowen · ‎06-28-2015

It doesn't do any communication of its own; this is all traffic to/from the Hadoop cluster for HDFS and YARN. Hadoop has no idea about the Oryx process. It should be dead simple in this regard. I don't think those are well-known ports, so maybe this is it trying to talk to the YARN app that runs the MapReduce? What is failing at that point?

I would expect the serving layer to be more predictable, as it only needs to talk to HDFS, and those daemons should be on well-known ports. In any event it's "just" standard Hadoop mechanisms here, which may mean you can ask support for help with constraining the ports that are used. But in general the computation layer needs to be close to the cluster and is intended to be inside its firewall.
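To see which of the fixed ports are actually reachable through the firewall from the Oryx VM, a plain TCP connect test is often enough. This is a minimal sketch with a hypothetical helper `can_connect`; the endpoints in the example (`name_node:8020` and `server105:8088`) are simply the ones that appear in this thread and should be replaced with your own. It cannot cover ports that are chosen dynamically at job run time, since those are not known in advance.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Endpoints taken from this thread; substitute your own hosts and ports.
    endpoints = [("name_node", 8020), ("server105", 8088)]
    for host, port in endpoints:
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} {status}")
```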

Jason.Chen · ‎06-28-2015

The connection issue shows up in RowStep.

One example of the detailed log is below (I slightly modified it to hide some sensitive server info, but the main messages are kept).

One odd thing is that even though it reports that it cannot reach the server:port (say 10.190.36.114:40915), as shown below, it still eventually completes the job. I am thinking maybe it completes using other nodes on a "standard" port?

However, it is still not a good sign to see that it cannot connect to a server, because it adds unnecessary running time.

/// Logs ////

Thu May 28 07:27:57 PDT 2015 INFO Running job "Oryx-/user/xyz/int/def-1-122-Y-RowStep: Avro(hdfs://server105:8020/u... ID=1 (1/1)"
Thu May 28 07:27:57 PDT 2015 INFO Job status available at: http://server105:8088/proxy/application_1432750221048_0525/
Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3

Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3

Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: server104/10.190.36.114:40915. Already tried 2 time(s); maxRetries=3
...
Thu May 28 07:34:15 PDT 2015 INFO Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
Thu May 28 07:34:16 PDT 2015 INFO Finished Oryx-/user/xyz/int/def-1-122-Y-RowStep
Thu May 28 07:34:16 PDT 2015 INFO Completed RowStep in 379s
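To get an overview of which endpoints are being retried, the retry messages can be tallied per host and port. The following is a rough sketch, assuming the log lines have exactly the "Retrying connect to server: host/ip:port." shape shown above; the names `RETRY_RE` and `count_retries` are made up for this example.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "Retrying connect to server: server104/10.190.36.114:40915."
RETRY_RE = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)\."
)


def count_retries(lines: Iterable[str]) -> Counter:
    """Count retry messages per (host, ip, port) endpoint."""
    counts: Counter = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            key = (match.group("host"), match.group("ip"), int(match.group("port")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: "
        "server104/10.190.36.114:40915. Already tried 0 time(s); maxRetries=3",
        "Thu May 28 07:29:14 PDT 2015 INFO Retrying connect to server: "
        "server104/10.190.36.114:40915. Already tried 1 time(s); maxRetries=3",
        "Thu May 28 07:34:16 PDT 2015 INFO Completed RowStep in 379s",
    ]
    for endpoint, n in count_retries(sample).items():
        print(endpoint, n)
```

Comparing the ports collected this way against the list of ports opened in the firewall shows quickly whether the retries are all going to ports outside that list.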

srowen · ‎06-28-2015

I see the same thing now. I bet that if you click through to the failed container you see an error like

Diagnostics: Container [pid=13840,containerID=container_1435553713675_0018_02_000001] is running beyond physical memory limits. Current usage: 421.5 MB of 384 MB physical memory used; 2.7 GB of 806.4 MB virtual memory used. Killing container.

If so then at least we have the cause. I see what is failing but not yet why as there's not a good reason the AM would only be allowed 384MB. It's a YARN config thing somewhere.

srowen · ‎06-29-2015

This is the problem; fix coming momentarily:

https://github.com/cloudera/oryx/issues/114

I never saw a Snappy issue. I'm on CDH 5.4.2. Right now it seems to be running OK after the above.

Jason.Chen · ‎06-29-2015

Interesting..

Is that actually the source of the problem ?

I checked my log and there is no container error info.

As I mentioned previously, the job did complete, but it complained that it could not reach some servers during the process.